# MA336 Statistics<br /><br />
### Fei Ye <br /><br /> Department of Mathematics and Computer Science<br /><br />
### December 2020

---

<div class="latex-macros">
`$$\require{color}$$`
`$$\definecolor{purple}{RGB}{226, 15, 233}$$`
`$$\definecolor{grey}{RGB}{177, 159, 149}$$`
</div>

# Textbooks

- [Concepts in Statistics](https://courses.lumenlearning.com/wmopen-concepts-statistics/)

- [Introductory Statistics](https://open.umn.edu/opentextbooks/textbooks/introductory-statistics)

---
class: center middle topic

# Statistical Studies

---

## Learning Goals for Statistical Studies

- Distinguish between a population and a sample.

- Determine whether a study is an observational study or an experiment.

- Determine the goal of a statistical study and what types of conclusions are appropriate.

- Recognize typical forms of sampling bias, such as convenience samples and voluntary response samples.

- Explain why randomization should be used and describe how to implement a randomized design: simple random sample, stratified random sample, cluster random sample, systematic random sample.

- Determine whether the conclusion of an experiment design is appropriate.

---

## The Big Picture

.footmark[
Image source: [Concepts in Statistics (lumen learning)](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/why-it-matters-why-it-matters-types-of-statistical-studies-and-producing-data/)
]

---

## Basic statistical concepts (1/3)

- **Data** consists of information from observations, counts, measurements, responses or experiments.

- A **population** is the collection of all objects that are of interest.

- A **parameter** is a number that is a property of the population.

- A **sample** is a subset of a population.

- A **statistic** is a number, such as a percentage, that represents a property of a sample.

- In statistics, a **variable** is a characteristic, or attribute of interest that we gather about individuals or objects. There are two types of variables according to their values.
  - **Categorical variables** (or qualitative variables) represent attributes, labels or nonnumerical entries, such as names and colors.
  
  - **Quantitative variables** represent numerical measurements or counts, such as weights and number of students in each class.

---

## Basic statistical concepts (2/3)

- **Example:** Determine whether each group is a population or a sample.

  1. The grades of all students in a math class.
  2. The 10 students in a math class who earned an "A".

- **Answer:**  
  1. Population,  
  2. Sample.

- **Example:** Identify the statistical concepts in the following study: To learn the percentage of students who go to school by public transportation, 500 students at a college were surveyed. 50% said they go to school by public transportation.

- **Answer:**  
  - Population: all students at the college  
  - Sample: the 500 students surveyed  
  - Parameter: the unknown percentage of all students at the college who go to school by public transportation  
  - Statistic: 50%

---

## Basic statistical concepts (3/3)

- **Example:** Identify the type of each variable.

| Variable | Type |
|---|---|
| Age | *Quantitative* |
| Hair color | *Qualitative* |
| GPA | *Quantitative* |
| Educational attainment (AS, BS, MS, etc.) | *Qualitative* |

---

## Practice: basic statistical concepts

Identify the population, the sample, the variable of study, the type of the variable, the population parameter and the sample statistic.

*An administrator wishes to estimate the passing rate of a certain course. She takes a random sample of 50 students and obtains their letter grades in that course. Among the 50 students, 32 students earned a grade of C or better.*

---

## Types of statistical studies (1/2)

- A statistical study can usually be categorized as an **observational study** or an **experiment**, according to how the study is conducted.

- An **observational study** observes individuals and measures variables of interest. The main purpose of an observational study is to describe a group of individuals or to investigate an association between two variables.

- An **experiment** intentionally manipulates one variable in an attempt to cause an effect on another variable. The primary goal of an experiment is to provide evidence for a cause-and-effect relationship between two variables.

---

## Types of statistical studies (2/2)

- **Example:** Which type of study will answer each question?

  1. What proportion of all college students in the United States have taken classes at a community college?

  2. Does use of computer-aided instruction in college math classes improve test scores?

- **Answer:** 1. Observational, 2. Experimental

See [Types of Statistical Studies (2 of 4) in the textbook Concepts in Statistics](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/types-of-statistical-studies-2-of-4/) for more examples.

---

## Practice: type of statistical study

Identify the type of statistical study:

1. *A study took a random sample of adults and asked them about their bedtime habits. The data showed that people who drank a cup of tea before bedtime were more likely to go to sleep earlier than those who didn't drink tea.*

    A. Observational  
    B. Experimental

2. *Another study took a group of adults and randomly divided them into two groups. One group was told to drink tea every night for a week, while the other group was told not to drink tea that week. Researchers then compared when each group fell asleep.*

    A. Observational  
    B. Experimental

.footmark[
Source: [Khan Academy](https://www.khanacademy.org/math/probability/study-design-a1/observational-studies-experiments/a/observational-studies-and-experiments)
]

---

## Questions about population (1/2)

| **Type of Research Question**                                | **Examples**                                                 |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| **Make an estimate about the population** (often an estimate about an *average* value or a *proportion* with a given characteristic) | What is the *average* number of hours that community college students work each week?   What *proportion* of all U.S. college students are enrolled at a community college? |
| **Test a claim about the population** (often a claim about an *average* value or a *proportion* with a given characteristic) | Is the *average* course load for a community college student greater than 12 units?   Do the *majority* of community college students qualify for federal student loans? |

---

## Questions about population (2/2)

| **Type of Research Question**                                | **Examples**                                                 |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| **Compare two populations** (often a comparison of population averages or proportions with a given characteristic) | In community colleges, do female students have a *higher* GPA than male students?   Are college athletes *more* likely than non-athletes to receive academic advising? |
| **Investigate a relationship** between two variables in the population | Is there a *relationship* between the number of hours high school students spend each week on Facebook and their GPA?   Is academic counseling *associated* with quicker completion of a college degree? |

---

## Question on cause-and-effect (1/2)

- A research question that focuses on a cause-and-effect relationship is common in disciplines that use experiments, such as medicine or psychology.
  - Does cell phone usage increase the risk of developing a brain tumor?
  - Does drinking red wine lower the risk of a heart attack?

- In a study of a relationship between two variables, one variable is the **explanatory variable**, and the other is the **response variable**.

- To establish a cause-and-effect relationship, we want to make sure the explanatory variable is the only thing that impacts the response variable.

- We therefore try to eliminate or control all other factors that might affect the response. These factors are called **confounding variables**. For example, taking medication could be a confounding variable in the second question above.

---

## Question on cause-and-effect (2/2)

**Example:** Determine whether each question is a cause-and-effect question. What are the explanatory and response variables?

1. Does use of computer-aided instruction in college math classes improve test scores?
2. Does tutoring correlate with improved performance on exams?

**Answer:**

1. This question investigates a cause-and-effect relationship. The explanatory variable is computer-aided instruction and the response variable is the test score.

2. This question investigates a correlation between variables in a population and is not a cause-and-effect question. The explanatory variable is tutoring and the response variable is exam performance.

---

## Appropriate conclusions of a study (1 of 2)

- ***In general, we should not make cause-and-effect statements from observational studies unless the impact of confounding variables can be significantly decreased.***

- **Example:** A researcher studies the medical records of 500 randomly selected patients. Based on the information in the records, he divides the patients into two groups: those given the recommendation to take an aspirin every day and those with no such recommendation. He reports the percentage of each group that developed heart disease.

Determine whether the study supports the conclusion that taking aspirin lowers the risk of heart attacks.

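- **Answer:** The conclusion claims a cause-and-effect relationship, which requires an experiment to support. This study is observational: the researcher only examined existing records and had no control over who took aspirin, so confounding variables cannot be ruled out and the conclusion is not supported.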

---

## Practice: cause-and-effect

Does higher education attainment lead to higher salary?

1. Is the question a cause-and-effect question?  
2. What are the explanatory and response variables?  
3. If a student wants to study this question, what type of statistical study can be used? What kind of conclusion can be drawn?

---

## Sampling plans

To make accurate inferences, the sample must be representative of the population.

- A **sampling plan** describes exactly how we will choose the sample.

- A sampling plan is **biased** if it systematically favors certain outcomes.

- In **random sampling**, every individual or object has an equal chance of being selected.

---

## Methods of random sampling (1/2)

- **Simple random sample**: every group of the same size is equally likely to be selected. Tables of random numbers, calculators, and software are often used to generate random numbers.
.center[ ![Random Table](data:image/png;base64,#Figures/simple-random-sample.png)]

- **Stratified random sample**: The population is first split into groups. Then subjects from each group are selected randomly.
.center[ ![Stratified Sample](data:image/png;base64,#Figures/Stratified-Random-Sample.png)]

???
Show how to generate a random number using the Excel function **RANDBETWEEN()**
In the latest version of Excel, a new function **RANDARRAY()** is available.

---

## Methods of random sampling (2/2)

- **Cluster sample**: The population is first split into groups. Then some groups are selected randomly.
.center[ ![Cluster Sample](data:image/png;base64,#Figures/Cluster-Sample.png)]

- **Systematic sample**: First, a starting number is chosen randomly. Then take every `$n$`-th piece of the data.
.center[ ![Systematic Sample](data:image/png;base64,#Figures/Systematic-Sample.png)]
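
The four sampling methods can also be sketched in code. The following is a minimal Python illustration, not part of the course's Excel labs; the population of 40 ID numbers, the four groups of 10, and all function names are assumptions made up for this example.

```python
import random

def simple_random_sample(population, k, rng):
    """Every group of k members is equally likely to be chosen."""
    return rng.sample(population, k)

def stratified_random_sample(groups, k_per_group, rng):
    """Randomly select k_per_group members from every group (stratum)."""
    return [m for group in groups for m in rng.sample(group, k_per_group)]

def cluster_sample(groups, n_groups, rng):
    """Randomly select whole groups (clusters) and keep all of their members."""
    chosen = rng.sample(groups, n_groups)
    return [m for group in chosen for m in group]

def systematic_sample(population, step, rng):
    """Pick a random starting position, then take every step-th member."""
    start = rng.randrange(step)
    return population[start::step]

rng = random.Random(2020)  # fixed seed so the sketch is reproducible
students = list(range(1, 41))  # hypothetical ID numbers 1..40
groups = [students[i:i + 10] for i in range(0, 40, 10)]  # four hypothetical groups

print("Simple:    ", simple_random_sample(students, 8, rng))
print("Stratified:", stratified_random_sample(groups, 2, rng))
print("Cluster:   ", cluster_sample(groups, 2, rng))
print("Systematic:", systematic_sample(students, 5, rng))
```

Note how the stratified sample takes a few members from every group, while the cluster sample takes every member from a few groups.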

---

## Practice: sampling methods

Determine the type of sampling method.

1. A market researcher polls every tenth person who walks into a store.

2. A researcher selects the 100 students whose student ID numbers match 100 numbers generated by a computer randomization program.

3. The first 30 people who walk into a sporting event are polled on their television preferences.

---

## Bad sampling

- Biased sampling
  - Online polls. These are examples of a voluntary response sample.
  - Mall surveys. These are examples of a convenience sample.

[See Sampling (1 of 2) in the textbook for examples](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/sampling-1-of-2/)

- Undercoverage
  - It occurs when some groups in the population are left out of the process of choosing a sample. For example, randomly surveying only math students to estimate the average GPA of a college.

---

## Appropriate sampling design

- **Example**: Suppose that you want to estimate the proportion of students at your college that use the library.

Which sampling plan will produce the most reliable results?
  
  1. Select 100 students at random from students in the library.
  
  2. Select 200 students at random from students who use the Tutoring Center.
  
  3. Select 300 students who have checked out a book from the library.
  
  4. Select 50 students at random from the college.

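- **Answer:** The 4th sampling plan is the most reliable. The first three plans undercover the college: each leaves out students outside the group being sampled, such as students who never visit the library.

In general, the larger the sample size, the more accurate the conclusion. However, we must still avoid bad sampling.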

---

## Elements of experimental design (1/2)

- **Control** reduces the effects of extraneous variables (i.e., variables other than the independent variable and the dependent variable). These extraneous variables are called lurking variables.

- Three control strategies are control groups, placebos, and blinding.

- A **control group** is a baseline group that receives no treatment or a neutral treatment.

- A neutral treatment that has no "real" effect on the dependent variable is called a **placebo**, and a participant's positive response to a placebo is called the **placebo effect**.

- **Blinding** is the practice of not telling participants whether they are receiving a placebo. **Double-blinding** is the practice of not telling either the participants or the researchers which group is receiving the treatment and which is receiving a placebo.

---

## Elements of experimental design (2/2)

- **Randomization** (random assignment of subjects to treatment groups) helps ensure that the estimate of the treatment effect is statistically valid.

- With random assignment, we can be fairly confident that any differences we observe in the responses of the treatment groups are due to the explanatory variable.

- **Replication** reduces variability in experimental results and increases their significance.

- Although randomization helps to ensure that treatment groups are as similar as possible, the results of a single experiment, applied to a small number of objects or subjects, should not be accepted without question.  
  - Any good experiment should be reproducible, and in particular, replication should yield similar results.
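
Random assignment itself is a short procedure: shuffle the participants, then split the shuffled list. The sketch below illustrates this in Python; the 20 participant labels and the function name `randomly_assign` are hypothetical and chosen only for this example.

```python
import random

def randomly_assign(participants, rng):
    """Shuffle the participants and split them into two groups of equal size."""
    shuffled = list(participants)  # copy so the original list is not changed
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(336)  # fixed seed so the sketch is reproducible
participants = [f"P{i:02d}" for i in range(1, 21)]  # 20 hypothetical participants
treatment, control = randomly_assign(participants, rng)

print("Treatment group:", sorted(treatment))
print("Control group:  ", sorted(control))
```

Because chance alone decides who lands in each group, other characteristics of the participants tend to balance out between the groups.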

???
Confounding variable vs lurking variable

- A confounding variable has at least a partial effect on the response variable.

- **Example:** In the study of the relation between a type of fertilizer and tomato size, the amount of sunshine will be a confounding variable. It contributes to the growth of tomatoes.

- A lurking variable has an effect on both the explanatory and the response variables.

- **Example:** People find that there is a positive association between the number of firefighters and the amount of damage. However, both are affected by the size of the fire.

---

## Practice: experimental design

There is an ongoing debate about how many spaces should be placed after a period in typed documents. Alana read about a study where 100 participants all read the same document typed in Courier New font. Half of the participants were randomly assigned the document with one space after each period, and the other half were given the document with two spaces after each period.

Participants who read the document with two spaces after each period were able to finish reading significantly faster than those with one space after each period. Alana concluded that using two spaces after each period will help people read all documents faster.

Is this study appropriate? Why?

.footmark[
Source: [Khan Academy](https://www.khanacademy.org/math/ap-statistics/gathering-data-ap/statistics-experiments/e/issues-experiments)
]

---

## Lab: Random numbers by Excel (1/3)

- **Example:** Randomly generate a number between 0 and 1.

  - Step 1: Choose a cell, say `A1`.

  - Step 2: Click the insert function button `$f_x$`.

  - Step 3: In the popup window, search "random" and select **RAND**.

  - Step 4: Click OK. You will get a randomly generated number.

Alternatively, you may manually enter the function `=rand()` in the cell and hit Enter.

---

## Lab: Random numbers by Excel (2/3)

- **Example:** Generate 10 random two-digit integers.

  - Step 1: Generate a random integer, say in the cell `A1`, using the Excel function `randbetween(bottom,top)`. For two-digit integers, enter `=randbetween(10,99)`.

  - Step 2: Move the mouse cursor to the lower right corner of the cell `A1`. A solid plus `+` will appear.

  - Step 3: Hold the left mouse button and drag horizontally or vertically to get 10 numbers.

Using `randbetween`, you may find that some numbers are repeated. If you are using the latest version of Excel, you may use `randarray` to generate all 10 numbers with a single formula. Note that `randarray` by itself does not rule out repeated values.

---

## Lab: Random numbers by Excel (3/3)

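- **Example:** Generate 10 random two-digit integers without repetition.

  - In a cell with at least 9 empty cells below it, say `A1`, apply the Excel function `=randarray(rows,columns,min,max,integer)`. In this case, set rows=10, columns=1, min=10, max=99, and integer to TRUE.

  - `randarray` by itself can still return repeated values. To rule out repetition, one option is to shuffle all two-digit integers and keep the first 10: `=INDEX(SORTBY(SEQUENCE(90,1,10),RANDARRAY(90)),SEQUENCE(10))`.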
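
For comparison, the three Excel examples (a number between 0 and 1, 10 two-digit integers, and 10 two-digit integers without repetition) can be sketched with Python's standard `random` module. This is only an illustration alongside the Excel lab; the variable names are made up for this example.

```python
import random

rng = random.Random()  # unseeded, so every run gives different numbers

# Similar to =RAND(): a random number at least 0 and less than 1.
u = rng.random()

# Similar to filling =RANDBETWEEN(10,99) into 10 cells: repeats are possible.
ten_with_repeats = [rng.randint(10, 99) for _ in range(10)]

# Sampling without replacement: 10 different two-digit integers.
ten_distinct = rng.sample(range(10, 100), 10)

print(u)
print(ten_with_repeats)
print(ten_distinct)
```

The last call draws from the 90 two-digit integers without replacement, which is what guarantees that no value appears twice.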

---

## Table in Excel

- **Insert a table**

  - Step 1: In the menu bar, select Insert.

  - Step 2: Look for Table and click it.

  - Step 3: In the popup window, you may enter the two diagonal cell locators. To do so, hold Shift and select the two diagonal cells of your table.

  - Step 4: Click OK. You will see the table.

- **Remark:** Tables are normally used for more than one variable, that is, for several characteristics or attributes being studied, such as attendance rate and grade. In a table, a column usually holds the entries of the data set for one variable. Rows are used to label individual entries.

---

## Insert or delete cells

- **Insert or delete cells, rows or columns**

  - Step 1: Highlight the cell(s), row, or column that you want to insert or delete by left-clicking.

  - Step 2: Right-click the highlighted cell, row or column.

  - Step 3: In the popup window, select Insert or Delete and follow the instructions.

---

## Install the Analysis ToolPak

- We will use the Analysis ToolPak frequently for analyzing data.

- To install the add-in `Analysis ToolPak`:

  - Step 1: In the Excel menu bar, select File.

  - Step 2: Click Options.

  - Step 3: In the popup window, click Add-ins.

  - Step 4: In the new display, look for Manage: Excel Add-ins and click Go next to it.

  - Step 5: In the new popup window, select Analysis ToolPak and then click the OK button.

---
class: center middle topic

# Summarizing Data Graphically

---

## Learning Goals for Summarizing Data Graphically

- Create and interpret graphs (dot plots, pie charts, histograms, or boxplots) as a means of summarizing and communicating data meaningfully.

- Calculate and explain the purpose of measures of location (mean, median) and variability (standard deviation, interquartile range).

- Explain the impact of outliers on summary statistics such as mean, median and standard deviation.

---

## Distribution of Quantitative Data

- In data analysis, one goal is to describe **patterns** (known as the **distribution**) of the variable in the data set and create a useful summary about the set.

- To describe patterns in data, we use descriptions of **shape**, **center**,
and **spread**. We also describe exceptions to the pattern. We call these exceptions **outliers**.

.center[
  ![:resize Concepts used in the description of a distribution, 70%](data:image/png;base64,#Figures/Graph-of-Distribution.png)
]

---

## Dot Plots

- A **dot plot** includes all values from the data set, with one dot for each occurrence of an observed value from the set.

**Example:** The data set contains the petal lengths of 15 iris flowers. Create a dot plot to describe the distribution of petal lengths.

**Solution:** For each number in the data set, we draw a dot. We stack dots of the same value from the bottom up.

.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-3-1.png" width="432" />
]
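
The stacking idea can be sketched as a small text-based dot plot in Python. The eight petal lengths below are hypothetical (the slide's data set is not listed here), and the function name `text_dot_plot` is an assumption for this example. As a simplification, the axis shows only the observed values, so gaps between values are not drawn to scale.

```python
from collections import Counter

def text_dot_plot(values):
    """Return a text dot plot: one '*' per observation, stacked from the bottom up."""
    counts = Counter(values)          # how many times each value occurs
    ticks = sorted(counts)            # distinct values, used as axis labels
    width = max(len(str(t)) for t in ticks)
    rows = []
    # Build the rows from the tallest stack down to the first level.
    for level in range(max(counts.values()), 0, -1):
        marks = ("*" if counts[t] >= level else " " for t in ticks)
        rows.append(" ".join(m.center(width) for m in marks).rstrip())
    rows.append(" ".join(str(t).center(width) for t in ticks))
    return "\n".join(rows)

lengths = [1.4, 1.5, 1.3, 1.5, 1.4, 1.6, 1.5, 1.4]  # hypothetical petal lengths
print(text_dot_plot(lengths))
```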

---

## Practice: Heights Of Cherry Trees

The data set contains the heights of 20 Black Cherry Trees. Create a dot plot to describe the distribution of the heights.

---

## Pie Charts (1/2)

- A **pie chart** is a circle divided into sectors, where each sector represents a category and the area of each sector is proportional to the frequency of the category.
  - The **frequency** of a category is the number of occurrences of elements in the category.
  - The proportion of a frequency to the size of the population or the sample is also called the **relative frequency**.

**Example:** The counts of majors of 100 students in a sample are shown in the table. Use a pie chart to organize the data.

.center[
| Major | Frequency (Counts) |
| ----- | ------ |
| Art     | 30      |
| Engineering | 50      |
| Science    | 20      |
]

---

## Pie Charts (2/2)

**Solution:**

- The relative frequency (percent) of each major is shown in the following table.

.center[
| Major       | Frequency| Relative Frequency|
| ----------- | -------------------- | ------------------------------- |
| Art         | 30                   | 30%                             |
| Engineering | 50                   | 50%                             |
| Science     | 20                   | 20%                             |
]

- The following shows the pie chart.  
.center[
  ![:resize A pie chart, 35%](data:image/png;base64,#Figures/pie-chart.svg)
]
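
The arithmetic behind the chart can be checked with a short Python sketch. It computes each major's relative frequency and the central angle of its sector, using the counts from the table; the function name `pie_sectors` is made up for this example.

```python
def pie_sectors(counts):
    """Map each category to (relative frequency, sector angle in degrees)."""
    total = sum(counts.values())
    return {name: (n / total, 360 * n / total) for name, n in counts.items()}

majors = {"Art": 30, "Engineering": 50, "Science": 20}  # counts from the table

for name, (relative, angle) in pie_sectors(majors).items():
    print(f"{name:<12} {relative:.0%}  {angle:.0f} degrees")
```

Since a sector's area is proportional to its central angle, a category with relative frequency 50% takes up half of the circle.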

---

## Practice: Passengers on Titanic

The following data table summarizes the passengers on the Titanic. Use a pie chart to describe the data table.

| Class | Passengers |
|:-----:|:----------:|
|  1st  |    325     |
|  2nd  |    285     |
|  3rd  |    706     |
| Crew  |    885     |

---

## Histograms

- A **histogram** divides values of a variable into equal-sized intervals called **bins**  (classes in some books) and uses rectangular bars to show the **frequency (count)** of observations in each interval.

- A **frequency distribution** is a table that contains bins, frequencies, and/or **relative frequencies**, which are proportions (percentages) defined by the formula
  $$
    \text{Relative frequency} =\frac{\text{Class frequency}}{\text{Sample size}}.
  $$

- Each bin has a **lower bin limit**, which is the left endpoint of the interval, and an **upper bin limit**, which is the right endpoint of the interval.
  
- The **bin width** is the distance between the lower (or upper) bin limits of two consecutive bins.

- The difference between the maximum and the minimum data entries is called the **range**.
  
- The **midpoint** of a bin is the half of the sum of the lower and upper limits of the bin.

???
Dot plots work well with small data sets because each distinct value acts as its own bin, containing all entries with that value.
  
---

## Example: Histogram of mpg (1 of 2)

The following data set shows the mpg (miles per gallon) of 30 cars. Construct a frequency table and a frequency histogram for the data set using a bin width of 4.

.center[
21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7
]

**Solution:**

- Find the maximum, minimum and range of the data set. In this example, the minimum is 10.4, the maximum is 33.9, and the range is 33.9-10.4=23.5.

- Determine the number of bins by rounding up `$\frac{\text{range}}{\text{bin width}}$`. In this example, the number of bins is `$\lceil \frac{23.5}{4}\rceil=6$`.

- Choose a starting point as the first lower bin limit. A **convenient starting point** is a value less than the minimum that has *more decimal places than the data set*. For example, in this data set, we may start with 10.35, then add the bin width to get all lower bin limits: 10.35, 14.35, 18.35, 22.35, 26.35, and 30.35.

---

## Example: Histogram of mpg (2 of 2)

**Solution (continued):**

- The upper bin limit can be taken as the next lower bin limit. In this example, the upper bin limits can be taken as 14.35, 18.35, 22.35, 26.35, 30.35 and 34.35.

- Record counts in bins and create the frequency distribution table.

- Graph the histogram using the frequency distribution table.

.pull-left[
<table>
 <thead>
  <tr>
   <th style="text-align:center;"> Bin </th>
   <th style="text-align:center;"> Frequency </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 10.35-14.35 </td>
   <td style="text-align:center;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 14.35-18.35 </td>
   <td style="text-align:center;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 18.35-22.35 </td>
   <td style="text-align:center;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 22.35-26.35 </td>
   <td style="text-align:center;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 26.35-30.35 </td>
   <td style="text-align:center;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 30.35-34.35 </td>
   <td style="text-align:center;"> 4 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-8-1.png" width="432" />
]
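
The whole procedure for the mpg example can be retraced in a short Python sketch: it computes the range, the number of bins, the bin limits, and the frequency of each bin. The variable names are made up for this example, and the starting point 10.35 is the one chosen in the solution.

```python
import math

mpg = [21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2,
       17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9,
       21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7]

bin_width = 4
data_range = max(mpg) - min(mpg)              # 33.9 - 10.4
num_bins = math.ceil(data_range / bin_width)  # round up range / bin width

start = 10.35  # first lower bin limit, one more decimal place than the data
lower = [round(start + i * bin_width, 2) for i in range(num_bins)]
upper = [round(lo + bin_width, 2) for lo in lower]

# Count the entries that fall in each bin.
freq = [sum(lo <= x < up for x in mpg) for lo, up in zip(lower, upper)]

for lo, up, f in zip(lower, upper, freq):
    print(f"{lo:.2f}-{up:.2f}: {f}")
```

Because the bin limits have one more decimal place than the data, no entry can fall exactly on a limit, so it does not matter which endpoint of each bin is included.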

---

## Some Remarks on Histogram

- Avoid bin widths that are too large or too small. [See Histogram 2 of 4 in Concepts in Statistics for an interactive demonstration](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/histograms-2-of-4/)

- When the bin width is not given, we may first determine the number of bins. There are different approaches. For example, the Rice rule takes the number of bins to be `$k = \lceil 2n^{1/3}\rceil$`, where `$n$` is the number of data entries and `$\lceil 2n^{1/3}\rceil$` is `$2n^{1/3}$` rounded up.

- If the number of bins is `$k$`, then we choose as the bin width a number, with the same number of decimal places as the data or one more, that is greater than `$\frac{\text{range}}{k}$` but no more than `$\frac{\text{range}}{k-1}$`.

- The area of a bar represents the relative frequency for the bin. There should be no space between any two bars.

- **Bar charts** are usually used to compare data from different categories. A histogram should not be drawn as a bar chart.

- See the [Statistics How To](https://www.statisticshowto.datasciencecentral.com/choose-bin-sizes-statistics/) page for more discussion on choosing bin width.
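
As a quick numerical check of the Rice rule and the bin width bounds above, here is a minimal Python sketch. The function names are made up for this example; the values `n = 30` and range 23.5 come from the mpg data set.

```python
import math

def rice_rule(n):
    """Number of bins suggested by the Rice rule: 2 * n^(1/3), rounded up."""
    return math.ceil(2 * n ** (1 / 3))

def bin_width_bounds(data_range, k):
    """The bin width should be greater than range/k and at most range/(k-1)."""
    return data_range / k, data_range / (k - 1)

n, data_range = 30, 23.5  # size and range of the mpg data set
k = rice_rule(n)
low, high = bin_width_bounds(data_range, k)
print(f"Rice rule: {k} bins; bin width between {low:.3f} and {high:.3f}")
```

For the mpg data the Rice rule suggests 7 bins, one more than the 6 bins obtained from the given bin width of 4, which illustrates that different approaches can lead to different histograms of the same data.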

???
Show a bar chart to students in Excel.

---

## Practice: Petal lengths of irises

The following data set shows the petal lengths of 20 irises. Construct a frequency table and a frequency histogram for the data set using 6 bins.

.center[
1.4, 5.4, 1.2, 4.5, 6.1, 1.5, 4.7, 1.4, 5.6, 5.2, 1.3, 6.3, 5.1, 5.6, 5, 6.7, 1.4, 1.6, 1.5, 1.5
]

---

## Common Descriptions of the Shape of a Distribution

- **Right skewed** (reverse `$J$`-shaped): A right-skewed distribution has a lot of data at lower variable values with smaller amounts of data at higher variable values. (Example: the histogram example.)

- **Left skewed** (`$J$`-shaped): A left-skewed distribution has a lot of data at higher variable values with smaller amounts of data at lower variable values.

- **Symmetric with a central peak (bell-shaped)**: A central peak with a tail in both directions. A bell-shaped distribution has a lot of data in the center with smaller amounts of data tapering off in each direction. (Example: the petal length example.)

- **Uniform**: A rectangular shape, with the same amount of data for each variable value.

- For examples of left-skewed and uniform distributions, please see the example in [Dotplots (2 of 2) in Concepts in Statistics](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/dotplots-2-of-2/)

---

## Practice: Shapes of Distributions

Statistics are used to compare and sometimes identify authors. The following lists show a simple random sample that compares the letter counts for three authors.

Terry: 7, 9, 3, 3, 3, 4, 1, 3, 2, 2

Davis: 3, 3, 3, 4, 1, 4, 3, 2, 3, 1

Maris: 2, 3, 4, 4, 4, 6, 6, 6, 8, 3

Create a dot plot for each sample and describe the shape of the distribution of each sample.

.footmark[
Source: [Example 2.7.1, OpenStax Introductory Statistics](https://stats.libretexts.org/Bookshelves/Introductory_Statistics)
]

---

## Mean and Median for Distributions in Different Shapes

## Practice: Choose Appropriate Measure of Center

A student survey was conducted at a major university. The following histogram shows the distribution of the number of alcoholic beverages consumed in a typical week.

1. What is the typical number of drinks a student has during a week?
2. Do the data suggest that drinking is a problem at this university?
.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-10-1.png" width="864" />
]
.footmark[*The red line is over the median and the blue line is over the mean.*]

---

## Lab: Create Frequency Tables (1 of 2)

In Excel, to create a frequency table for a data array, we need a bin array, which is used to split the data set into smaller intervals. The values in a bin array in Excel are the (upper) boundaries of the intervals. With a data array and a bin array, we can use the Excel function `FREQUENCY(data_array, bins_array)` to create a frequency table.

Suppose the data set is in column A and the bin array is in column B. Here is how to create a frequency table using the function `FREQUENCY(data_array, bins_array)`:

1. In column C, in the cell to the right of the smallest value of the bin array, enter `=FREQUENCY(`
2. Select the data values.
3. In the formula bar, enter a comma `,`
4. Select the bin array.
5. In the formula bar, enter `)`.

Hit `Enter` and you will get a frequency table.
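
To see how the bin array is interpreted, the counting rule can be imitated in Python. The sketch below is an illustration, not Excel's implementation: each bin value is treated as an inclusive upper boundary, and one extra count is returned for values above the largest boundary. It assumes the bin values are listed in increasing order; the data and bin values are hypothetical.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Count the values in each bin; each bin value is an inclusive upper boundary.

    Assumes `bins` is in increasing order. Returns len(bins) + 1 counts;
    the last one counts the values above the largest boundary.
    """
    counts = [0] * (len(bins) + 1)
    for x in data:
        # Position of the first boundary that is greater than or equal to x.
        counts[bisect_left(bins, x)] += 1
    return counts

data = [1, 2, 2, 3, 5, 8]  # hypothetical data array
bins = [2, 4, 6]           # hypothetical bin array (upper boundaries)
print(frequency(data, bins))  # [3, 1, 1, 1]
```

Here the two entries equal to the boundary 2 are counted in the first bin, and the entry 8 lands in the extra count because it is larger than every boundary.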

---

## Lab: Creating Charts in Excel

Excel has many built-in chart functions. To create a chart:

1. Select the data array/table.
2. Under the `Insert` tab, click on an appropriate chart in the `Charts` command set.

The appearance of a chart can be changed after it is created.

---

## Lab: Create Histogram Charts in Excel (1 of 3)

1. Select the data

2. On the `Insert` tab, in the `Charts` group, from the `Insert Statistic Chart` dropdown list, select `Histogram`.

**Note:** The histogram contains a special first bin which always contains the smallest number. This is different from many textbooks.

---

## Lab: Create Histogram Charts in Excel (2 of 3)

**Formatting the histogram chart** is similar to formatting a pie chart. For example, you can change the bin width from `Format Axis`.

1. Right-click on the horizontal axis and choose `Format Axis` in the popup menu.

2. In the `Format Axis` pane, on the `Axis Options` tab, you may try different options for bins.

---

## Lab: Create Histogram Charts in Excel (3 of 3)

**Remark:**

- Excel uses a different convention to create histograms. The first bin is a closed interval and the other bins are left-open, right-closed intervals.

- If you select the **Overflow bin** checkbox and type a number, all values above this number will be added to the last bin.

- If you select the **Underflow bin** checkbox and type a number, all values below or equal to this number will be added to the first bin.

- Histograms show the shape and the spread of quantitative data. For categorical data, discrete by its definition, bar charts are usually used to represent category frequencies.

---

## Lab: Create Histogram Charts in Excel using the `Analysis ToolPak`

Suppose your data set is in `Column A` in Excel.

- In the cell `B1`, put the *first lower bin limit*, which is a number slightly less than the minimum but has more decimal places than the data set.

- Create upper bin limits in column C.

- In Data menu, look for the Data Analysis ToolPak (if not, go to File > Options > Add-ins > Manage Excel Add-ins, check Analysis ToolPak). In the popup windows, find Histogram.

- In the input range, select your data set. In the bin range, select upper bins.

- Check Chart Output and hit OK. You will see the frequency table and histogram in Sheet 2.

- Change the gap between bars. Right click a bar and choose `Format Data Series...` and change the `Gap Width` to 2% or 1%.
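
The counting rule behind the frequency table can be sketched in Python. This is a minimal sketch, assuming the convention that a value is counted in the first bin whose upper limit is greater than or equal to it, and that anything above the last limit goes to a final "More" row. The sample and the bin limits are made up for illustration.

```python
from bisect import bisect_left

def bin_counts(data, upper_limits):
    """Count values per bin, where bin i holds values in (previous limit, upper_limits[i]]."""
    limits = sorted(upper_limits)
    counts = [0] * (len(limits) + 1)   # the last slot plays the role of the "More" row
    for value in data:
        # bisect_left returns the index of the first limit that is >= value
        counts[bisect_left(limits, value)] += 1
    return counts

# Hypothetical sample and upper bin limits (not from the slides)
sample = [12, 15, 20, 20, 23, 30, 31, 38, 40, 47]
uppers = [20, 30, 40]
counts = bin_counts(sample, uppers)
for limit, count in zip(uppers + ["More"], counts):
    print(limit, count)
```

A value equal to an upper limit, such as 20 here, is counted in the bin that ends at that limit, not in the next one.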

---

## Lab: How to Create a Dotplot in Excel

- If you have a raw data set, follow the same procedure as for creating a histogram, but with a bin width equal to the precision of the data. For example, if your data set consists of integers, choose 1 as the bin width.

- Change the format of bars in the histogram.

- Right click a bar and select `Format Data Series...`.
  
  - Find `Fill & Line` and select both `Picture or texture fill` and `Stack and Scale with`.
  
  - Click the button `Online...`, type *dot* in the `Search Bing` box and hit enter.
  
  - Select a picture you like and you will get a dot-plot.

---

## Lab: Practice

Use Excel to complete the following tasks:

1. Create a random sample of 30 two-digit integers.

2. Create a histogram with 6 bins for the sample of 30 two-digit integers.

3. Create a dot plot for the sample of 30 two-digit integers.

Describe the shape of the distribution of the sample of 30 two-digit integers.

---
class: center middle topic

# Measures of Central Tendency and Spread

---

## Learning Goals for Measures of Central Tendency and Spread

- Create and interpret graphs (dot plots, pie charts, histograms, or boxplots) as a means of summarizing and communicating data meaningfully.

- Calculate and explain the purpose of measures of location (mean, median), variability (standard deviation, interquartile range).

- Explain the impact of outliers on summary statistics such as mean, median and standard deviation.

---

## Measures of Center

- **Mean**: The mean is the average, that is, the sum of all values divided by the number of values.

- **Median**: The median is the middle of
the data when all the values are listed in order. The median divides the data into
two equal-sized groups.

- Use the *mean* as a measure of center only for distributions that are *reasonably symmetric* with a central peak. When outliers are present, the mean is not a good choice.

- Use the *median* as a measure of center for all *other cases*.
  
- We need to use a graph to determine the shape of the distribution. So graph the data first.

- Check the webpage on [Skewness of Relative Frequency Histograms](https://saylordotorg.github.io/text_introductory-statistics/s06-02-measures-of-central-location.html) to see the positions of mean and median.

---

## Notations and Calculations about Mean

- Sigma notation: in math, we denote the sum of values  `$x_1$`, `$x_2$`, `$\dots$`, `$x_n$` of a variable `$x$` by `$\sum_{i=1}^n x_i$` or simply by `$\sum x$`.

- The **population mean** is `$\mu= \frac{\sum x}{N}$`, where `$N$` is the **population size**, i.e., the number of elements in the population.

The notation `$\mu$` reads as mu.

- The **sample mean** is `$\bar{x}=\frac{\sum{x}}{n}$`, where `$n$` is the **sample size**. The notation `$\bar{x}$` reads as `$x$`--bar.

---

## Example: Mean city mpg

Find the mean city mpg for a sample of 10 cars.
.center[
18, 21, 20, 21, 16, 18, 18, 18, 16, 20
]

**Solution:** The mean is

`$$\bar{x}=\frac{18+21+20+21+16+18+18+18+16+20}{10}=18.6.$$`

The mean mpg of the 10 cars is 18.6 mpg.

---

## Weighted Mean

- The weighted mean of a set of numbers `$\{x_1, \dots, x_n\}$` with weights `$w_1$`, `$w_2$`, ..., `$w_n$` is defined as  `$$\frac{\sum w_ix_i}{\sum w_i}.$$`

- The mean of a frequency table is the weighted mean `$\bar{x}=\frac{\sum f x}{n}$`, where `$x$` is a value with frequency `$f$` and `$n=\sum f$` is the sample size.
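
The two formulas can be checked against each other in Python. The sketch below reuses the city mpg sample from the earlier example; the helper name `weighted_mean` is only an illustration, not a library function.

```python
from collections import Counter

def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# City mpg sample from the "Mean city mpg" example
mpg = [18, 21, 20, 21, 16, 18, 18, 18, 16, 20]

# Build the frequency table: distinct value -> frequency
freq = Counter(mpg)
values = list(freq.keys())
freqs = list(freq.values())

table_mean = weighted_mean(values, freqs)   # frequencies act as the weights
plain_mean = sum(mpg) / len(mpg)            # ordinary sample mean
print(round(table_mean, 1), round(plain_mean, 1))
```

Both computations give the same value because each distinct value is counted as many times as it occurs.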

---

## Example: Course overall grade

In a course, the overall grade is determined in the following way: the homework average counts for 10%, the quiz average counts for 10%, the test average counts for 50%, and the final exam counts for 30%. What is the overall grade of a student who earned 92 on homework, 95 on quizzes, 90 on tests, and 93 on the final?

**Solution:** The overall grade is the weighted mean

`$$\frac{\sum w_ix_i}{\sum w_i}=\frac{0.1\cdot 92+0.1\cdot 95+0.5\cdot 90+0.3\cdot 93}{0.1+0.1+0.5+0.3}=91.6.$$`

???
Show how to use Excel

---

## Practice: Mean petal width

Find the average petal width for a sample of 10 iris flowers.
  
.center[
1.1, 0.2, 0.2, 1.2, 1.3, 0.2, 1.5, 1.9, 1.5, 1.8
]

---

## Practice: Calculate a mean using the weighted mean formula

Find the mean from the dot plot of sepal length for a sample of 10 iris flowers.
.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-13-1.png" width="576" />
]

---

## Practice: Estimate the mean from a histogram

Estimate the average highway mpg using the histogram of a sample of 20 cars.

.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-14-1.png" width="576" />
]

---

## Practice: Weighted mean - calculate final grade

---

## Median, Quartiles, Interquartile Range and Outliers

- The three **quartiles**, Q1, Q2, and Q3 are numbers in an ordered data set that divide the data set into four equal parts. The second quartile is known as the **median**.

- **Interquartile Range (IQR for short)** is the measure of variation when using the median to measure center. It is defined as the difference of the third and the first quartiles: IQR=Q3-Q1.

- When the center and the spread are measured by the median and the IQR, a value in the data is considered an **outlier** if the value is
  - greater than Q3 + 1.5 `$\cdot$` IQR or
  - less than Q1 − 1.5 `$\cdot$` IQR.

**Note:** An outlier in this definition is also called a **mild outlier**. A value that is less than Q1 − 3 `$\cdot$` IQR or greater than Q3 + 3 `$\cdot$` IQR is called an **extreme outlier**.

- The minimum, Q1, Q2, Q3 and maximum are known as the "**five-number summary**" of the data set.

- The difference of maximum and minimum is called the **range**.
  
---

## Example: Median, IQR and Outliers

Find the median, quartiles, IQR and outliers (if they exist) of the sample height of 15 trees.

**Solution:**

- Sort the data set from small to large.
.center[
63, 65, 66, 69, 70, 72, 75, 75, 75, 76, 76, 79, 80, 81, 83
]
- Find the median, i.e. Q2. The sample size is 15. The middle of the ordered data set is the `$\lceil 15/2 \rceil=8$`-th number, which is 75.
- Find Q1 and Q3. Q1 is the median of the 7 values before the median position, and Q3 is the median of the 7 values after it. In this example, Q1 is the 4th number, 69, and Q3 is the 4th number from the end, 79.
- IQR=Q3-Q1=79-69=10.
- Since Q1-1.5IQR=69-1.5 `$\cdot$` 10=54 and Q3+1.5IQR=79+1.5 `$\cdot$` 10=94, and all values lie between 54 and 94, there is no outlier in this sample.
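
The same steps can be sketched in Python. This is a minimal sketch, assuming the "median of each half" convention used in this solution; spreadsheet and statistics software may interpolate differently and report slightly different quartiles. The function name `quartiles_by_halves` is only an illustration.

```python
from statistics import median

def quartiles_by_halves(data):
    """Return Q1, Q2, Q3, where Q1 and Q3 are the medians of the lower and upper halves.
    For an odd sample size the middle value is left out of both halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

heights = [63, 65, 66, 69, 70, 72, 75, 75, 75, 76, 76, 79, 80, 81, 83]
q1, q2, q3 = quartiles_by_halves(heights)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr     # values below this are outliers
high_fence = q3 + 1.5 * iqr    # values above this are outliers
outliers = [x for x in heights if x < low_fence or x > high_fence]
print(q1, q2, q3, iqr, low_fence, high_fence, outliers)
```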

---

## Practice: Range and IQR

---

## Box Plot

- A **box plot** shows a "five-number summary" of the data set. It contains a box, two whiskers and dots (for outliers).

- To create the boxplot for a distribution,
  
  - Draw a box from Q1 to Q3.
  
  - Draw a vertical line in the box at the median.
  
  - Extend a tail from Q1 to the smallest value that is not an outlier and from Q3 to the largest value that is not an outlier.
  
  - Indicate outliers with a solid dot.

---

## Example: Box plot - ages of Best Actor Oscar winners

Create the boxplot for the ages of 32 Best Actor Oscar winners (1970–2001).

.center[
31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42, 43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76
]

**Solution:** We may use Excel to find the five-number summary.

- Q2=42.5, Q1=37.5, Q3=49.5, IQR=12, 1.5IQR=18, Q1-1.5IQR= 19.5, Q3+1.5IQR=67.5

- The smallest number that is not an outlier is 31. The largest number that is not an outlier is 61. These two numbers bound the whiskers.

- There is an outlier 76.

- The boxplot is shown below.
  .center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-18-1.png" width="720" />
]
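
The whisker ends and the outliers can be checked with a short Python sketch. It takes Q1 and Q3 from the solution as given rather than recomputing them, since software may use a different quartile convention.

```python
ages = [31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42,
        43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76]

q1, q3 = 37.5, 49.5            # quartiles taken from the solution
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr     # values below this are outliers
high_fence = q3 + 1.5 * iqr    # values above this are outliers

inside = [a for a in ages if low_fence <= a <= high_fence]
outliers = [a for a in ages if a < low_fence or a > high_fence]

# The whiskers end at the most extreme values that are not outliers
whisker_low, whisker_high = min(inside), max(inside)
print(whisker_low, whisker_high, outliers)
```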

---

## Practice: the five number summary and the boxplot

---

## Measure of Variation about Population Mean

- The **deviation** of an entry `$x$` in a population data set is the difference `$x-\mu$`, where `$\mu$` is the mean of the population.
  
- The **population variance** of a population of `$N$` entries is defined as
  $$
    \text{VAR.P}=\sigma^2=\dfrac{\sum(x-\mu)^2}{N}.
  $$

- The **population standard deviation** is
  $$
    \text{STDEV.P}=\sigma=\sqrt{\dfrac{\sum(x-\mu)^2}{N}}.
  $$

---

## Measure of Variation about Sample Mean

- The **deviation** of an entry `$x$` in a sample data set is the difference `$x-\bar{x}$`, where `$\bar{x}$` is the mean of the sample.

- The **sample variance** and **sample standard deviation** are defined similarly
  $$
    \text{VAR.S}=s^2=\dfrac{\sum(x-\bar{x})^2}{n-1}, \qquad
    \text{STDEV.S}=s=\sqrt{\dfrac{\sum(x-\bar{x})^2}{n-1}},
  $$
  where `$n$` is the sample size.

- **Rounding rule:** for the mean, variance and standard deviation, we keep one more decimal place than the data have.

**Note:** To measure the spread, one may also use the **mean absolute deviation**
`$$MAD=\dfrac{\sum |x-\bar{x}|}{n}.$$`
However, the standard deviation has better properties in applications.
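
The definitions can be compared side by side in Python. The sketch below computes the population standard deviation, the sample standard deviation and the mean absolute deviation directly from their formulas for the city mpg sample used earlier, and checks the first two against Python's `statistics` module. The function name `spread_measures` is only an illustration.

```python
from math import isclose, sqrt
from statistics import pstdev, stdev

def spread_measures(data):
    """Return (population SD, sample SD, mean absolute deviation) from the definitions."""
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)   # sum of squared deviations
    sigma = sqrt(ss / n)                      # divide by N for a population
    s = sqrt(ss / (n - 1))                    # divide by n - 1 for a sample
    mad = sum(abs(x - mean) for x in data) / n
    return sigma, s, mad

mpg = [18, 21, 20, 21, 16, 18, 18, 18, 16, 20]
sigma, s, mad = spread_measures(mpg)
print(round(sigma, 2), round(s, 2), round(mad, 2))

# The library functions follow the same two conventions
print(isclose(sigma, pstdev(mpg)), isclose(s, stdev(mpg)))
```

Dividing by `$n-1$` instead of `$n$` always makes the sample standard deviation slightly larger than the population formula applied to the same numbers.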

???
Show how to use Excel to find SD

---

## Example: Standard deviation - ages of Oscar winners

Find the mean and standard deviation of the ages of a sample of 32 Best Actor Oscar winners (1970–2001).

.center[
31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42, 43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76
]

**Solution:** We use the Excel functions `AVERAGE()` and `STDEV.S()` to find the mean and sample standard deviation respectively.
The mean is 44.7. The sample standard deviation is 10.3.
<iframe scrolling="no" title="" src="https://www.geogebra.org/material/iframe/id/DS6PUaXy/width/1300/height/800/border/888888/sfsb/true/smb/false/stb/false/stbh/false/ai/false/asb/false/sri/false/rc/false/ld/false/sdz/false/ctl/false" width="120%" height="35%" style="border:0px;"></iframe>

---

## Practice: Standard deviation

The GPAs of a *sample* of ten students randomly chosen from a college are recorded as follows.
.center[1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33]

Find the standard deviation of this sample.

---

## Mean and Standard Deviation under Linear Transformation

- When we add a fixed number `$c$` to every value in a data set, the mean increases by `$c$`, but the standard deviation does not change.

- When we multiply every value in a data set by a factor `$k$`, the mean is multiplied by `$k$` and the standard deviation is multiplied by `$|k|$`.
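
Both effects can be seen numerically. The sketch below uses the city mpg sample from the earlier example; the shift `c = 10` and the factor `k = -2` are arbitrary choices for illustration. The negative factor shows that the standard deviation scales by the absolute value of the factor.

```python
from statistics import mean, stdev

data = [18, 21, 20, 21, 16, 18, 18, 18, 16, 20]   # city mpg sample
c, k = 10, -2                                     # hypothetical shift and scale factor

shifted = [x + c for x in data]
scaled = [k * x for x in data]

print(mean(shifted) - mean(data))     # a shift moves the mean by c
print(stdev(shifted) - stdev(data))   # a shift leaves the standard deviation unchanged
print(mean(scaled) / mean(data))      # scaling multiplies the mean by k
print(stdev(scaled) / stdev(data))    # scaling multiplies the standard deviation by |k|
```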

---

## Effect of Changes of Data on Statistical Measures

---

## Practice: Standard deviation under a transformation

A sample of the highest temperature of 10 days has a standard deviation `$5^\circ\mathrm{C}$` in Celsius.

1. If we want to know the standard deviation in Fahrenheit, do we need to recalculate using the sample?

2. What is the standard deviation in Fahrenheit?
  
---

## The Empirical Rule

If a data set has an **approximately bell-shaped** distribution, then

1. approximately 68% of the data lie within one standard deviation of the mean.

2. approximately 95% of the data lie within two standard deviations of the mean.

3. approximately 99.7% of the data lie within three standard deviations of the mean.

.footmark[
Image source: [Figure 2.16 "The Empirical Rule" in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s06-05-the-empirical-rule-and-chebysh.html#fwk-shafer-ch02_s05_s01_f02)
]
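
The three percentages are rounded values for an exactly normal (bell-shaped) distribution. A minimal check in Python, assuming an ideal normal model rather than real data:

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

def within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

proportions = {k: within(k) for k in (1, 2, 3)}
for k, p in proportions.items():
    print(k, round(100 * p, 1))
```

Real data that are only approximately bell-shaped will match these percentages only approximately.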

---

## Chebyshev’s Theorem

For any numerical data set, at least `$1-1/k^2$`
of the data lie within `$k$` standard deviations of the mean, where `$k$` is any positive whole number that is at least 2.

.footmark[
Image source: [Figure 2.19 "Chebyshev’s Theorem" in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s06-05-the-empirical-rule-and-chebysh.html#fwk-shafer-ch02_s05_s02_f01)
]

---

## Example: Applications of the Empirical Rule

A population data set with a bell-shaped distribution has mean `$\mu = 6$` and standard deviation `$\sigma = 2$`. Find the approximate proportion of observations in the data set that lie:

1. between 4 and 8;
2. below 4.

**Solution:** By the Empirical Rule, about 68% of the data lie between 6-2=4 and 6+2=8. Since the distribution is symmetric, about 34% of the data lie between 4 and 6, and about 34% lie between 6 and 8. Half of the data lie below the mean 6, so about 50%-34%=16% of the data lie below 4.

---

## Example: Applications of Chebyshev's Theorem

A sample data set has mean `$\bar{x}=6$`
and standard deviation `$s = 2$`. Find the minimum proportion of observations in the data set that must lie
between 2 and 10.

**Solution:** Since `$\bar{x}-2s=2$` and `$\bar{x}+2s=10$`, we apply Chebyshev's theorem with `$k=2$`: at least `$1-1/2^2=0.75$`, that is 75%, of the data lie between 2 and 10.
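
The computation can be sketched in Python. The function name `chebyshev_bound` is only an illustration; the numbers are those of this example.

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

mean, sd = 6, 2          # sample mean and standard deviation from the example
low, high = 2, 10        # interval of interest
k = (high - mean) / sd   # number of standard deviations from the mean to each endpoint
assert (mean - low) / sd == k   # the interval must be symmetric about the mean
print(k, chebyshev_bound(k))
```

Unlike the Empirical Rule, this bound makes no assumption about the shape of the distribution, so it is a guaranteed minimum rather than an approximation.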

---

## Practice: The empirical rule

<!-- A population data set with a bell-shaped distribution has mean `$\mu=2$` and standard deviation `$\sigma=1.1$`. Find the approximate proportion of observations in the data set that lie above 3.1.

.footmark[
    Source: [2.5 The Empirical Rule and Chebyshev’s Theorem in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s06-05-the-empirical-rule-and-chebysh.html#fwk-shafer-ch02_s05_s01_f02).
] -->

---

## Practice: Chebyshev’s Theorem

A sample data set has mean `$\bar{x}=10$` and standard deviation `$s = 3$`. Find the minimum proportion of observations in the data set that must lie between 1 and 19.

---

## Practice: Change of Measures on Transformation of Data

A teacher decides to curve the final exam by adding 10 points for each student. Which of
the following statistics will NOT change:  
A. median,   B. mean,   C. interquartile range,   D. standard deviation?  
**Please explain your conclusion.**

---

## Practice: Understand Standard Deviation From Graphs

Which distribution of data has the SMALLEST standard deviation? Please explain your conclusion.

.center[
![Distributions with different standard deviation](data:image/png;base64,#Figures/SD-Pic.png)
]

---

## Lab: How to Find the Mean, Median, Quartiles and Standard Deviation

- To find the mean, you may use the function `AVERAGE()`.

- To find the median, you may use the function `MEDIAN()`.

- To find quartiles, you may use the function `QUARTILE.EXC()`.

- To find the **p**opulation standard deviation, you may use the function `STDEV.P()`.

- To find the **s**ample standard deviation, you may use the function `STDEV.S()`.

---

## Lab: How to Create a Boxplot in Excel

- Select your data—either a single data series, or multiple data series.

- Click `Insert` > `Insert Statistic Chart` > `Box and Whisker` to create a boxplot.
  
For more information, see [Create a box and whisker chart in Excel 365](https://support.microsoft.com/en-us/office/create-a-box-and-whisker-chart-62f4219f-db4b-4754-aca8-4743f6190f0d)

---

## Lab: Practice - Car Speeds

Consider the following sample that consists of speeds of 20 cars.
.center[

```
# NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
```
]

Use Excel to answer the following questions

1. Find the mean, median, quartiles and standard deviation of the sample.
2. Create a boxplot to describe the data in the sample.

---
class: center middle topic

# Linear Relationship

---

## Learning Goals for Linear Regressions

- Summarize and interpret the relationship between two quantitative variables.

- Demonstrate understanding of concepts pertaining to linear regression.

- Use regression equations to make predictions and understand their limits.

---

## Scatterplots (1/5)

- Correlation refers to a relationship between two quantitative variables:  
  - the independent (or explanatory) variable, usually denoted by `$x$`.
  
  - the dependent (or response) variable, usually denoted by `$y$`.

- **Example:** In a study of educational attainment and annual salary, the number of years of education is the explanatory variable and the annual salary is the response variable.

- To describe the relationship between two quantitative variables, statisticians use a scatterplot.

- In a scatterplot, we describe the overall pattern with descriptions of direction, form, and strength.

---

## Scatterplots (2/5)

.pull-left[
- **Positive relationship**: the response variable (y) increases when the explanatory variable (x) increases.
  
.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-20-1.png" width="288" />
]
]
.pull-right[
- **Negative relationship**: the response variable (y) decreases when the explanatory variable (x) increases.
  
.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-21-1.png" width="288" />
]
]

---

## Scatterplots (3/5)

.pull-left[
-  **Linear form** 
  
.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-22-1.png" width="198" />
]
]
.pull-right[
- **Curvilinear form**
  
.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-23-1.png" width="198" />
]
]

- **No obvious relationship**
  
.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-24-1.png" width="198" />
]

---

## Scatterplots (4/5)

- The strength of the relationship is a description of how closely the data follow the form of the relationship.

.pull-left[
  .center[
    ![A picture shows a strong relationship](data:image/png;base64,#Figures/strong-relation.png)
  ]
]
.pull-right[
  .center[
    ![A picture shows a weaker relationship](data:image/png;base64,#Figures/weaker-relationship.png)
  ]
]

---

## Scatterplots (5/5)

- Outliers are points that deviate from the pattern of the relationship.

.center[
  ![:resize A picture shows an outlier in a relationship, 40%](data:image/png;base64,#Figures/outlier-in-relationship.png)
]

---

## Practice: Match Scatterplots

.pull-left[
**A:** X = month (January = 1), Y = rainfall (inches) in Napa, CA in 2010 (Note: Napa has rain in the winter months and months with little to no rainfall in summer.)

**B:** X = month (January = 1), Y = average temperature in Boston MA in 2010 (Note: Boston has cold winters and hot summers.)
]
.pull-right[
**C:** X = year (in five-year increments from 1970), Y = Medicare costs (in $) (Note: the yearly increase in Medicare costs has gotten bigger and bigger over time.)

**D:** X = average temperature in Boston MA (°F), Y = average temperature in Boston MA (°C) each month in 2010

**E:** X = chest girth (cm), Y = shoulder girth (cm) for a sample of men

**F:** X = engine displacement (liters), Y = city miles per gallon for a sample of cars (Note: engine displacement is roughly a measure of engine size. Large engines use more gas.)
]
.footmark[
  Source: [Scatterplots 2 of 5 in Concepts of Statistics](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/scatterplots-2-of-5/)
]

---

## The Correlation Coefficient - Definition (1/2)

- The correlation coefficient `$r$` is a numeric measure of the strength and direction of a linear relationship between two quantitative variables.
  $$
  r=\dfrac{\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{n-1},
  $$
  where `$n$` is the sample size, `$x$` is a data value for the explanatory variable, `$\bar{x}$` is the mean of the `$x$`-values, `$s_x$` is the standard deviation of the `$x$`-values, and similarly for the notation involving `$y$`.

- The expression `$z=\frac{x-\bar{x}}{s_x}$` is known as the standardized variable (or `$z$`-score) which
  - doesn't depend on the unit of the variable `$x$`,
  - has mean `$0$` and standard deviation 1.

- In Excel, the correlation coefficient can be calculated using the function `CORREL()`.

- [Scatterplots with different correlation coefficients](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/linear-relationships-2-of-4/)

???
$$
r=\dfrac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert\cdot \lVert\mathbf{y}\rVert}=\dfrac{\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{\sqrt{\sum\left(\frac{x-\bar{x}}{s_x}\right)^2}\cdot\sqrt{\sum\left(\frac{y-\bar{y}}{s_y}\right)^2}}=\dfrac{\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{n-1}.
$$

---

## The Correlation Coefficient - Definition (2/2)

- **Rounding Rule:** Round to the nearest thousandth for `$r$`, `$m$` and `$b$`.

- Geometric explanation of the definition of `$r$`.

.pull-left[
.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-25-1.png" width="252" />

`$r=$` 0.816
]
]
.pull-right[
.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-26-1.png" width="252" />

`$r=0.420$`
]
]

- **Remark:**
  - `$r>0$` if all points `$(x-\bar{x}, y-\bar{y})$` are in the 1st and the 3rd quadrants. 
  - `$r<0$` if all points `$(x-\bar{x}, y-\bar{y})$` are in the 2nd and the 4th quadrants.

---

## The Correlation Coefficient - Properties

- The correlation coefficient `$r$` is between `$-1$` and `$1$`.

- The closer the absolute value `$|r|$` is to `$1$`, the stronger the linear relationship is.

- The correlation is symmetric in `$x$` and `$y$`, that is `CORREL(x, y)=CORREL(y, x)`.

- The correlation does not change when the units of measurement of either one of the variables change. In other words, if we change the units of
measurement of the explanatory variable and/or the response variable, it has no effect on the correlation (r).

- The correlation by itself is not enough to determine whether a relationship is linear. It's important to graph a data set before analyzing it. [See Francis Anscombe's demonstration of both the importance of graphing data and the effect of outliers on statistical properties.](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)

- The correlation is heavily influenced by outliers. [Try the simulation in Linear Relationships (4 of 4) in Concepts in Statistics](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/linear-relationships-4-of-4/)
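
The symmetry and unit-invariance properties can be checked numerically. The sketch below computes `$r$` with an algebraically equivalent form of the definition; the paired data are made up for illustration, and the function name `correlation` is hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r of paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired data: temperature in Celsius and some response
celsius = [10, 15, 20, 25, 30]
response = [12, 18, 19, 27, 30]
fahrenheit = [1.8 * c + 32 for c in celsius]   # same temperatures, different unit

r_xy = correlation(celsius, response)
r_yx = correlation(response, celsius)          # swap the roles of x and y
r_f = correlation(fahrenheit, response)        # change the unit of x
print(round(r_xy, 3), round(r_yx, 3), round(r_f, 3))
```

All three printed values agree, because standardizing each variable removes its unit before the products are averaged.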

---

## Guess the Correlation Coefficient

---

## The Correlation Coefficient - Example (1/2)

Describe the relationship between Midterm 1 and Final for a sample of 11 students.

.left-column[
.center[
<table class="table table-striped table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Midterm1 </th>
   <th style="text-align:center;"> Final </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 72 </td>
   <td style="text-align:center;"> 72 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 93 </td>
   <td style="text-align:center;"> 88 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 81 </td>
   <td style="text-align:center;"> 82 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 82 </td>
   <td style="text-align:center;"> 82 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 94 </td>
   <td style="text-align:center;"> 88 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 80 </td>
   <td style="text-align:center;"> 77 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 73 </td>
   <td style="text-align:center;"> 78 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 71 </td>
   <td style="text-align:center;"> 77 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 81 </td>
   <td style="text-align:center;"> 76 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 81 </td>
   <td style="text-align:center;"> 76 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 63 </td>
   <td style="text-align:center;"> 68 </td>
  </tr>
</tbody>
</table>
]
]
--
.right-column[
**Solution:** First we create a scatterplot.
.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-28-1.png" width="216" />
]
Using the Excel function `CORREL(x, y)`, we find the correlation coefficient is
`$r=0.905$` .

The `$r$`-value shows a **strong positive linear** relationship.
]

---

## The Correlation Coefficient - Example (2/2)

.left-column[
- `$r$` can also be calculated by hand using the formula
`$r=\dfrac{\sum z_xz_y}{n-1}$`, where `$z_x=\frac{x-\bar{x}}{s_x}$` and 
`$z_y=\frac{y-\bar{y}}{s_y}$`.
]
.right-column[
| Midterm1 | Final    | `$z_x$`  | `$z_y$`  | `$z_xz_y$`  |
| -------- | -------- | -------- | -------- | ----------- |
| 72       | 72       | -0.78006 | -1.06926 | 0.834087814 |
| 93       | 88       | 1.50088  | 1.544483 | 2.318083715 |
| 81       | 82       | 0.197484 | 0.56433  | 0.111446332 |
| 82       | 82       | 0.306101 | 0.56433  | 0.172741815 |
| 94       | 88       | 1.609497 | 1.544483 | 2.485839773 |
| 80       | 77       | 0.088868 | -0.25246 | -0.02243591 |
| 73       | 78       | -0.67145 | -0.0891  | 0.059829084 |
| 71       | 77       | -0.88868 | -0.25246 | 0.224359064 |
| 81       | 76       | 0.197484 | -0.41582 | -0.08211835 |
| 81       | 76       | 0.197484 | -0.41582 | -0.08211835 |
| 63       | 68       | -1.75761 | -1.72269 | 3.027820885 |
| 79.18182 | 78.54545 |<- mean   | sum ->   | 9.047535876 |
| 9.206717 | 6.121497 |<- stdev.s|correl -> | 0.904753588 |
]
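
The hand calculation in the table can be reproduced in Python. This is a minimal sketch that follows the `$z$`-score formula step by step, using the 11 pairs of scores from the table.

```python
from statistics import mean, stdev

midterm = [72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63]
final = [72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68]

def z_scores(values):
    """Standardize values using the sample mean and the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx, zy = z_scores(midterm), z_scores(final)
products = [a * b for a, b in zip(zx, zy)]   # one product per student
r = sum(products) / (len(midterm) - 1)       # divide the sum by n - 1
print(round(r, 3))
```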

---

## Practice: Years And Winning Times

.pull-left[
The tables show a sample of 23 records on years and winning times for the 1,500 meter race in Olympic Games.
  - Draw a scatter plot for the data table.
  - Is it appropriate to study the relationship using a linear model?
  - Find and interpret the correlation coefficient.
]
.pull-right[
.pull-left[
<table class="table table-striped table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Year </th>
   <th style="text-align:center;"> Time </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 1900 </td>
   <td style="text-align:center;"> 246.0 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1904 </td>
   <td style="text-align:center;"> 245.4 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1908 </td>
   <td style="text-align:center;"> 243.4 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1912 </td>
   <td style="text-align:center;"> 236.8 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1920 </td>
   <td style="text-align:center;"> 241.8 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1924 </td>
   <td style="text-align:center;"> 233.6 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1928 </td>
   <td style="text-align:center;"> 233.2 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1932 </td>
   <td style="text-align:center;"> 231.2 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1936 </td>
   <td style="text-align:center;"> 227.8 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1948 </td>
   <td style="text-align:center;"> 229.8 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1952 </td>
   <td style="text-align:center;"> 225.1 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1956 </td>
   <td style="text-align:center;"> 221.2 </td>
  </tr>
</tbody>
</table>
]
.pull-right[
<table class="table table-striped table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Year </th>
   <th style="text-align:center;"> Time </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 1960 </td>
   <td style="text-align:center;"> 215.60 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1964 </td>
   <td style="text-align:center;"> 218.10 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1968 </td>
   <td style="text-align:center;"> 214.90 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1972 </td>
   <td style="text-align:center;"> 216.30 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1976 </td>
   <td style="text-align:center;"> 219.20 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1980 </td>
   <td style="text-align:center;"> 218.40 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1984 </td>
   <td style="text-align:center;"> 212.53 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1988 </td>
   <td style="text-align:center;"> 215.96 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1992 </td>
   <td style="text-align:center;"> 220.12 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1996 </td>
   <td style="text-align:center;"> 215.78 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> 212.32 </td>
  </tr>
</tbody>
</table>
] 
]

.footmark[Source: [suny-wmopen-concepts-statistics](https://courses.lumenlearning.com/suny-wmopen-concepts-statistics/chapter/assignment-linear-regression/)]

---

## Correlation vs. Causation

- Correlation is described by data from observational studies. Observational studies cannot prove cause and effect, which requires controlled studies and rigorous inference.

- Correlation may be used to make a prediction which is probabilistic.

- In a linear relationship, an `$r$`-value that is close to 1 or -1 is insufficient to claim that the explanatory variable causes changes in the response variable. The correct interpretation is that there is a statistical relationship between the variables.

- A **lurking variable** is a variable that is not measured in the study, but affects the interpretation of the relationship between the explanatory and response variables.

---

## Example: Correlation vs. Causation (1/2)

The scatterplot below shows the relationship between the number of firefighters sent to fires (x) and the amount of damage caused by fires (y) in a certain city.
![](data:image/png;base64,#Figures/scatterplot-firefigters.png)

Can we conclude that the increase in firefighters causes the increase in damage?

.footmark[
  Source: [Causation and Lurking Variables in Concepts in Statistics for more examples](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/linear-relationships-4-of-4/)
]

---

## Example: Correlation vs. Causation (2/2)

**Solution:**

1. Correlation: The more firefighters are sent, the greater the damage tends to be. However, this does not mean that the firefighters cause the damage.
  
2. Prediction: You could predict the amount of damage by looking at the number of firefighters present.
  
3. Causation: The firefighters are unlikely to be the cause of the damage.
  
4. Lurking variable: The seriousness of the fire is a lurking variable.

---

## Lab: Scatter Plots and Correlation Coefficient

- To create a scatter plot, first select the data sets, and then look for `Insert Scatter(X, Y)` in the menu `Insert`-> `Charts`.

- The correlation coefficient `$r$` can be calculated by the Excel function `CORREL()`.

---

# Linear Regression

---

## Learning Goals for Linear Regressions

- Summarize and interpret the relationship between two quantitative variables.

- Demonstrate understanding of concepts pertaining to linear regression.

- Use regression equations to make predictions and understand their limits.

---

## The Regression Line (1/2)

- The line that best summarizes a linear relationship is **the least squares regression line**. The regression line is the line with the smallest sum of squares of the errors (**SSE**).

- We use the least-squares regression line to predict the value `$\hat{y}$` for a value of the explanatory variable `$x$`.

- The regression line is unique and passes through `$(\bar{x}, \bar{y})$`. The equation is given by
  `$$\hat{y}=m(x-\bar{x})+\bar{y}=m x+b,$$`
  where the slope is `$$m=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sum(x-\bar{x})^2}=r\frac{s_y}{s_x}$$` and the `$y$`-intercept is
  `$b=\bar{y}-m\bar{x}.$`

---

## The Regression Line (2/2)

- The **error of a prediction** is
  `$$\text{Error}=\text{Observed}-\text{Predicted}=y-\hat{y}.$$`

- A prediction beyond the range of the data is called **extrapolation**.
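
The slope, intercept, prediction and error can be computed directly from the formulas. The sketch below uses the ten Old Faithful observations from the example in these slides; the function name `least_squares_line` is only an illustration.

```python
eruptions = [3.917, 4.2, 1.75, 4.7, 2.167, 1.75, 4.8, 1.6, 4.25, 1.8]
waiting = [84, 78, 47, 83, 52, 62, 84, 52, 79, 51]

def least_squares_line(xs, ys):
    """Return the slope m and the intercept b of the least-squares regression line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    b = y_bar - m * x_bar        # the line passes through (x_bar, y_bar)
    return m, b

m, b = least_squares_line(eruptions, waiting)
x_new, y_observed = 1.8, 51      # the last observation in the sample
y_hat = m * x_new + b            # predicted waiting time
error = y_observed - y_hat       # observed minus predicted
print(round(m, 3), round(b, 2), round(y_hat, 2), round(error, 2))
```

Since 1.8 minutes lies inside the range of observed eruption times, this prediction is not an extrapolation.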

---

## Example: Old Faithful Geyser (1/2)

The following sample is taken from data about the Old Faithful geyser.

1. Study the linear relationship.
2. Find the regression line, and the predicted value and the error if the eruption time is 1.8 minutes.

.pull-left[
.center[
<table class="table table-striped table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> eruptions </th>
   <th style="text-align:center;"> waiting </th>
   <th style="text-align:center;"> eruptions </th>
   <th style="text-align:center;"> waiting </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 3.917 </td>
   <td style="text-align:center;"> 84 </td>
   <td style="text-align:center;"> 1.75 </td>
   <td style="text-align:center;"> 62 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 4.200 </td>
   <td style="text-align:center;"> 78 </td>
   <td style="text-align:center;"> 4.80 </td>
   <td style="text-align:center;"> 84 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1.750 </td>
   <td style="text-align:center;"> 47 </td>
   <td style="text-align:center;"> 1.60 </td>
   <td style="text-align:center;"> 52 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 4.700 </td>
   <td style="text-align:center;"> 83 </td>
   <td style="text-align:center;"> 4.25 </td>
   <td style="text-align:center;"> 79 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 2.167 </td>
   <td style="text-align:center;"> 52 </td>
   <td style="text-align:center;"> 1.80 </td>
   <td style="text-align:center;"> 51 </td>
  </tr>
</tbody>
</table>
]
]
.pull-right[
  .center[
  <img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-32-1.png" width="216" />
  ]
]

---

## Example: Old Faithful Geyser (2/2)

**Solution:** The scatterplot shows a linear relationship between the eruption time `$x$` and the waiting time `$y$`.

To find the regression line, we use the Excel function `SLOPE()` to find the slope `$m$`. In this example, `$m=10.836$`.
  
We then use the Excel function `INTERCEPT()` to find the `$y$`-intercept `$b$`. In this example, `$b=33.68$`.
  
The equation of the line is `$\hat{y}=10.836x + 33.68$`.

When `$x=1.8$`, we have `$\hat{y}=10.836\cdot 1.8 + 33.68= 53.1848$`.

The error is
  `$y-\hat{y}=51-53.1848= -2.1848$`. That means the prediction overestimates the waiting time by about 2.18 minutes.
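
The Excel results can be cross-checked outside of Excel. The following is a minimal Python sketch (an illustration only, not part of the Excel lab) that applies the slope and intercept formulas from the Regression Line slide to the ten sample points. The variable names are chosen here for illustration.

```python
# Sample from the Old Faithful slide:
# eruption time (x, in minutes) and waiting time (y, in minutes).
eruptions = [3.917, 4.200, 1.750, 4.700, 2.167, 1.75, 4.80, 1.60, 4.25, 1.80]
waiting = [84, 78, 47, 83, 52, 62, 84, 52, 79, 51]

n = len(eruptions)
x_bar = sum(eruptions) / n
y_bar = sum(waiting) / n

# Slope m = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),
# intercept b = y_bar - m * x_bar.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(eruptions, waiting))
s_xx = sum((x - x_bar) ** 2 for x in eruptions)
m = s_xy / s_xx
b = y_bar - m * x_bar

# Prediction and error for the observed point (1.8, 51).
y_hat = m * 1.8 + b
error = 51 - y_hat

print(f"m = {m:.3f}, b = {b:.2f}, y_hat = {y_hat:.2f}, error = {error:.2f}")
```

Because the sketch does not round `$m$` and `$b$` before predicting, its predicted value may differ from the slide's value in the last digits.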

---

## Practice: Years and Winning Times

.pull-left[
The tables show a sample of 23 records on years and winning times for the 1,500 meter race in Olympic Games.

- Is it appropriate to study the relationship using a linear model?
- Find an equation of the regression line.
- Make a prediction of the winning time for the year 1998.
- What is the residual for the year 1992?
- Find and interpret the coefficient of determination.
]
.pull-right[
.pull-left[
<table class="table table-striped table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Year </th>
   <th style="text-align:center;"> Time </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 1900 </td>
   <td style="text-align:center;"> 246.0 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1904 </td>
   <td style="text-align:center;"> 245.4 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1908 </td>
   <td style="text-align:center;"> 243.4 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1912 </td>
   <td style="text-align:center;"> 236.8 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1920 </td>
   <td style="text-align:center;"> 241.8 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1924 </td>
   <td style="text-align:center;"> 233.6 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1928 </td>
   <td style="text-align:center;"> 233.2 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1932 </td>
   <td style="text-align:center;"> 231.2 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1936 </td>
   <td style="text-align:center;"> 227.8 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1948 </td>
   <td style="text-align:center;"> 229.8 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1952 </td>
   <td style="text-align:center;"> 225.1 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1956 </td>
   <td style="text-align:center;"> 221.2 </td>
  </tr>
</tbody>
</table>
]
.pull-right[
<table class="table table-striped table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Year </th>
   <th style="text-align:center;"> Time </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 1960 </td>
   <td style="text-align:center;"> 215.60 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1964 </td>
   <td style="text-align:center;"> 218.10 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1968 </td>
   <td style="text-align:center;"> 214.90 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1972 </td>
   <td style="text-align:center;"> 216.30 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1976 </td>
   <td style="text-align:center;"> 219.20 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1980 </td>
   <td style="text-align:center;"> 218.40 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1984 </td>
   <td style="text-align:center;"> 212.53 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1988 </td>
   <td style="text-align:center;"> 215.96 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1992 </td>
   <td style="text-align:center;"> 220.12 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 1996 </td>
   <td style="text-align:center;"> 215.78 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> 212.32 </td>
  </tr>
</tbody>
</table>
] 
]

.footmark[Source: [suny-wmopen-concepts-statistics](https://courses.lumenlearning.com/suny-wmopen-concepts-statistics/chapter/assignment-linear-regression/)]
---

## Assessing the Fit of a Regression Line (1/2)

- The prediction error is also called a **residual**. Another way to express the previous equation is
  `$$y=\hat{y}+\text{residual}.$$`

- **Residual plots** are used to determine if a linear model is appropriate. 
  
- A random pattern (or no obvious pattern) indicates a good fit of a linear model.  [See Assessing the Fit of a Line (2 of 4) in Concepts in Statistics for examples.](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/assessing-the-fit-of-a-line-2-of-4/)

- One measure of the fit of a regression line is the proportion of the variation in the response variable that is explained by the least-squares regression line.
  - The **total variation** is `$SSD=\sum(y-\bar{y})^2$`.
  - The **explained variation** is `$SSR=\sum(\hat{y}-\bar{y})^2$`.
  - The **coefficient of determination** is
    `$$r^2=\dfrac{SSR}{SSD}=\dfrac{\sum(\hat{y}-\bar{y})^2}{\sum(y-\bar{y})^2}.$$`

---

## Assessing the Fit of a Regression Line (2/2)

- Another measure of the fit of the regression is the **residual standard error** (or **standard error of the regression**), which can be calculated by the Excel function `STEYX()`:
  `$$s_e=\sqrt{\dfrac{SSE}{n-2}},$$`
  where `$SSE=\sum (y-\hat{y})^2$` is the sum of squared errors.

- The smaller `$s_e$` is, the more accurate the prediction is.

**Remark:**

- The `$r$` in the coefficient of determination is the correlation coefficient. Equivalently, `$r=\pm\sqrt{r^2}$`.
- The smaller the standard error, the larger the coefficient of determination:
  `$$r^2=1-\dfrac{SSE}{SSD}=1-\dfrac{(n-2)s_e^2}{SSD}.$$`

???

- `$n-2$` is the degrees of freedom. We lose two degrees of freedom because we estimate the slope and the `$y$`-intercept.

- In a linear regression model `$Y=\beta_0 + \beta_1 X +\epsilon$`, even if we know `$\beta_0$` and `$\beta_1$` for the population, we still need to estimate the standard deviation of the error.

---

## Example: Coefficient of Determination

Find the coefficient of determination for the Midterm1 and Final data.
.pull-left[
.center[
<table class="table table-striped table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Midterm1 </th>
   <th style="text-align:center;"> Final </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 72 </td>
   <td style="text-align:center;"> 72 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 93 </td>
   <td style="text-align:center;"> 88 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 81 </td>
   <td style="text-align:center;"> 82 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 82 </td>
   <td style="text-align:center;"> 82 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 94 </td>
   <td style="text-align:center;"> 88 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 80 </td>
   <td style="text-align:center;"> 77 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 73 </td>
   <td style="text-align:center;"> 78 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 71 </td>
   <td style="text-align:center;"> 77 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 81 </td>
   <td style="text-align:center;"> 76 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 81 </td>
   <td style="text-align:center;"> 76 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 63 </td>
   <td style="text-align:center;"> 68 </td>
  </tr>
</tbody>
</table>
]
]

.pull-right[
**Solution:**

The correlation coefficient is `$0.905$`.

The coefficient of determination is
`$$r^2=0.905^2\approx 0.819.$$`
]
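
As a cross-check of the definitions, here is a minimal Python sketch (an illustration only, with variable names chosen here) that computes `$r^2$` for the Midterm1 and Final data in two ways: by squaring the correlation coefficient, and as the ratio `$SSR/SSD$`.

```python
from math import sqrt

midterm1 = [72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63]
final = [72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68]

n = len(midterm1)
x_bar = sum(midterm1) / n
y_bar = sum(final) / n

s_xx = sum((x - x_bar) ** 2 for x in midterm1)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(midterm1, final))
ssd = sum((y - y_bar) ** 2 for y in final)  # total variation

# Least-squares line, used to get the predicted values y_hat.
m = s_xy / s_xx
b = y_bar - m * x_bar
y_hat = [m * x + b for x in midterm1]

ssr = sum((yh - y_bar) ** 2 for yh in y_hat)  # explained variation
r = s_xy / sqrt(s_xx * ssd)                   # correlation coefficient

r_squared_from_r = r ** 2
r_squared_from_ss = ssr / ssd
print(f"r = {r:.3f}, r^2 = {r_squared_from_r:.3f}, SSR/SSD = {r_squared_from_ss:.3f}")
```

The two values agree up to floating-point rounding, which is why the lab computes `$r^2$` simply by squaring `$r$`.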

---

## Example: Residual Standard Errors

Find the residual standard error of the regression line for the Midterm1 and Final data.
.pull-left[
.center[
<table class="table table-striped table-hover table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Midterm1 </th>
   <th style="text-align:center;"> Final </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 72 </td>
   <td style="text-align:center;"> 72 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 93 </td>
   <td style="text-align:center;"> 88 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 81 </td>
   <td style="text-align:center;"> 82 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 82 </td>
   <td style="text-align:center;"> 82 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 94 </td>
   <td style="text-align:center;"> 88 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 80 </td>
   <td style="text-align:center;"> 77 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 73 </td>
   <td style="text-align:center;"> 78 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 71 </td>
   <td style="text-align:center;"> 77 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 81 </td>
   <td style="text-align:center;"> 76 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 81 </td>
   <td style="text-align:center;"> 76 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 63 </td>
   <td style="text-align:center;"> 68 </td>
  </tr>
</tbody>
</table>
]
]

.pull-right[
**Solution:** In Excel we can use `STEYX()` to find the residual standard error.
For the Midterm1 and Final data, the residual standard error of the regression is `$s_e\approx 2.748$`. For comparison, in the Old Faithful example it is `$s_e\approx 5.258$`.
]
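
The formula `$s_e=\sqrt{SSE/(n-2)}$` can also be evaluated directly. The following Python sketch is an illustration only; the helper name `residual_standard_error` is hypothetical and not an Excel or library function. It applies the formula to both the Midterm1 and Final data and the Old Faithful sample.

```python
from math import sqrt


def residual_standard_error(xs, ys):
    """Return s_e = sqrt(SSE / (n - 2)) for the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx
    b = y_bar - m * x_bar
    # SSE: sum of squared differences between observed and predicted values.
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (n - 2))


midterm1 = [72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63]
final = [72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68]

eruptions = [3.917, 4.200, 1.750, 4.700, 2.167, 1.75, 4.80, 1.60, 4.25, 1.80]
waiting = [84, 78, 47, 83, 52, 62, 84, 52, 79, 51]

s_e_exam = residual_standard_error(midterm1, final)
s_e_geyser = residual_standard_error(eruptions, waiting)
print(f"exam data: s_e = {s_e_exam:.3f}; geyser data: s_e = {s_e_geyser:.3f}")
```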

---

## Practice: Exercises from the Textbook
  
- [Making Predictions](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/linear-regression-1-of-4/)

- [Regression Equation](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/linear-regression-3-of-4/)

- [Residual Plots](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/assessing-the-fit-of-a-line-2-of-4/)

- [Coefficient of Determination](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/assessing-the-fit-of-a-line-3-of-4/)

- [Standard Error](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/assessing-the-fit-of-a-line-4-of-4/)

---

## Lab: Slope, `$y$`-intercept, `$r^2$` and `$s_e$`

- The slope of a linear regression can be calculated by the Excel function `SLOPE()`.

- The `$y$`-intercept of a linear regression can be calculated by the Excel function `INTERCEPT()`.

- The coefficient of determination can be calculated by first finding `$r$` with `CORREL()` and then squaring it to get `$r^2$`.

- The standard error of the regression (residual standard error) can be calculated by the Excel function `STEYX()`.

---
class: center middle topic

# Two-Way Tables and Relations Between Categorical Variables

---

## Learning Goals for Two-way Tables

- Summarize and interpret the relationship between two qualitative (categorical) variables using two-way tables.

- Demonstrate understanding of conditional, joint and marginal probabilities, and find them from a two-way frequency table.

- Create and analyze two-way tables to answer probability questions.

---

## Two-way Frequency Tables (1/2)

- As we organize and analyze data from two categorical variables, we make use of two-way tables.

- Information in a **two-way frequency table**:
  
  - Values of the two variables are displayed in the left column and the top row.
  
  - The body of the table consists of frequency counts associated with pairs of values of the two variables.
  
  - The right column and the bottom row, which are called margins of the table, consist of row totals and column totals respectively.

---

## Two-way Frequency Tables (2/2)

- A number in a margin is called a **marginal frequency**. The marginal frequencies of a variable together form its **marginal distribution**.

- A number in the body of the table is called a **joint frequency**.

---

## Example: Body Image and Gender

The following table summarizes responses of a random sample of 1,200 U.S. college students as part of a larger survey.

.center[
|                   | About Right | Overweight | Underweight | Row Totals |
| ----------------- | --------------- | -------------- | --------------- | -------------- |
| Female        | 560             | 163            | 37              | 760            |
| Male          | 295             | 72             | 73              | 440            |
| Column Totals | 855             | 235            | 110             | 1,200          |
]

.footmark[
  Source: [https://courses.lumenlearning.com/wmopen-concepts-statistics](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/two-way-tables-1-of-5/)
]

---

## Two-Way Relative Frequency Tables and Probability

- A **two-way relative frequency table** is obtained from a two-way frequency table by converting  frequencies in a two-way table to relative frequencies.

- **Marginal probability**
  `$$P(X)=\frac{\text{Marginal frequency in}~ X}{\text{Total}}$$`

- **Conditional probability**
  `$$P(X|Y)=\frac{\text{Joint frequency}}{\text{Marginal Frequency in}~Y} \quad \text{or}\quad P(Y|X)=\frac{\text{Joint frequency}}{\text{Marginal Frequency in}~X}$$`

- **Joint probability**
  `$$P(X ~\text{and}~ Y)=\frac{\text{Joint frequency}}{\text{Total}}$$`

- Note that `$P(X~\text{and}~Y)=P(X)\cdot P(Y|X)=P(Y)\cdot P(X|Y).$`

---

## Example: Joint and marginal probabilities of body image and gender

The following table shows joint and marginal probabilities of body image and gender.

.center[
|                 | About Right | Overweight | Underweight | Row Totals |
| --------------- | ----------- | ---------- | ----------- | ---------- |
| Female          | `$\frac{560}{1200}=46.67\%$`      | `$\frac{163}{1200}=13.58\%$`     | `$\frac{37}{1200}=3.08\%$`       | `$\frac{760}{1200}=63.33\%$`     |
| Male            | `$\frac{295}{1200}=24.58\%$`      | `$\frac{72}{1200}=6.00\%$`      | `$\frac{73}{1200}=6.08\%$`       | `$\frac{440}{1200}=36.67\%$`     |
| Column   Totals | `$\frac{855}{1200}=71.25\%$`      | `$\frac{235}{1200}=19.58\%$`     | `$\frac{110}{1200}=9.17\%$`       | `$\frac{1200}{1200}=100.00\%$`    |
]

---

## Example: Conditional probabilities of body image by gender

The following table shows the conditional probabilities of body image given gender, that is, the probability that a randomly selected female (or male) has a certain body image.

.center[
|                 | About Right | Overweight | Underweight | Row Totals |
| --------------- | ----------- | ---------- | ----------- | ---------- |
| Female          | `$\frac{560}{760}=73.68\%$`      | `$\frac{163}{760}=21.45\%$`     | `$\frac{37}{760}=4.87\%$`       | `$\frac{760}{760}=100.00\%$`    |
| Male            | `$\frac{295}{440}=67.05\%$`      | `$\frac{72}{440}=16.36\%$`     | `$\frac{73}{440}=16.59\%$`      | `$\frac{440}{440}=100.00\%$`    |
]
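
The joint, marginal and conditional probabilities in the two tables above can be reproduced from the frequency table. The following is a minimal Python sketch (an illustration only; the dictionary layout and names are chosen here and are not part of the Excel workflow).

```python
# Two-way frequency table from the body image example
# (rows: gender, columns: body image).
counts = {
    "Female": {"About Right": 560, "Overweight": 163, "Underweight": 37},
    "Male": {"About Right": 295, "Overweight": 72, "Underweight": 73},
}

row_totals = {g: sum(row.values()) for g, row in counts.items()}
total = sum(row_totals.values())

# Joint probability: joint frequency / grand total.
joint = {g: {k: v / total for k, v in row.items()} for g, row in counts.items()}

# Marginal probability of each gender: row total / grand total.
marginal_gender = {g: t / total for g, t in row_totals.items()}

# Conditional probability of body image given gender: joint frequency / row total.
conditional = {
    g: {k: v / row_totals[g] for k, v in row.items()} for g, row in counts.items()
}

for g in counts:
    for k in counts[g]:
        print(f"P({k} | {g}) = {conditional[g][k]:.2%}")
```

Multiplying `marginal_gender[g]` by `conditional[g][k]` gives back `joint[g][k]`, which is the product rule `$P(X~\text{and}~Y)=P(X)\cdot P(Y|X)$` stated earlier.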

---

## Example: Community College Enrollment (1/2)

The following table summarizes the full-time enrollment at a community college.

.center[
  |                   | Arts-Sci | Bus-Econ | Info Tech | Health Science | Graphics Design | Culinary Arts | Row Totals |
| ----------------- | ------------ | ------------ | ------------- | ------------------ | ------------------- | ----------------- | -------------- |
| Female        | 4,660        | 435          | 494           | 421                | 105                 | 83                | 6,198          |
| Male          | 4,334        | 490          | 564           | 223                | 97                  | 94                | 5,802          |
| Column Totals | 8,994        | 925          | 1,058         | 644                | 202                 | 177               | 12,000         |
]

What proportion of the total number of students are male students?

--
**Solution:**
`$$P(\text{Male})=\dfrac{5802}{12000}\approx 0.4835=48.35\%.$$`

.footmark[
  Source: [https://courses.lumenlearning.com/wmopen-concepts-statistics](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/two-way-tables-2-of-5/)
]

---

## Example: Community College Enrollment (2/2)

If we select a male student at random, what is the probability that he is in the Info Tech program?

--
**Solution:**
`$$P(\text{Info Tech}|\text{Male})=\dfrac{564}{5802}\approx 0.097=9.7\%.$$`

--
If we select a student at random, what is the probability that the student is both a male and in the Info Tech program?

--
**Solution:**
`$$P(\text{Male and Info Tech})=\dfrac{564}{12000}= 0.047=4.7\%.$$`

--
How are the probabilities above related?

--
**Solution:**
`$$P(\text{Male and Info Tech})=\dfrac{564}{12000}=\dfrac{5802}{12000}\cdot \dfrac{564}{5802}=P(\text{Male})\cdot P(\text{Info Tech}|\text{Male}).$$`

---

## Practice: A table relates weights and heights

This table relates the weights and heights of a group of individuals participating in an observational study.

| Weight/Height | Tall | Medium | Short |
| ------------ | --- | ----- | ---- |
| Obese         | 18   | 28     | 14    |
| Normal        | 20   | 51     | 28    |
| Underweight   | 12   | 25     | 9     |

1. Find the total for each row and column.
2. Find the probability that a randomly chosen individual from this group is Short.
3. Find the probability that a randomly chosen individual from this group is Obese and Short.
4. Find the probability that a randomly chosen individual from this group is Underweight given that the individual is Tall.

---

## Test of (No) Association

- To understand association between categorical variables, we may think conversely. How do we test no association?

- If the conditional probabilities are nearly equal for all categories, there may be no association between the variables. Conversely, if the conditional probabilities are different enough, we can be confident that there is an association.
  
- In general, the bigger the differences in the conditional probabilities, the stronger the association between the variables.

- Two variables `$X$` and `$Y$` are **independent** if `$P(X~\text{and}~Y)=P(X)\cdot P(Y)$`.
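
To see what the independence condition looks like with numbers, the following Python sketch (an illustration only, with names chosen here) compares `$P(X~\text{and}~Y)$` with `$P(X)\cdot P(Y)$` for one cell of the body image and gender table from the earlier example. It only compares the two numbers; deciding how large a gap is "different enough" is not addressed by the sketch.

```python
# Two-way frequency table from the body image and gender example.
counts = {
    "Female": {"About Right": 560, "Overweight": 163, "Underweight": 37},
    "Male": {"About Right": 295, "Overweight": 72, "Underweight": 73},
}

row_totals = {g: sum(row.values()) for g, row in counts.items()}
col_totals = {k: sum(counts[g][k] for g in counts) for k in counts["Female"]}
total = sum(row_totals.values())

# Joint probability of one cell: P(Female and Underweight).
p_joint = counts["Female"]["Underweight"] / total

# Product of the two marginal probabilities: P(Female) * P(Underweight).
p_product = (row_totals["Female"] / total) * (col_totals["Underweight"] / total)

print(f"P(Female and Underweight)   = {p_joint:.4f}")
print(f"P(Female) * P(Underweight)  = {p_product:.4f}")
```

If gender and body image were independent, the two printed values would be equal for every cell of the table.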

---

## Example: Association between body image and gender (1/2)

Is body image related to gender?

---

## Example: Association between body image and gender (2/2)

**Solution:** Using Excel (stacked bar chart), we may compare side by side the conditional body image distributions for females and males.
.center[
![resize: Stacked Bar Chart for Gender and Body Images, 80%](data:image/png;base64,#Figures/Gender-Body.png)
]

As a result of our analysis, we know that the conditional distributions for males and females for body image are not the same. There is enough of a difference to believe that those two categorical variables are in fact related.

---

## Percentage Reduction of Risk

- When calculating the probability of a negative outcome, we often refer to the probability as a **risk**.

- In general, we are interested in determining how much a new treatment reduces the risk compared to a reference risk.

- The **percentage reduction of risk** is

`$$\text{percentage reduction of risk}=\frac{\text{new treatment risk}-\text{reference risk}}{\text{reference risk}}.$$`

---

## Example: Risk and the Physicians’ Health Study (1/2)

Researchers in the Physicians’ Health Study (1989) designed a randomized double-blind experiment to determine whether aspirin reduces the risk of heart attack. Here are the final results.

|                   | **Heart Attack** | **No Heart Attack** | **Row Totals** |
| ----------------- | ---------------- | ------------------- | -------------- |
| **Aspirin**       | 139              | 10,898              | 11,037         |
| **Placebo**       | 239              | 10,795              | 11,034         |
| **Column Totals** | 378              | 21,693              | 22,071         |

*Does aspirin lower the risk of having a heart attack?*

.footmark[
  Source: [https://courses.lumenlearning.com/wmopen-concepts-statistics](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/two-way-tables-4-of-5/)
]

---

## Example: Risk and the Physicians’ Health Study (2/2)

**Solution:** To answer this question, we compare two conditional probabilities:

- The probability of a heart attack given that aspirin was taken every other day.
  `$$P(\text{heart attack}|\text{aspirin}) = 139 / 11{,}037 \approx 0.013$$`
- The probability of a heart attack given that a placebo was taken every other day.
  `$$P(\text{heart attack}|\text{placebo}) = 239 / 11{,}034 \approx 0.022$$`

The result shows that taking aspirin reduced the risk from 0.022 to 0.013.

The percentage reduction of risk is
$$
\frac{0.013-0.022}{0.022}=\frac{-0.009}{0.022}\approx -0.41.
$$

The negative sign indicates a decrease. Therefore, we conclude that taking aspirin results in a 41% reduction in risk.
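
The same calculation can be sketched in Python (an illustration only; the names are chosen here). The sketch computes the change both from the risks rounded to three decimals, as on the slide, and from the unrounded risks.

```python
# Counts from the Physicians' Health Study table.
heart_attacks = {"Aspirin": 139, "Placebo": 239}
group_sizes = {"Aspirin": 11037, "Placebo": 11034}

# Risk = conditional probability of a heart attack given the treatment.
risk = {g: heart_attacks[g] / group_sizes[g] for g in heart_attacks}

# Relative change of the new treatment risk compared to the reference (placebo) risk.
change = (risk["Aspirin"] - risk["Placebo"]) / risk["Placebo"]

# Same formula, using the risks rounded to three decimals as on the slide.
risk_rounded = {g: round(p, 3) for g, p in risk.items()}
change_rounded = (risk_rounded["Aspirin"] - risk_rounded["Placebo"]) / risk_rounded["Placebo"]

print(f"unrounded risks: change = {change:.1%}")
print(f"rounded risks:   change = {change_rounded:.1%}")
```

With unrounded risks the reduction comes out slightly larger (about 42%) than with the rounded risks used on the slide (about 41%); the difference is only due to rounding.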

---

## Hypothetical Two-way Tables

A **hypothetical two-way table**, also known as a hypothetical 1000 two-way table, is a two-way table constructed from given probability conditions with 1000 as the total frequency. It can be used to answer complex probability questions.
  
---

## Example: Birth gender prediction (1/2)

A pregnant woman often opts to have an ultrasound to predict the gender of her baby.
Assume the following facts are known:

- Fact 1: 48% of the babies born are female.
- Fact 2: The proportion of girls correctly identified is 9 out of 10.
- Fact 3: The proportion of boys correctly identified is 3 out of 4.

Use the above facts to answer the following questions.

1. If the examination predicts a girl, how likely is it that the baby will be a girl?

2. If the examination predicts a boy, how likely is it that the baby will be a boy?

---

## Example: Birth gender prediction (2/2)

**Solution:** Assume that we have ultrasound predictions for 1,000 random babies.

- Fact 1 means that `$P(\text{Girl})=48\%$`.
- Fact 2 means that `$P(\text{Predicted as girl}|\text{Girl})=9/10$`.
- Fact 3 means that `$P(\text{Predicted as boy}|\text{Boy})=3/4$`.

Using those facts, we may create a two-way frequency table.
.center[
|                 | Girl           | Boy             | Row Totals  |
| --------------- | -------------- | --------------- | ----------- |
| Predict Girl  | 0.90(480)= 432 | 0.25(520) = 130 | 432+130=562 |
| Predict Boy   | 480-432=48     | 520-130=390     | 48+390=438  |
| Column   Totals | 480            | 1000-480=520    | 1,000       |
]

If the examination predicts a girl, the probability that the baby is a girl is
`$$P(\text{Girl}|\text{predict girl})=\frac{432}{562} \approx 0.769=76.9\%.$$`

If the examination predicts a boy, the probability that the baby is a boy is
`$$P(\text{Boy} | \text{predict boy}) = \frac{390}{438} \approx 0.890=89.0\%.$$`
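
The construction of the hypothetical table can be sketched in Python (an illustration only; the variable names are chosen here). Exact fractions are used so that the counts match the table without floating-point rounding.

```python
from fractions import Fraction

total = 1000                               # hypothetical number of babies
p_girl = Fraction(48, 100)                 # Fact 1
p_pred_girl_given_girl = Fraction(9, 10)   # Fact 2
p_pred_boy_given_boy = Fraction(3, 4)      # Fact 3

# Column totals.
girls = total * p_girl
boys = total - girls

# Body of the table: correctly identified babies first, the rest by subtraction.
girl_pred_girl = p_pred_girl_given_girl * girls
girl_pred_boy = girls - girl_pred_girl
boy_pred_boy = p_pred_boy_given_boy * boys
boy_pred_girl = boys - boy_pred_boy

# Conditional probabilities given the prediction (divide by the row totals).
p_girl_given_pred_girl = girl_pred_girl / (girl_pred_girl + boy_pred_girl)
p_boy_given_pred_boy = boy_pred_boy / (boy_pred_boy + girl_pred_boy)

print(f"P(Girl | predict girl) = {float(p_girl_given_pred_girl):.3f}")
print(f"P(Boy  | predict boy)  = {float(p_boy_given_pred_boy):.3f}")
```

The choice of 1,000 as the total does not affect the answers; any total gives the same ratios.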

---

## Practice

The table below is based on a 1988 study of accident records conducted by the Florida State Department of Highway Safety.
.center[
|                   | **Nonfatal** | **Fatal** | **Row Totals** |
| ----------------- | ------------ | --------- | -------------- |
| **Seat Belt**     | 412,368      | 510       | 412,878        |
| **No Seat Belt**  | 162,527      | 1,601     | 164,128        |
| **Column Totals** | 574,895      | 2,111     | 577,006        |
]

*Does wearing a seat belt lower the risk of an accident resulting in a fatality?*

.footmark[Source: [https://courses.lumenlearning.com/wmopen-concepts-statistics](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/two-way-tables-4-of-5/)
]

---

## Practice: Drug screening

A large company has instituted a mandatory employee drug screening program. Assume that the drug test used is known to be 99% accurate. That is, if an employee is a drug user, the test will come back positive (“drug detected”) 99% of the time. If an employee is a non-drug user, then the test will come back negative (“no drug detected”) 99% of the time. Assume that 2% of the employees of the company are drug users.

If an employee’s drug test comes back positive, what is the probability that the test is wrong and the employee is in fact a non-drug user?

.footmark[
  Source: [https://courses.lumenlearning.com/wmopen-concepts-statistics](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/two-way-tables-5-of-5/)
]

---

## Create Stacked Bar Chart

- To create a stacked bar chart of a two-way table:
  
  - First select the data table.
  
  - Look for and click `Insert Column or Bar Chart` in the menu `Insert`-> `Charts`.
  
  - In the dropdown menu, choose the third option in 2-D Column (`100% Stacked Column`) or the third option in 2-D Bar (`100% Stacked Bar`).
  
  - To switch row/column, in the output graph, right click the row axis or the column axis, and choose the option `Select Data...` to make the switch.

---

## Practice: Is there an association between gender and program selection

The following table summarizes results from a study on program selection and gender.

Use Excel to answer the following question about the study.

- Is there an association between gender and program selection? Why or why not?

- If they are associated, is the association strong or weak?

---
class: center middle topic

# Basic Concepts of Probability

---

## Learning Goals for Basic Concepts of Probability

- Construct sample spaces and calculate probabilities for simple and/or compound events.

- Use appropriate rules of probability for compound events.

---

## Experiments, Sample Spaces, and Events

- An **experiment** is a procedure that can be infinitely repeated and has a well-defined set of outcomes.

- An **outcome** is the result of a single trial (individual repetition) of an experiment.

- A **chance experiment** is an experiment that has more than one possible outcome and whose outcomes cannot be predicted with certainty.

- The **sample space** of a chance experiment is the set of all possible outcomes.

- An **event** is a subset of the sample space.

---

## Example: Chance Event

.footmark[
  Source: [https://seeing-theory.brown.edu/basic-probability/index.html#section1](https://seeing-theory.brown.edu/basic-probability/index.html#section1)
]
---

## Complement, Intersection and Union

- The **complement** `$E^c$` of event `$E$` is the set of all outcomes in a sample space that are **NOT** included in event `$E$`.

- The **intersection** `$A\cap B$` of two events `$A$` and `$B$` is the set of all outcomes in the sample space that are shared by `$A$` and `$B$`.

- The **union** `$A\cup B$` of two events `$A$` and `$B$` is the set of all outcomes in the sample space that are in `$A$` or in `$B$` (or in both).

- Two events `$A$` and `$B$` are **mutually exclusive** if their intersection `$A\cap B$` is empty.

---

## Venn Diagrams for Complement, Intersection, and Union

---

## Classical Definition of Probability

- A **probability** `$P(E)$` is a measure of how likely it is that an outcome in the event `$E$` will occur in a chance experiment.

- **Equally likely** means that each outcome of an experiment occurs with equal chance.

- When the outcomes in the sample space `$S$` of a chance experiment are ***equally likely***, the probability of an event `$E$` is
  `$$P(E)=\dfrac{\text{number of outcomes in }E}{\text{number of outcomes in }S}$$`

- Chance experiments that involve tossing fair coins, rolling fair dice and drawing a card from a well-mixed deck of cards have equally likely outcomes.

- Note that many chance experiments do not have equally likely outcomes. For example, the majors of students in a class are not equally likely outcomes.

---

## Example: Flipping a coin

Imagine flipping one fair coin (which means the chance of a head and the chance of a tail are the same). What is the probability of getting a head?

**Solution:**
There are two possible outcomes: **Head** or **Tail**.

So the sample space is the set
`$$S = \{\text{Head}, \text{Tail}\}.$$`

The event `$E$` of getting a head is the subset
`$$E=\{\text{Head}\}.$$`

The probability `$P(E)=\dfrac{1}{2}=0.5$`.

---

## Empirical Probability

- An **empirical (or a statistical) probability** is the relative frequency of occurrence of outcomes from observations in repeated experiments:

$$
`\begin{aligned}
  P(E)=&\dfrac{\text{number of occurrences of event } E}{\text{total number of observations}}\\[0.5em]
  =&\dfrac{\text{frequency in }E}{\text{total frequency}}.
\end{aligned}`
$$

---

## Example: Chance of selecting a math major

A statistics class has 5 math majors and 20 other majors. If a student is randomly selected from the class, what's the probability that the selected student is a math major?

**Solution:**
The sample space is the set of all students in the statistics class.
The event is the set of the 5 math majors.
Then the probability is
$$
  P(E)=\dfrac{\text{frequency of math majors}}{\text{total frequency of students}}=\dfrac{5}{25}=0.2.
$$

---

## Theoretical Probability

- A theoretical probability is a probability that is calculated from mathematical theory and assumptions rather than from observed data.

- When all outcomes in the sample space are equally likely, the probability of a desired event `$E$`, known as a **theoretical probability**,  is calculated by

$$
  P(E)=\dfrac{\text{number of desired outcomes for event }E}{\text{number of all possible outcomes}}.
  $$

- **Tree diagrams** are often used for counting all possible outcomes.

---

## Example: Flipping a coin twice

Find the probability of getting two heads when tossing a fair coin twice.

**Solution:** On the first flip, the coin has two possible outcomes. On the second flip, the coin again has two possible outcomes. By the fundamental counting principle, the sample space `$S$` contains `$2\cdot 2=4$` possible outcomes.
.center[
![A tree demonstrating outcomes of flipping a coin twice.](data:image/png;base64,#Figures/coin-tree.jpeg)
]
The event `$E$` of getting 2 heads contains only one outcome: head and head. So the probability of getting two heads when flipping a fair coin twice is
`$$P(E)=\frac{1}{4}.$$`
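
The counting argument can also be carried out by listing the sample space. The following Python sketch (an illustration only) enumerates all outcomes of two flips and counts those in the event.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs of outcomes for two flips.
sample_space = list(product(["H", "T"], repeat=2))

# Event E: both flips are heads.
event = [outcome for outcome in sample_space if outcome == ("H", "H")]

# Equally likely outcomes, so P(E) = |E| / |S|.
p_two_heads = Fraction(len(event), len(sample_space))

print(sample_space)
print(p_two_heads)
```

Changing `repeat=2` to a larger number lists the sample space for more flips, which plays the same role as extending the tree diagram by further levels.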

---

## Empirical vs Theoretical: Coin Flip Simulation

The purpose of this activity is to experiment with a simulation of flipping a **fair** coin, and to see whether `$P(H) = 0.5$`.

.footmark[
  Source: [GeoGebra](http://ggbtu.be/mLZbwMZtJ) License: [CC BY SA](http://creativecommons.org/licenses/by/3.0/us/)
]

---

## Law of Large Numbers

- The empirical probability of an event is an "estimate" that is based upon observed data from an experiment.

- The theoretical probability of an event is an "expected" probability based upon counting rules.

- **Law of Large Numbers:** As an experiment is repeated over and over, that is, as the number of trials gets larger and larger, the empirical probability of an event approaches the theoretical probability of the event. ([Wiki: Law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers).)

- By the law of large numbers, we can say that the probability of any event is the **long-term relative frequency** of that event.
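
A small simulation can illustrate this convergence. The sketch below assumes Python's standard `random` module as the source of randomness; the function name `empirical_probability_of_heads` and the seed value are illustrative choices, not part of the course materials.

```python
import random

def empirical_probability_of_heads(num_flips, seed=2020):
    """Simulate fair coin flips and return the relative frequency of heads."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# The relative frequency should settle near the theoretical value 0.5.
for n in (10, 100, 10_000, 100_000):
    print(n, empirical_probability_of_heads(n))
```

For a small number of flips the relative frequency can be far from 0.5; it is only in the long run that it is expected to stay close.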

---

## Example: Law of large numbers by a coin flipping simulation

A demonstration of the law of large numbers by simulating coin flipping.

.footmark[
  Source: [http://digitalfirst.bfwpub.com/stats_applet/stats_applet_10_prob.html](http://digitalfirst.bfwpub.com/stats_applet/stats_applet_10_prob.html)
]
---

## Practice: Red light runner

---

## Practice: Rolling two dice

Two **fair** dice are thrown. Find the probabilities of the following events:

- the sum of the two numbers is 3.
- the sum of the two numbers is at most 3.

---

## Fundamental Properties (that define probability)

- **Property 1:** For an event `$E$`, the probability `$P(E)$` lies between 0 and 1:
  `$$0\leq P(E)\leq 1.$$`

- **Property 2:** If `$S$` is the sample space, then `$P(S)=1$`.

- **Property 3:** The probability of an event `$E=\{e_1,e_2, \dots, e_k\}$` of distinct outcomes is equal to the sum of the probabilities of the individual outcomes:
  `$$P(E)=P(e_1)+P(e_2)+\cdots+P(e_k)$$`
  where `$P(e_i)$` is the probability of getting the outcome `$e_i$`.

**Remark:** When an event `$E$` consists of infinitely many outcomes, the right hand side of the equality in Property 3 will be an infinite sum.

---

## Two Easy Consequences

- **Easy consequence 1:** If events `$A$` and `$B$` are mutually exclusive, then
  `$$P(A\cup B)=P(A)+P(B).$$`

- **Easy consequence 2:** The probability `$P(E)$` of an event `$E$` and the probability `$P(E^c)$` of the complement event `$E^c$` satisfy the identity:
  `$$P(E)+P(E^c)=1.$$`

Equivalently,
  $$
  P(E^c)=1-P(E)\quad\text{or}\quad P(E)=1-P(E^c).
  $$

---

## Example: Six-sided die (1/2)

A six-sided fair die is rolled.

1. Find the probability of the event `$E$` getting a number less than 3, that is, `$E=\{x \mid x <3\}$`.
2. Find the probability of the complement `$E^c$` of the event `$E$`.
3. Verify that `$P(E)+P(E^c)=1$`.

---

## Example: Six-sided die (2/2)

**Solution:** The sample space of the six-sided die is `$\{1,2,3,4,5,6\}$`.

The event `$E$` consists of 2 numbers: `$E=\{1, 2\}$`.

The probability is
$$
P(E)=P(x=1)+P(x=2)=\frac16+\frac16=\frac26=\frac13.
$$

The complement `$E^c=\{3, 4, 5, 6\}$` and the probability is
$$
`\begin{aligned}
  P(E^c)=&P(x=3)+P(x=4)+P(x=5)+P(x=6)\\
  =&\frac16+\frac16+\frac16+\frac16
  =\frac46=\frac23.
\end{aligned}`
$$

It is clear that `$P(E)+P(E^c)=\frac13+\frac23=1.$`

---

## Example: Sum of two numbers from rolling two dice (1/2)

Two six-sided fair dice were rolled. Find the probability of getting two numbers whose sum is at least 4.

**Solution:** Let `$E$` be the event that the sum is at least 4. Then the complement `$E^c$` consists of the pairs of numbers whose sum is at most 3. There are 3 such pairs:
`$$E^c=\{(1, 1), (1, 2), (2, 1)\}.$$`

The sample space contains `$6\cdot 6=36$` possible outcomes.

.footmark[
  Source of image: [https://sasandr.wordpress.com/2012/05/04/rolling-the-dice-ii/](https://sasandr.wordpress.com/2012/05/04/rolling-the-dice-ii/)
]

---

## Example: Sum of two numbers from rolling two dice (2/2)

**Solution: (continued)**

Therefore,
$$
`\begin{aligned}
  P(E^c)=&P((1,1))+P((1,2))+P((2,1))\\
  =&\frac16\cdot\frac16+\frac16\cdot\frac16+\frac16\cdot\frac16=\frac{1}{12}.
\end{aligned}`
$$

Applying the complement rule, we find
`$$P(E)=1-P(E^c)=\frac{11}{12}.$$`
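
The same answer can be obtained by enumerating all 36 outcomes. This is a minimal Python sketch under the equally-likely assumption; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely ordered outcomes of rolling two dice.
rolls = list(product(range(1, 7), repeat=2))

# Complement event: the sum is at most 3.
complement = [r for r in rolls if sum(r) <= 3]

p_complement = Fraction(len(complement), len(rolls))
p_at_least_4 = 1 - p_complement  # complement rule

print(complement)    # [(1, 1), (1, 2), (2, 1)]
print(p_complement)  # 1/12
print(p_at_least_4)  # 11/12
```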

---

## Practice: Spinner with 12 numbers

---

## Practice: M&M with a specific color

---

## Probability for Chance Experiment with Equally likely Outcomes

When outcomes in the sample space are equally likely,

- the **probability of the intersection** of two events is
  $$
  P(A\cap B)=\dfrac{\text{number of elements in } A\cap B}{\text{number of elements in the sample space }S}.
  $$

- the **probability of the union** of two events is
  $$
  P(A\cup B)=\dfrac{\text{number of elements in } A\cup B}{\text{number of elements in the sample space }S}.
  $$

---

## The Addition Rule

In general, the probability of the union of two events from a chance experiment is determined by the basic rules and the addition rule.

- **Addition Rule:** the probability of the union of two events `$A$` and `$B$` is
  $$
  P(A\cup B)=P(A)+P(B)-P(A\cap B).
  $$

---

## Example: Intersection, Union and Mutually Exclusive (1/2)

A card was randomly drawn from a deck of 52 cards.

1. What's the probability of getting a heart?
2. What's the probability of getting a face (king, queen, or jack)?
3. What's the probability of getting a heart face?
4. What's the probability of getting a heart or a face?
5. What's the probability of getting a card that is both a club and a spade?

.center[
  ![:resize Standard 52 card deck, 70%](data:image/png;base64,#Figures/Standard-52-card-from-Wikipedia.png)
<br>
]

.footmark[
  Image source: [Wikipedia: Standard 52-card deck](https://en.wikipedia.org/wiki/Standard_52-card_deck)
]
---

## Example: Intersection, Union and Mutually Exclusive (2/2)

**Solution:** The sample space `$S$` consists of 52 cards as shown in the above picture. Among the hearts, there are 3 face cards. Then
$$
P(\text{heart and face})=\frac{3}{52}.
$$

There are `$22$` cards that are hearts or faces. Then
$$
P(\text{heart or face})=\frac{22}{52}=\frac{11}{26}.
$$

Since there is no card which is club and spade, we have
$$
P(\text{club and spade})=0.
$$

**Note:** Since `$P(\text{heart})=\frac{13}{52}$` and `$P(\text{face})=\frac{12}{52}$`, the addition rule also gives
$$
P(\text{heart or face})=\frac{13}{52}+\frac{12}{52}-\frac{3}{52}=\frac{22}{52}=\frac{11}{26}.
$$
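
These counts can be reproduced by building the deck in code. The sketch below is illustrative only: the `(rank, suit)` encoding of a card and the helper names `prob`, `is_heart`, and `is_face` are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

def prob(event):
    """Probability of an event (a predicate on cards) for one random draw."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_heart(card):
    return card[1] == "hearts"

def is_face(card):
    return card[0] in {"J", "Q", "K"}

p_both = prob(lambda c: is_heart(c) and is_face(c))
p_either = prob(lambda c: is_heart(c) or is_face(c))

print(p_both)    # 3/52
print(p_either)  # 11/26
# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
print(prob(is_heart) + prob(is_face) - p_both == p_either)  # True
```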

---

## Practice: Addition rule

---

## The Conditional Probability

- The **conditional probability** of `$A$` given `$B$`, written as `$P(A\mid B)$`, is the probability that event `$A$` will occur given that the event `$B$` has already occurred.

- In the case that the chance experiment has equally likely outcomes, the conditional probability is
  $$
  P(A\mid B)=\dfrac{\text{number of elements in }A\cap B}{\text{number of elements in }B}.
  $$

- In general, we may use fundamental rules of probability and the multiplication rule to calculate the conditional probability.

---

## The Multiplication Rule

- **Multiplication Rule:** the probability of the intersection of two events `$A$` and `$B$` satisfies the following equality
  $$
  P(A\cap B)=P(B)P(A\mid B)=P(A)P(B\mid A).
  $$

- The multiplication rule gives a formula for conditional probability:
  $$
  P(B\mid A)=\dfrac{P(A\cap B)}{P(A)}\qquad\qquad P(A\mid B)=\dfrac{P(A\cap B)}{P(B)}.
  $$

---

## Independent Events

- Two events `$A$` and `$B$` are **independent** if
  $$
  P(A\mid B)=P(A)\quad \text{ or }
  \quad P(B)=P(B\mid A).
  $$
  Equivalently,
  $$
  P(A\cap B)=P(A)P(B).
  $$

- **Fundamental Counting Principle:** if there are `$m$` ways of doing something and `$n$` ways of doing another thing, then there are `$m\cdot n$` ways of performing both actions in order.

---

## Example: Conditional probability

A fair six-sided die is rolled.

Find the probability that the number rolled is a two, given that it is even.

**Solution:** Let `$A$` be the event of all possible even outcomes. Then
`$$A=\{2, 4, 6\}.$$`

Let `$B$` be the event consisting of the outcome 2. Then
`$$B=\{2\}.$$`

The intersection event `$A\cap B$` consists of the number `$2$`.

By the definition of conditional probability,
`$$P(B\mid A)=\dfrac{P(A\cap B)}{P(A)}=\dfrac{1/6}{3/6}=\dfrac{1}{3}.$$`

---

## Example: Fundamental counting principle and multiplication rule (1/2)

Consider flipping a fair coin and rolling a fair six-sided die together.

1. What's the probability that the coin shows a head?
2. Given that a head occurs, what's the probability that the die shows a number bigger than 4?
3. What's the probability of getting a head and a number bigger than 4?
4. Verify that flipping a head and rolling a number bigger than 4 are independent events.

**Solution:** By the fundamental counting principle, the sample space consists of `$2\times 6=12$` elements. Let `$H$` be the event of getting a head and `$D$` be the event of getting a number bigger than 4.

Then
$$
H = \{ H1, H2, H3, H4, H5, H6 \}
$$
$$
D\cap H=\{H5, H6\} .
$$

---

## Example: Fundamental counting principle and multiplication rule (2/2)

**Solution: (continued)**  
The probability of getting a head is `$P(H)=\frac12$`.

Given that a head shows, the chance of getting a number bigger than 4 is
`$$P(D\mid H)=\frac{2}{6}=\frac13.$$`

By the multiplication rule,
`$$P(H\cap D)=P(H)P(D\mid H)=\frac12\cdot\frac13=\frac16.$$`

Note that `$D=\{H5, T5, H6, T6\}$`. Then
`$$P(D)=\frac{4}{12}=\frac{1}{3}=P(D\mid H).$$`
Therefore, `$H$` and `$D$` are independent.
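
The independence check can also be carried out by enumerating the 12 outcomes. This is a minimal Python sketch; the tuple encoding such as `("H", 5)` and the helper `prob` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# 12 equally likely outcomes such as ("H", 5).
outcomes = list(product("HT", range(1, 7)))

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_h = prob(lambda o: o[0] == "H")
p_d = prob(lambda o: o[1] > 4)
p_h_and_d = prob(lambda o: o[0] == "H" and o[1] > 4)

p_d_given_h = p_h_and_d / p_h  # conditional probability P(D | H)
print(p_h, p_d, p_h_and_d, p_d_given_h)  # 1/2 1/3 1/6 1/3
print(p_h_and_d == p_h * p_d)            # True, so H and D are independent
```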

---

## Example: From multiplication rule to addition rule (1/2)

The probability that a student borrows a statistics book from the library is 0.3. The probability that a student borrows a biology book is 0.4. Given that a student borrowed a biology book, the probability that he/she borrows a statistics book is 0.6.

1. Find the probability that a student borrows a statistics book and a biology book.
2. Find the probability that a student borrows a statistics book or a biology book.

---

## Example: From multiplication rule to addition rule (2/2)

**Solution:** Denote by `$S$` the event that a student borrows a statistics book, and `$B$` the event that the student borrows a biology book.

From the given conditions, we know that `$P(S)=0.3$`, `$P(B)=0.4$` and `$P(S\mid B)=0.6$`.

By the multiplication rule, we know
$$
P(S\cap B)=P(S\mid B)P(B)=0.6\cdot 0.4=0.24.
$$

By the addition rule, we get
$$
`\begin{aligned}
  P(S\cup B)=&P(S)+P(B)-P(S\cap B)\\
  =&0.3+0.4-0.24=0.46.
\end{aligned}`
$$

---

## Sampling with Replacement or without Replacement

- **With replacement:** If each member of a population is replaced after it is picked, then that member has the possibility of being chosen more than once. When sampling is done with replacement, then events are considered to be independent, meaning the result of the first pick will not change the probabilities for the second pick.

- **Without replacement:** When sampling is done without replacement, each member of a population may be chosen only once. In this case, the probabilities for the second pick are affected by the result of the first pick. The events are considered to be dependent or not independent.

---

## Example: Drawing cards with replacement

Two cards were randomly drawn from a standard deck of 52 cards with replacement. Find the probability of getting exactly one club card.

**Solution:** There are two different pairs with exactly one club card: (club, not club), (not club, club).

When drawing with replacement, the events are considered to be independent. Therefore, the probabilities of those two situations are
`$$P(\text{(club, not club)})=P(\text{club})\cdot P(\text{not club})=\frac{13}{52}\cdot\frac{39}{52},$$`
`$$P(\text{(not club, club)})=P(\text{not club})\cdot P(\text{club})=\frac{39}{52}\cdot\frac{13}{52}.$$`
Then the probability of getting exactly one club is
`$$P(\text{exactly one club})=\frac{13}{52}\cdot\frac{39}{52}+\frac{39}{52}\cdot\frac{13}{52}=\frac{3}{8}.$$`
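
As a cross-check, one may enumerate every ordered pair of draws. The sketch below is illustrative and keeps only the information that matters here, namely whether a card is a club; the labels `"C"` and `"N"` are assumptions of this example.

```python
from fractions import Fraction
from itertools import product

# Only "club" vs "not club" matters: 13 clubs ("C") and 39 other cards ("N").
deck = ["C"] * 13 + ["N"] * 39

# With replacement, every ordered pair of cards (52 * 52 of them) is equally likely.
pairs = list(product(deck, repeat=2))
exactly_one_club = [pair for pair in pairs if pair.count("C") == 1]

p_exactly_one = Fraction(len(exactly_one_club), len(pairs))
print(p_exactly_one)  # 3/8
```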

---

## Example: Drawing cards without replacement

Two cards were randomly drawn from a standard deck of 52 cards without replacement, which means the first card will not be put back.

- Find the probability of getting two spades.
- Find the probability of getting exactly one spade.

**Solution:** Let `$S1$` be the event of getting a spade in the first drawing and `$S2\mid S1$` be the event of getting the second spade given the first card is a spade. The probability `$P(S1)=\frac{13}{52}=\frac14$`. The conditional probability is `$P(S2\mid S1)=\frac{12}{51}$`. Then the probability of getting two spades is
$$
P(S1 \text{ and } S2)=P(S1)P(S2\mid S1)=\frac{1}{4}\cdot\frac{12}{51}=\frac{3}{51}=\frac{1}{17}.
$$

Let `$NS1$` and `$NS2$` be the events of not getting a spade card in the first and second drawings, respectively. The probability of getting exactly one spade card is
$$
P(S1 \text{ and } NS2)+P(NS1\text{ and } S2)= \frac{13}{52}\cdot \frac{39}{51} + \frac{39}{52}\cdot\frac{13}{51}=\frac{39}{102}=\frac{13}{34}.
$$
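
Drawing without replacement can be modelled by ordered pairs of two different card positions. The following Python sketch is an illustration under that assumption; the labels `"S"` and `"N"` and the helper `prob` are not from the original slides.

```python
from fractions import Fraction
from itertools import permutations

# Only the suit matters here: 13 spades ("S") and 39 other cards ("N").
deck = ["S"] * 13 + ["N"] * 39

# All ordered draws of two different cards (52 * 51 = 2652 of them).
draws = list(permutations(range(52), 2))

def prob(event):
    """Probability of an event given as a predicate on (first, second) labels."""
    favorable = sum(1 for i, j in draws if event(deck[i], deck[j]))
    return Fraction(favorable, len(draws))

p_two_spades = prob(lambda a, b: a == "S" and b == "S")
p_one_spade = prob(lambda a, b: (a == "S") != (b == "S"))

print(p_two_spades)  # 1/17
print(p_one_spade)   # 13/34
```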

---

## Practice: Guess a password

---

## Practice: Conditional probability

A special deck of 16 cards has 4 that are blue, 4 yellow, 4 green, and 4 red. The four cards of each color are numbered from one to four. A single card is drawn at random. Find the following probabilities.

1. The probability that the card drawn is red.
2. The probability that the card is red, given that it is not green.
3. The probability that the card is red, given that it is neither red nor yellow.
4. The probability that the card is red, given that it is not a four.

.footmark[
Source: [https://saylordotorg.github.io/text_introductory-statistics/s07-03-conditional-probability-and-in.html](https://saylordotorg.github.io/text_introductory-statistics/s07-03-conditional-probability-and-in.html)
]

---

## Practice: Conditional probability subject to complement

---

## Practice: Pens drawn from a box without replacement

A box contains 10 pens, 6 black and 4 red. Two pens are drawn without replacement, which means that the first one is not put back.

- What is the probability that both pens are red?
- What is the probability that at most one pen is red?
- What is the probability that at least one pen is red?

---

## Practice: Classical question on basic rules of probability

.embedded[
<iframe src="https://www.myopenmath.com/embedq2.php?id=162779&seed=2020&showansafter" height="550px">
</iframe>
]

---
class: center middle topic

# Discrete Random Variables

---

## Learning Goals for Discrete Random Variables

- Demonstrate understanding of random variables.

- Demonstrate understanding of characteristics of binomial distributions.

- Calculate accurate probabilities of discrete random variables and interpret them in a variety of settings.

---

## Random Variables

- A **random variable**, usually written `$X$`, is a variable whose values are numerical quantities determined by the possible outcomes of a random experiment.

- A **discrete random variable** takes on only a finite or countable number of distinct values. For example,

  - When rolling a fair die, the number of dots on the top face is a discrete random variable that takes on the possible values 1, 2, 3, 4, 5, 6.
  - When flipping a fair coin 10 times, the number of heads is a discrete random variable that takes on the possible values 0, 1, 2, ..., 10.

- A **continuous random variable** takes on values which form an interval of numbers. For example,

  - The height of a randomly selected 10-year-old boy in the US is normally between 129 cm and 157 cm. So the height is a continuous random variable.
  - The voltage measured at a randomly selected electrical outlet is normally between 118 and 122 volts. So the measured voltage is a continuous random variable.

---

## Practice: Discrete or continuous

Classify each random variable as either discrete or continuous.

1. The number of boys in a randomly selected three-child family.

2. The temperature of a cup of coffee served at a restaurant.

3. The number of math majors in randomly selected group of 10 students.

4. The amount of rain recorded in a small town one day.

---

## Practice: Possible values of the variable

Identify the set of possible values for each random variable. (Make a reasonable estimate based on experience, where necessary.)

1. The sum of numbers on the top of two fair dice.

2. The waiting time of a randomly selected customer at a restaurant.

---

## Probability Distributions

- The **probability distribution** of a discrete random variable `$X$` is defined by the probability associated with each possible value of `$X$`.

- A probability distribution of a discrete random variable is usually  characterized by a table of all possible values `$x$` together with probabilities `$P(x)$`, or a probability histogram, or a formula.

- A random variable `$X$` (discrete and continuous) always has a **cumulative distribution function**: `$F_X(x)=P(X\leq x)$` (= `$\sum_{x_i\leq x} P(x_i)$` if `$X$` is discrete).

---

## Basic Properties of Probability Distributions

- Recall basic rules of probability:

  - `$0\leq P(X=x)\leq 1$`.

  - The sum of all the probabilities is 1, that is, `$P(X\leq x_{max})=1$`.

  - In particular, `$0\leq F_X(x)\leq 1$`.

- The probability distribution can be recovered from its cumulative distribution function. Indeed, for a discrete random variable `$X$`, we have
  `$$P(X=x_i)=P(X\le x_i)-P(X\le x_{i-1}),$$`
  where `$P(X\le x_i)=\sum_{k=1}^iP(X=x_k)$`.

---

## Example: Probability distribution of flipping two fair coins

Let `$X$` be the number of heads that are observed when tossing two fair coins.

1. Construct the probability distribution for `$X$`.
2. Find `$P(X\le 1)$` and `$P(X\le 2)$`.

**Solution:** The possible values of the number of heads are `$0$`, `$1$` and `$2$`. The probability distribution can be characterized by the following table:
.center[
| `$X$`   | 0   | 1   | 2   |
|-----|-----|-----|-----|
| `$P(X)$` |0.25 | 0.5 | 0.25|
]

From the table, we may find the following cumulative distributions:
`$$P(X\leq 1)=P(X=0)+P(X=1)=0.25+0.5=0.75.$$`
`$$P(X\leq 2)= P(X=0)+P(X=1)+P(X=2)=0.25+0.5+0.25=1.$$`

---

## Example: Probability Histogram of an Unfair Coin

The probability distribution of the number of heads `$X$` for an unfair coin is characterized by the following histogram. Find the probability of getting at most 1 head.  
.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-38-5.png" width="360" />
]

**Solution:** Let `$X$` be the number of heads. From the probability histogram, we know that `$P(X=0)=0.36$`, and `$P(X=1)=0.47$`.

Then the probability of getting at most 1 head is
$$
P(X\le 1)=P(X=0)+P(X=1)=0.36+0.47=0.83.
$$

---

## Practice: Probability distribution of rolling a pair of fair dice

A pair of fair dice is rolled. Let `$X$` denote the sum of the number of dots on the top faces.

1. Construct the probability distribution of `$X$`.
2. Find the probability that `$X$` takes an odd value.

---

## Practice: Probability distribution from a histogram

---

## Mean and Standard Deviation of a Discrete Random Variable

- The **expected value** `$E(X)$` (also called **mean** and denoted by `$\mu$`) of a discrete random variable `$X$` is the number `$$\mu=E(X)=\sum xP(x).$$`

- The **variance** `$Var(X)$` (also denoted by `$\sigma^2$`) of a discrete random variable `$X$` is the number `$$\sigma^2=Var(X)=\sum (x-\mu)^2P(x).$$`

- The **standard deviation** `$\sigma$` of a discrete random variable `$X$` is the square root of its variance: `$$\sigma=\sqrt{\sum (x-\mu)^2P(x)}.$$`

---

## Some Properties of Expected Value (Optional)

- The expected value of a linear combination of two random variables `$X$` and `$Y$` is the linear combination of their expected values, that is `$$E(aX+bY)=aE(X)+bE(Y).$$`

- The expected value is in general not multiplicative, that is, `$E(XY)$` need not equal `$E(X)E(Y)$`.

- If the two random variables `$X$` and `$Y$` are independent, then `$$E(XY)=E(X)E(Y).$$`

- The variance can be computed using expected values:
  `$$\mathrm{Var}(X)=E(X^{2})-(E(X))^{2}.$$`

- Let `$X$` and `$Y$` be two *independent* random variables. Then the variance of a linear combination `$aX+bY$` equals
  `$$Var(aX+bY)=a^2Var(X)+b^2Var(Y).$$`

---

## Example: Expected Gain

One thousand raffle tickets are sold for $2 each. Each has an equal chance of winning. First prize is $500, second prize is $300, and third prize is $100. Find the expected value of gain, and interpret its meaning.

**Solution:** Let `$X$` denote the net gain from purchasing one ticket. The probability distribution for `$X$` is  
.center[
| `$X$`    | 498    | 298    | 98     | -2       |
|------|--------|--------|--------|----------|
| `$P(X)$` | `$\frac{1}{1000}$` | `$\frac{1}{1000}$` | `$\frac{1}{1000}$` | `$\frac{997}{1000}$` |
]

The expected gain is `$$E(X)= 498\cdot \frac{1}{1000}+ 298\cdot\frac{1}{1000}+98\cdot \frac{1}{1000}+(-2)\cdot \frac{997}{1000}=-1.1,$$` which means that, on average, the buyer of one ticket may expect a loss of $1.10.
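
The weighted sum can be reproduced with exact fractions. This is a minimal Python sketch; the dictionary `tickets`, mapping each net gain to the number of tickets with that gain, is an illustrative encoding of the table above.

```python
from fractions import Fraction

# Net gain for each ticket type and the number of such tickets.
tickets = {498: 1, 298: 1, 98: 1, -2: 997}
total = sum(tickets.values())  # 1000 tickets

# E(X) = sum of (gain * probability of that gain).
expected_gain = sum(Fraction(gain * count, total) for gain, count in tickets.items())
print(float(expected_gain))  # -1.1
```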
  
---

## Example: Waiting Time

The wait times (rounded to multiples of 5 minutes) in the cafeteria at a community college have the following probability distribution. Find the expected waiting time and the standard deviation.  
.center[
| `$X$` (minutes) | 5        | 10       | 15      | 20       | 25       |
| -------------------- | -------- | -------- | -------- | -------- | -------- |
| `$P(X)$`             | 0.13 | 0.25 | 0.31 | 0.21 | 0.1 |
]

**Solution:** The expected waiting time is
`$$\mu= 5\cdot 0.13 +10\cdot 0.25+15\cdot 0.31+20\cdot 0.21 + 25\cdot 0.1 = 14.5.$$`
The standard deviation is then
`$$\scriptstyle \sigma=\sqrt{(5-14.5)^2\cdot 0.13 +(10-14.5)^2\cdot 0.25+(15-14.5)^2\cdot 0.31+(20-14.5)^2\cdot 0.21 + (25-14.5)^2\cdot 0.1}\approx 5.9.$$`
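
The two formulas translate directly into a short function. The sketch below is illustrative; the function name `mean_and_sd` and the tolerance used to check that the probabilities sum to 1 are assumptions of this example.

```python
import math

def mean_and_sd(values, probs):
    """Return (mu, sigma) of a discrete distribution given values and probabilities."""
    if abs(sum(probs) - 1) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    mu = sum(x * p for x, p in zip(values, probs))
    variance = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
    return mu, math.sqrt(variance)

wait_times = [5, 10, 15, 20, 25]
probabilities = [0.13, 0.25, 0.31, 0.21, 0.10]

mu, sigma = mean_and_sd(wait_times, probabilities)
print(round(mu, 2), round(sigma, 2))  # 14.5 5.85
```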

---

## Example: Unfair Die (1/2)

The probability distribution of an unfair die is given in the following table.

| `$X$` | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| `$P(X)$` | 0.18 | 0.12 | `$\,?\,$` | 0.14 | 0.23 | 0.17 |

1. Find `$P(X=3)$`.
2. Find the mean, variance and standard deviation of this probability distribution.

**Solution:** Since the sum of probabilities must be 1, we know that
$$
P(X=3)=1-(0.18+0.12+0.14+0.23+0.17)=0.16.
$$

---

## Example: Unfair Die (2/2)

**Solution: (Continued)** The mean is the weighted sum
$$
\scriptstyle
\mu=0.18\cdot 1+0.12\cdot 2+0.16\cdot 3+0.14\cdot 4+0.23\cdot 5+0.17\cdot 6=3.63.
$$

The variance is
$$
\scriptstyle
\sigma^2=0.18(1-3.63)^2+0.12(2-3.63)^2+0.16(3-3.63)^2+0.14(4-3.63)^2+0.23(5-3.63)^2+0.17(6-3.63)^2=3.0331.
$$

Therefore, the standard deviation is
$$
\sigma=\sqrt{\sigma^2}=\sqrt{3.0331}\approx 1.7416.
$$
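
The computation can be double-checked numerically, and compared with the shortcut `$\mathrm{Var}(X)=E(X^2)-(E(X))^2$`. The following Python sketch is illustrative; the variable names are chosen for this example.

```python
import math

values = [1, 2, 3, 4, 5, 6]
probs = [0.18, 0.12, 0.16, 0.14, 0.23, 0.17]

mu = sum(x * p for x, p in zip(values, probs))
# Variance from the definition and from the shortcut E(X^2) - mu^2.
var_definition = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
var_shortcut = sum(x**2 * p for x, p in zip(values, probs)) - mu**2

print(round(mu, 2), round(var_definition, 4))      # 3.63 3.0331
print(math.isclose(var_definition, var_shortcut))  # True
print(round(math.sqrt(var_definition), 4))         # 1.7416
```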

---

## Practice: Lottery tickets

Seven thousand lottery tickets are sold for $5 each. One ticket will win $2,000, two tickets will win $750 each, and five tickets will win $100 each. Let `$X$` denote the net gain from the purchase of a randomly selected ticket.

1. Construct the probability distribution of `$X$`.
2. Compute the expected value `$E(X)$` of `$X$`. Interpret its meaning.
3. Compute the standard deviation `$\sigma$` of `$X$`.

.footmark[
Source: [https://saylordotorg.github.io/text_introductory-statistics/s08-02-probability-distributions-for-.html](https://saylordotorg.github.io/text_introductory-statistics/s08-02-probability-distributions-for-.html)
]

---

## Binomial Distribution

- A **binomial experiment** is a probability experiment satisfying:

  1. The experiment has a fixed number `$n$` of independent trials.
  2. Each trial has only two possible outcomes: a success (S) or a failure (F).
  3. The probability `$p$` of a success is the same for each trial.

- The discrete random variable `$X$` counting the number of successes in the `$n$` trials is the **binomial random variable**. We say `$X$` has a **binomial distribution** with parameters `$n$` and `$p$` and write it as `$X\sim B(n, p)$`.

- For `$X\sim B(n, p)$`, the **probability of getting exactly `$x$` successes in `$n$` trials** is `$$P(X=x)=B(x,n,p)={_n C_x} p^x(1-p)^{n-x}=\frac{n!}{(n-x)!x!}p^x(1-p)^{n-x}.$$`

- The notation `$n!=n(n-1)\cdots 1$` is read as `$n$` factorial. We set `$0!=1.$`

- The notation `${_n C_x}=\frac{n!}{(n-x)!x!}$` is read as `$n$` choose `$x$`, which is the number of ways to choose `$x$` objects from a set of `$n$` objects.

---

## Example: Probability from drawing card multiple times (1/2)

A card is selected from a standard deck and replaced. This experiment is repeated a total of `$5$` times.

- Find the probability of selecting exactly `$3$` clubs.
- Find the probability of getting at least `$3$` clubs.

**Solution:** This is a binomial experiment. The total number of trials is `$n=5$`. The number of successes is `$3$`. The chance of a success is `$p=\frac{13}{52}=\frac14$`. Applying the binomial probability formula, we have
`$$P(X=3)=\frac{5!}{3!2!} \left(\frac{1}{4}\right)^3\left(\frac34\right)^2=10\cdot\frac{9}{4^5}\approx 0.088.$$`
The probability `$P(X=3)$` can also be found from the binomial distribution table or by using the Excel function `BINOM.DIST(3,5,1/4,FALSE)`.

The probability of getting at least `$3$` clubs is
`$$P(X\geq 3) =1-P(X\leq 2)=1-(P(0)+P(1)+P(2))\approx 1-0.8965=0.1035.$$`
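
For readers without Excel or a table at hand, the binomial formula can be evaluated with Python's standard library. This is a minimal sketch; the function names `binom_pmf` and `binom_cdf` are illustrative and are not library functions.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

n, p = 5, 13 / 52
print(round(binom_pmf(3, n, p), 4))      # 0.0879
print(round(1 - binom_cdf(2, n, p), 4))  # 0.1035
```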

---

## Example: Probability from drawing card multiple times (2/2)
  
**Solution: (continued)**
To calculate `$P(X\leq 2)$`, we may also use the binomial distribution table or the Excel function `BINOM.DIST()`.

**Method 1:** As `$n=5$` and `$p=0.25$`, we use the following portion of the binomial probability table.
.center[Binomial Probability Table *n=5*

| n    | x    | 0.1    | 0.15   | 0.2    | 0.25   | 0.3    | 0.35   | 0.4    |
| ---- | ---- | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 5    | 0    | 0.5905 | 0.4437 | 0.3277 | .red[0.2373] | 0.1681 | 0.116  | 0.0778 |
| 5    | 1    | 0.3281 | 0.3915 | 0.4096 | .red[0.3955] | 0.3602 | 0.3124 | 0.2592 |
| 5    | 2    | 0.0729 | 0.1382 | 0.2048 | .red[0.2637] | 0.3087 | 0.3364 | 0.3456 |
| 5    | 3    | 0.0081 | 0.0244 | 0.0512 | 0.0879 | 0.1323 | 0.1811 | 0.2304 |
]

`$P(X\le 2) \approx 0.2373+0.3955+0.2637= 0.8965.$`

**Method 2:** In Excel, `$P(X\le 2)$` `=BINOM.DIST(2,5,0.25,TRUE)` `$\approx 0.8965$`.

???
In a calculator, use `binompdf(n, p, x)` for `$P(X=x)$` and `binomcdf(n, p, x)` for `$P(X\leq x)$`.

---

## Practice: Find probabilities from a binomial distribution

Let `$X$` be a binomial random variable with parameters `$n = 5$`, `$p=0.2$`. Find the probabilities
`$\quad \text{1.}\,\, P(X=3) \qquad \text{2.}\,\, P(X<3)\qquad \text{3.}\,\, P(X>3).$`

---

## Practice: Machine defect rate

---

## Mean and Standard Deviation of Binomial Distribution

- The mean of a binomial distribution of `$n$` trials is `$$\mu =\sum xP(x)=\sum x\cdot \dfrac{n!}{(n-x)!x!}p^x(1-p)^{n-x} = np.$$`

- The variance of a binomial distribution of `$n$` trials is `$$\sigma^2 =\sum (x-np)^2P(x)=\sum x^2P(x)-(np)^2=np(1-p).$$`

- The standard deviation of a binomial distribution of `$n$` trials is `$$\sigma=\sqrt{np(1-p)}.$$`

- We consider an event `$E$` **unusual** if the probability `$P(E)\leq 5\%$`.

---

## Example: Expected value and SD of cracked eggs

The probability that an egg in a retail package is cracked or broken is 0.02.

1. Find the average number of cracked or broken eggs in a one dozen carton.
2. Find the standard deviation.
3. Is getting at least two broken eggs unusual?

**Solution:** Since there are 12 eggs and the chance of getting a cracked egg is 0.02, the average number of cracked eggs is
`$$\mu =np=12\cdot 0.02=0.24.$$`

The standard deviation is
`$$\sigma=\sqrt{12\cdot 0.02\cdot(1-0.02)}\approx 0.4850.$$`

Recall the empirical rule: about 95% of the data lie within 2 standard deviations of the mean. Since `$2>0.24+2\cdot 0.4850=1.21$`, the chance of getting at least two cracked eggs is less than 5%, which is considered unusual.
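
The conclusion can also be checked against the exact binomial probability, assuming the eggs crack independently so that the count follows `$B(12, 0.02)$`. The sketch below is illustrative; the variable names are chosen for this example.

```python
from math import comb

n, p = 12, 0.02

# P(X >= 2) = 1 - P(X = 0) - P(X = 1) for X ~ B(12, 0.02).
p_zero = (1 - p) ** n
p_one = comb(n, 1) * p * (1 - p) ** (n - 1)
p_at_least_two = 1 - p_zero - p_one

print(round(p_at_least_two, 4))
print(p_at_least_two <= 0.05)  # True
```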

---

## Practice: Quality of grapefruit

Adverse growing conditions have caused 5% of grapefruit grown in a certain region to be of inferior quality. Grapefruit are sold by the dozen.

1. Find the average number of inferior quality grapefruit per box of a dozen.

2. A box that contains two or more grapefruit of inferior quality will cause a strong adverse customer reaction. Find the probability that a box of one dozen grapefruit will contain two or more grapefruit of inferior quality.

---

## Practice: Mean and SD of a binomial distribution

---

## Excel Functions for Binomial

Let `$X$` be a binomial random variable with parameters `$n$` and `$p$`, that is `$X\sim B(n, p)$`. In Excel, `$P(X=x)$` is given by `BINOM.DIST(x, n, p, FALSE)` and `$P(X\le x)$` is given by `BINOM.DIST(x, n, p, TRUE)`. You may click the insert function button `$f_x$` and then search `binom` to find the function.

---

## Practice: Find the mean of a discrete probability distribution

---

## Practice: Find the standard deviation of a discrete probability distribution

---

## Practice: The number of sales of new employees

.pull-left[
A company tracks the number of sales new employees make each day during a 100-day probationary period. The results for one new employee are shown at the right.

1. Find the probability of each outcome.
2. Construct a probability distribution table.
3. Find the mean of the probability distribution.
4. Find the variance and standard deviation.
]<br>
.pull-right[

| Sales per day `$x$` | Number of days `$f$` |
|-------------------|--------------------|
| 0                 | 16                 |
| 1                 | 19                 |
| 2                 | 15                 |
| 3                 | 21                 |
| 4                 | 9                  |
| 5                 | 10                 |
| 6                 | 8                  |
| 7                 | 2                  |
]

---

## Practice: Probability within one SD of a discrete probability distribution

---

## Practice: Probability from a poll

---

## Practice: Chance of continuous successes of a type of surgery

A type of surgery has a 90% chance of success. The surgery is performed on three patients. Find the probability of the surgery being successful on exactly two patients.

---
class: center middle topic

# Continuous Random Variables

---

## Learning Goals for Probability and Probability Distribution

- Demonstrate understanding of characteristics of normal distributions.

- Calculate accurate probabilities of continuous random variables and interpret them in a variety of settings.

- Calculate the standardized value (or `$z$`-score).

---

## Probability Distribution of a Continuous Random Variable

- The probability distribution of a continuous random variable `$X$` is characterized by its **probability density function** `$f(X)$`: the probability `$P(a\leq X\leq b)$` equals the area above the interval `$[a, b]$` and under the graph of `$f(X)$`, which is also called a **density curve**.

.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-39-8.png" width="360" />
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-40-1.png" width="360" />
]

---

## Properties of Probability Distribution of a Continuous Random Variable

- The probability density function `$f$` is nonnegative, that is `$f(X)\ge 0$`.

- The total area under a density curve is 1.

- The cumulative probability `$P(X\le b)$` of a random variable `$X$` equals the area under the density curve to the left side of `$b$`.

- By the addition rule of probability, we have
  - `$$P(a\le X\le b)=P(X\le b)-P(X\le a)$$`
  - `$$P(X\ge b)=1-P(X\le b)$$`

- As a line segment has no area, we have `$P(X\le a)=P(X< a)$` as well as `$P(X\ge b)=P(X>b)$`.

---

## Example: A Uniform Distribution

Let `$X$` be the amount of time that a commuter must wait for a train. Suppose `$X$` has a probability density function
$$
f(X)=
`\begin{cases}
  0.1, & 0\leq X\leq 10\\
  0,   & \text{otherwise}
\end{cases}`
$$

What is the probability that the commuter's waiting time is less than 4 minutes?

**Solution:** The probability `$P(X\leq 4)$` is the area under the density curve to the left of `$X=4$`. Since `$f(X)=0$` for `$X<0$`, this is the area of the rectangle with width 4 and height 0.1. So the probability is `$P(X\leq 4)=0.1\cdot 4=0.4$`.

.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-41-1.png" width="360" />
]
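The rectangle-area computation can also be checked numerically. The following is a minimal Python sketch; the helper name `uniform_cdf` is hypothetical and not part of the course materials.

```python
def uniform_cdf(x, low=0.0, high=10.0):
    """P(X <= x) for a uniform density of height 1/(high - low) on [low, high]."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    # Area of the rectangle with width (x - low) and height 1/(high - low).
    return (x - low) / (high - low)

# Probability that the waiting time is less than 4 minutes.
p_less_than_4 = uniform_cdf(4)
print(p_less_than_4)
```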

---

## Normal Distribution

- A **normal distribution** has a **density function** `$f(x)=\frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}},$`
where `$\mu$` is the mean, `$\sigma$` is the standard deviation, `$\pi\approx 3.14159$` and `$e\approx 2.71828$`. The graph of `$f$` is called a **normal curve**.

- We write `$X\sim \mathcal{N}(\mu, \sigma^2)$` for a normal random variable `$X$` with the mean `$\mu$` and the standard deviation `$\sigma$`.

- A normal distribution has the following properties:
  - *The mean, median, and mode are equal*.
  - The normal curve is *bell shaped and __symmetric__* with respect to the mean.
  - The *total area* under the curve and above the `$x$`-axis is `$1$`.
  - The normal curve *approaches, but never touches, the `$x$`-axis* as `$x$` goes to `$\pm\infty$`.
  - Between `$\mu-\sigma$` and `$\mu+\sigma$`, the graph *curves downward*. On the left side of `$\mu-\sigma$` or the right side of `$\mu+\sigma$`, the graph *curves upward*.  A point at which the curve changes the direction of curving is called an **inflection point**.

---

## Normal Curves with Different Means and Standard Deviations

.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-42-1.png" width="504" />
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-43-1.png" width="504" />

]

---

## The Empirical Rule for Normal Distributions

For any normal distribution, the proportions of data values within 1, 2, and 3 standard deviations of the mean are approximately 68.3%, 95.4%, and 99.7%, respectively.
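These percentages can be reproduced from the standard normal cumulative probabilities. The sketch below assumes Python's standard-library `statistics.NormalDist` as a stand-in for the Excel functions used in this course.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)

# P(-k < Z < k) for k = 1, 2, 3 standard deviations.
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} SD of the mean: {p:.3f}")
```

Because standardization preserves probability, the same proportions hold for every normal distribution, not only the standard one.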


---

## Example: Foot length (1/2)

Suppose that foot length of a randomly chosen adult male is a normal random variable with the mean `$\mu=11$` and the standard deviation `$\sigma=1.5$`.

- How likely is a male's foot length to be smaller than 9.5 inches?
- How likely is a male's foot length to be bigger than 8 inches?

**Solution:** Let's first sketch the normal curve.

.center[
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-45-1.png" width="432" />
<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-46-1.png" width="432" />
]

---

## Example: Foot length (2/2)

**Solution: (Continued)**

Note that `$9.5=11-1.5=\mu-\sigma$`, so the probability `$P(X<9.5)$` is the shaded area on the left. The probability of getting a foot length within 1 standard deviation of the mean is about 0.683, so by the symmetry of the normal curve,
`$$\scriptstyle P(X<9.5)=\frac12(1-P(9.5<X<12.5))\approx\frac12(1-0.683)=0.1585.$$`

Note that `$8=11-2\cdot 1.5=\mu-2\sigma$`. The probability of getting a foot length within 2 standard deviations of the mean is about 0.954, so
`$$\scriptstyle P(X>8)=1-P(X<8)=1-\frac12(1- P(8<X<14))\approx 1-\frac12(1-0.954)=0.977.$$`

---

## Standard Normal Distribution

- A normal distribution is called a **standard normal distribution** if the mean is `$\mu=0$` and the standard deviation is `$\sigma=1$`.

- A normal random variable can be **standardized** by the following formula `$z=\frac{x-\mu}{\sigma}.$` We call the value `$z$` the `$Z$`-**score** of `$x$`. In Excel, the `$Z$`-score of `$x$` can be calculated using the function `STANDARDIZE(x, mean, sd)`.
  
- Standardization preserves probability:
  `$$P(a<X<b)=P\left(\frac{a-\mu}{\sigma}< Z < \frac{b-\mu}{\sigma}\right).$$`

- The probability `$P(Z< z)$` of a standard normal random variable `$Z$` can be found using the Excel function `NORM.S.DIST(z, TRUE)` or the [standard normal distribution table](https://yfei.page/teaching/statistics/normal-tables.html).

- The probability `$P(X< x)$` of a normal random variable `$X$` can be calculated using the Excel function `NORM.DIST(x, mean, sd, TRUE)`.

---

## Example: Find the Standard Score

Let `$X$` be a normal random variable with the mean `$\mu = 8$` and the standard deviation `$\sigma=2$`.

1. Find the `$Z$`-score for the value `$X=13$`.
2. Find the `$X$`-value for the `$Z$`-score `$z=-0.6$`.

**Solution:** The `$z$`-score for the value `$X=13$` is
`$$z=\dfrac{x-\mu}{\sigma}=\dfrac{13-8}{2}=\dfrac{5}{2}=2.5.$$`

The `$X$`-value for the `$Z$`-score `$z=-0.6$` is
`$$x=z\cdot\sigma+\mu=-0.6\cdot 2+8=-1.2+8=6.8.$$`
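The two conversions are inverse operations, which a short Python sketch makes explicit. The function names `standardize` and `unstandardize` are hypothetical helpers chosen for this illustration.

```python
def standardize(x, mean, sd):
    """Convert a raw value x to its z-score."""
    return (x - mean) / sd

def unstandardize(z, mean, sd):
    """Convert a z-score back to a raw value."""
    return z * sd + mean

mean, sd = 8, 2
z = standardize(13, mean, sd)        # z-score of X = 13
x = unstandardize(-0.6, mean, sd)    # X-value with z-score -0.6
print(z, round(x, 2))
```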

---

## Example: Probability of a Standard Normal Random Variable (1/2)

Let `$Z$` be a standard normal random variable.

.pull-left[
1. Find `$P(Z<1.21)$`.
2. Find `$P(Z\geq 1.21)$`.
3. Find `$P(0<Z\leq 1.21)$`.
]
.pull-right[
| Z   | 0      | 0.01   | 0.02   |
|-----|--------|--------|--------|
| 1.2 | 0.8849 | .red[0.8869] | 0.8888 |
| 1.3 | 0.9032 | 0.9049 | 0.9066 |
]

**Solution:** To find the probability, we may use the standard normal distribution table, or the Excel function `NORM.S.DIST(z,TRUE)`.

1. From the table, we see that `$P(Z<1.21)\approx 0.8869$`.
2. Since the total area under the normal curve is 1, we get
  `$$P(Z\geq 1.21)\approx 1-0.8869=0.1131.$$`
3. By the symmetry, `$P(Z<0)=0.5$`. Then the probability
  `$$P(0<Z<1.21)\approx 0.8869-0.5=0.3869.$$`
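As a cross-check on the table lookups, the three probabilities can be computed directly. This is a sketch using Python's `statistics.NormalDist`, assumed here as a substitute for `NORM.S.DIST(z, TRUE)`.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_less = Z.cdf(1.21)                 # P(Z < 1.21)
p_at_least = 1 - Z.cdf(1.21)         # P(Z >= 1.21), by the complement rule
p_between = Z.cdf(1.21) - Z.cdf(0)   # P(0 < Z <= 1.21)

print(round(p_less, 4), round(p_at_least, 4), round(p_between, 4))
```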

---

## Example: Heights of 25-year-old women

The heights of 25-year-old women in a certain region are approximately normally distributed with mean 62 inches and standard deviation 4 inches. Find the probability that a randomly selected 25-year-old woman is more than 67 inches tall.

**Solution:** Let's first sketch the normal curve.
.pull-left[

<img src="data:image/png;base64,#MA336-Slides_files/figure-html/unnamed-chunk-47-1.png" width="432" />
]
.pull-right[
| Z    | 0.04   | 0.05   | 0.06   |
| ---- | ------ | ------ | ------ |
| 1.2  | 0.8925 | .red[0.8944] | 0.8962 |
]

The probability is `$P(X>67)=1-P(X<67)$`. To calculate `$P(X<67)$`, one way is to use the standard normal distribution table. First, find the `$Z$`-score: `$z=\frac{67-62}{4}=1.25$`. Then `$P(Z<1.25)\approx 0.8944$`.

Another way is to use the Excel function `NORM.DIST(67, 62, 4, TRUE)`.

Then `$P(X>67)\approx 1-0.8944=0.1056$`.

---

## Cutoff Value for a Given Tail Area

- The `$k$`-th percentile for a random variable `$X$` is the value `$x_k$` that cuts off a left tail with the area `$k/100$`, that is `$P(X<x_k)=\frac{k}{100}$`, where `$0\leq k\leq 100$`.

- Let `$c$` be a nonnegative number less than or equal to 1. The `$(100c)$`-th percentile for the standard normal distribution is usually denoted as `$-z_c$`, that is `$P(Z<-z_c)=c$`. By symmetry, `$z_c$` is the value such that `$P(Z> z_c)=c$`, that is `$P(Z<z_c)=1-c$`.

- For a normal random variable `$X$` with the mean `$\mu$` and standard deviation `$\sigma$`, the cutoff value `$x^*$` with a **tail area** `$c$` can be calculated using the standardization formula, that is,
  `$$x^*=z^*\cdot \sigma+\mu,$$`
  where `$z^*$` is the cutoff `$z$`-score with the tail area `$c$`, that is `$z^*=-z_c$` given that `$c$` is the left-tail area and `$z^*=z_c$` given that `$c$` is the right tail area.

---

## Example: Cutoff Value for a Normal Random Variable

Let `$X$` be the normal random variable with mean `$6$` and standard deviation `$3$`. Suppose the value `$x^*$` cuts off a left-tail area `$0.05$`. Find the value `$x^*$`.

**Solution:** One way to find the value `$x^*$` is to use the Excel function `NORM.INV(0.05, 6,3)`:
`$$x^* \approx 1.065.$$`

Another way is to use the standardization formula. Using the standard normal distribution table or the Excel function `NORM.S.INV(0.05)`, we find that `$-z_{0.05}=-1.645$`. Then
`$$x^*=-z_{0.05}\cdot 3+6=1.065.$$`

| z    | − 0.04 | − 0.05  |
|------|--------|---------|
| -1.6 | 0.0505 | 0.04947 |

.footmark[
**Note:** If the value `$c$` falls between two cells in the standard normal distribution table, we take `$z^*$` to be the average of the two `$z$`-scores associated with the values in those two cells.
]
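Both routes to the cutoff value can be compared in a few lines. The sketch below assumes Python's `statistics.NormalDist.inv_cdf` as an analogue of the Excel functions `NORM.INV` and `NORM.S.INV`.

```python
from statistics import NormalDist

mu, sigma, left_tail = 6, 3, 0.05

# Route 1: invert the cumulative probability of X directly.
x_direct = NormalDist(mu, sigma).inv_cdf(left_tail)

# Route 2: find the cutoff z-score first, then apply x* = z* * sigma + mu.
z_star = NormalDist().inv_cdf(left_tail)
x_via_z = z_star * sigma + mu

print(round(z_star, 3), round(x_direct, 3), round(x_via_z, 3))
```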

---

## Example: Math Course Placement

Scores on a standardized college placement examination are normally distributed with mean 60 and standard deviation 13. Students whose scores are in the top 5% will be placed in a Calculus II course. Find the minimum score needed to be placed in a Calculus II course.

**Solution:** Let `$x^*$` be the minimum score. From the question, we know that `$P(X\geq x^*)=0.05$`. Equivalently, `$P(X<x^*)=1-0.05=0.95$`.

Using the function `NORM.INV(0.95, 60, 13)`, we find the `$x^*$` score is
`$$x^*=81.38.$$`

Another way is to find `$z_{0.05}$` first and then use the standardization formula. Using the standard normal distribution table or the Excel function `NORM.S.INV(0.95)`, we find that the `$z$`-score is `$z^*=z_{0.05}\approx 1.645$`. Then
`$$x^*=z^*\sigma+\mu\approx 1.645\cdot 13+60=81.385.$$`

So the minimum score needed is `$82$`.

---

## Practice: Dish washing time

---

## Practice: Find probabilities of a normal random variable

1. Let `$Z$` be a standard normal random variable. Find the probabilities:
  `$$\text{1.}\,\, P(Z<1.58)\quad \text{2.}\,\,  P(-0.6<Z<1.67)\quad \text{3.}\,\, P(Z>0.19).$$`

2. Let `$X$` be a normal random variable with `$\mu=5$` and `$\sigma=2$`. Find the probabilities:
  `$$\text{1.}\,\,  P(-2<X<8)\quad \text{2.}\,\, P(X>-1) \quad \text{3.}\,\, P(X<4).$$`

---

## Practice: Fruit weight

---

## Practice: Shortest lifespan

---

## Practice: Battery life

---

## Practice: Sum of probabilities of two normal random variables

Let `$Z$` be a normal random variable with `$\mu=0$` and `$\sigma=1$`. Let `$X$` be a normal random variable with `$\mu=4.3$` and `$\sigma=1.7$`.

Determine the value of `$P(Z>1) + P(X<6)$` and explain how you found it.

---

## Practice: Area under a normal curve

---

## Practice: Find cutoff `$Z$`-scores

---

## Practice: Numbers of Chocolate Chips in Acceptable Cookies

---

## Practice: Blood Pressure

---

## Lab: Excel Functions for Normal Distributions

- Let `$Z$` be a standard normal random variable. In Excel, `$P(Z<z)$` is given by `NORM.S.DIST(z, TRUE)`.

- Let `$X$` be a normal random variable with mean `$\mu$` and standard deviation `$\sigma$`, that is `$X\sim \mathcal{N}(\mu, \sigma^2)$`. In Excel, `$P(X<x)$` is given by `NORM.DIST(x, mean, sd, TRUE)`.

- When a cumulative probability `$p=P(X<x)$` of a normal random variable `$X$` is given, we can find `$x$` using `NORM.INV(p, mean, sd)`.

- When a cumulative probability `$p=P(Z<z)$` of a standard normal random variable `$Z$` is given, we can find `$z$` using `NORM.S.INV(p)`.
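For readers working outside Excel, the four functions above have close analogues in Python's standard library. The wrapper names below (`norm_s_dist`, `norm_dist`, `norm_inv`, `norm_s_inv`) are hypothetical and only mirror the Excel names for illustration.

```python
from statistics import NormalDist

def norm_s_dist(z):
    """Analogue of NORM.S.DIST(z, TRUE): P(Z < z) for the standard normal."""
    return NormalDist().cdf(z)

def norm_dist(x, mean, sd):
    """Analogue of NORM.DIST(x, mean, sd, TRUE): P(X < x)."""
    return NormalDist(mean, sd).cdf(x)

def norm_inv(p, mean, sd):
    """Analogue of NORM.INV(p, mean, sd): the x with P(X < x) = p."""
    return NormalDist(mean, sd).inv_cdf(p)

def norm_s_inv(p):
    """Analogue of NORM.S.INV(p): the z with P(Z < z) = p."""
    return NormalDist().inv_cdf(p)

print(round(norm_dist(67, 62, 4), 4))      # heights example
print(round(norm_inv(0.95, 60, 13), 2))    # course placement example
```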

---
class: center middle topic

# Sampling Distributions

---

## Learning Goals for Sampling Distribution

- Demonstrate understanding of the sampling distribution of a statistic.

- Explain how the central limit theorem applies in inference.

- Determine whether a sampling distribution is approximately a normal distribution.

- Calculate key characteristics (mean, standard error) of the sampling distribution of a statistic.

- Estimate the probability of an event using the sampling distribution.

---

## Sampling Distribution

- When a sample statistic is used to estimate a population parameter, there will be a chance error:
  `$$\text{Population Parameter}=\text{Sample Statistic}+\text{Chance Error}.$$`

- To understand the chance error, we need to know how sample statistics are distributed. Consider samples of the same size `$n$` randomly chosen from the population with replacement.

- The probability distribution of a sample statistic is called a **sampling distribution**.

---

## Visualization: Sampling Distribution from a Discrete Random Variable

.footmark[
Source: [https://istats.shinyapps.io/SampDist_discrete/](https://istats.shinyapps.io/SampDist_discrete/)
]

---

## Visualization: Sampling Distribution from a Continuous Random Variable

.footmark[
Source: [https://istats.shinyapps.io/sampdist_cont/](https://istats.shinyapps.io/sampdist_cont/)
]

---

## Sample Size Affects Standard Error

- The sampling distribution varies as the sample size changes. In general, a larger sample size results in a smaller standard deviation of the sampling distribution.

- The standard deviation of a sampling distribution is also called the **standard error**.

---

## Central Limit Theorem for Mean

- **The Central Limit Theorem:**

As the sample size `$n$` increases, the sampling distribution of the sample mean, from a population with the mean `$\mu$` and the standard deviation `$\sigma$`, approaches a normal distribution with the mean `$\mu_{\bar{X}}=\mu$` and the standard deviation `$\sigma_{\bar{X}}=\dfrac{\sigma}{\sqrt{n}}$`.

- **Remark:** In terms of standardization, the central limit theorem says that the random variable `$Z=\dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}$` has an approximately standard normal distribution.
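A small simulation can make the theorem concrete. The sketch below is illustrative only: the exponential population (mean 1, standard deviation 1), the sample size 50, the 5000 repetitions, and the random seed are all assumptions chosen for this example.

```python
import random
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

mu, sigma = 1.0, 1.0   # an exponential population with rate 1 has mean 1 and SD 1
n = 50                 # size of each sample
repetitions = 5000     # number of samples drawn

# Draw many samples from a skewed population and record each sample mean.
sample_means = [
    fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(repetitions)
]

mean_of_means = fmean(sample_means)   # should be close to mu
sd_of_means = pstdev(sample_means)    # should be close to sigma / sqrt(n)
theoretical_se = sigma / n ** 0.5

print(round(mean_of_means, 3), round(sd_of_means, 3), round(theoretical_se, 3))
```

Even though the population is strongly right-skewed, the simulated sample means center near `$\mu$` with a spread near `$\sigma/\sqrt{n}$`.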

---

## Visualization: Central Limit Theorem for Mean

.footmark[
  Source: [https://seeing-theory.brown.edu/probability-distributions/index.html#section3](https://seeing-theory.brown.edu/probability-distributions/index.html#section3)
]

---

## Required Sample Size to Apply the Central Limit Theorem for Mean

- For most distributions (not highly skewed), when sample size `$n>30$`, the sampling distribution of the sample mean `$\bar{X}$` can be approximated reasonably well by a normal distribution. The larger the sample size, the better the approximation will be.

- When the population is normally distributed, the sampling distribution of the sample means will be normally distributed for any sample size.

- If the population distribution is highly skewed, relying on CLT can be risky.

???
- [See the discussion on intuitive explanation.](https://stats.stackexchange.com/questions/3734/what-intuitive-explanation-is-there-for-the-central-limit-theorem/3904#3904)

---

## Example: Sampling Distribution of a Small Data Set (1/2)

Randomly draw samples of size 2 with replacement from the numbers 1, 3, 4.

- List all possible samples and calculate the mean of each sample.
- Find the mean and standard deviation of the sample means.
- Find the mean and standard deviation of the population.

**Solution:** Using the Excel function `AVERAGE()`, we may find means of samples and means of sample means.

Using the Excel function `STDEV.P()`, we may find the standard deviation of the population and the standard deviation of sample means.

.left-column[
  | `$\color{red}{\mu}$` | `$\color{red}{\sigma}$` | `$\color{blue}{\mu_{\bar{X}}}$` | `$\color{blue}{\sigma_{\bar{X}}}$` |
  | --------- | ----- | ----- | ----- |
  | .red[2.7]  | .red[1.25] |.blue[2.7]|.blue[0.88]|
]

.right-col[
  | sample   | (1,1) | (1,3) | (1,4) | (3,1) | (3,3) | (3,4) | (4,1) | (4,3) | (4,4) |
  | --------- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
  | `$\bar{X}$` | 1     | 2     | 2.5   | 2     | 3     | 3.5   | 2.5   | 3.5   | 4     |
]


It can be verified that `$\mu_{\bar{X}}=\mu$` and `$\sigma/\sqrt{n}=1.25/\sqrt{2}\approx 0.88=\sigma_{\bar{X}}$`.
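The same verification can be carried out by enumerating every sample. This sketch uses Python's standard library in place of the Excel functions `AVERAGE()` and `STDEV.P()`.

```python
from itertools import product
from statistics import fmean, pstdev

population = [1, 3, 4]
n = 2

# All ordered samples of size 2 drawn with replacement: 3 * 3 = 9 samples.
samples = list(product(population, repeat=n))
sample_means = [fmean(s) for s in samples]

mu = fmean(population)             # population mean
sigma = pstdev(population)         # population standard deviation
mu_xbar = fmean(sample_means)      # mean of the sample means
sigma_xbar = pstdev(sample_means)  # standard deviation of the sample means

print(round(mu, 2), round(sigma, 2), round(mu_xbar, 2), round(sigma_xbar, 2))
```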

---

## Example: Sampling Distribution of a Small Data Set (2/2)

**Solution: (Continued)** The following are the distribution of the population and the distribution of sample means.


---

## Example: Mean Length of Time on Hold

**Example:** Suppose the mean length of time that a caller is placed on hold when telephoning a customer service center is 23.8 seconds, with standard deviation 4.6 seconds. Find the probability that the mean length of time on hold in a random sample of 1,000 calls will be within 0.5 second of the population mean.

**Solution:** Since the sample size `$n=1000>30$` is large enough, by the Central Limit Theorem, we know the mean length of time is approximately normally distributed.

The mean of the sampling distribution is `$\mu_{\bar{X}}=\mu=23.8$`.

The standard deviation of the sampling distribution is `$\sigma_{\bar{X}}=\dfrac{\sigma}{\sqrt{n}}=\dfrac{4.6}{\sqrt{1000}}\approx 0.15.$`

By the Excel function `NORM.DIST(xbar, mean, sd, true)`, the probability is calculated as
$$
`\begin{aligned}
  &P(23.8-0.5<\bar{X}<23.8+0.5)\\
  =&P(\bar{X}<24.3)-P(\bar{X}<23.3)\approx 0.9997-0.0003=0.9994.
\end{aligned}`
$$

---

## Example: Normal Distribution vs Sampling of Normal Distribution

Suppose speeds of vehicles on a particular stretch of roadway are normally distributed with mean 36.6 mph and standard deviation 1.7 mph.

- Find the probability that the speed `$X$` of a randomly selected vehicle is between 35 and 40 mph.
- Find the probability that the mean speed `$\bar{X}$` of 10 randomly selected vehicles is between 35 and 40 mph.

**Solution:** In this example, the population is normally distributed. So the sampling distribution of the sample mean is always normally distributed. For calculation, the Excel function `NORM.DIST(X, mean, sd, true)` will be used.

As `$\mu=36.6$` and `$\sigma=1.7$`, the probability that the speed of a vehicle is between 35 and 40 is `$P(35< X< 40)=P(X< 40)-P(X<35)\approx 0.9772-0.1733=0.8039$`.

The mean of the sampling distribution is `$\mu_{\bar{X}}=\mu=36.6$`. The standard deviation of the sampling distribution is `$\sigma_{\bar{X}}=\sigma/\sqrt{n}=1.7/\sqrt{10}\approx 0.54.$` Then the probability is
`$$P(35<\bar{X}< 40)=P(\bar{X}< 40)-P(\bar{X}<35)\approx 1-0.0015=0.9985.$$`
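The contrast between a single observation and a sample mean comes entirely from the smaller standard deviation `$\sigma/\sqrt{n}$`. A sketch of both calculations, assuming Python's `statistics.NormalDist` in place of `NORM.DIST`:

```python
from statistics import NormalDist

mu, sigma, n = 36.6, 1.7, 10

single = NormalDist(mu, sigma)                 # speed of one vehicle
sample_mean = NormalDist(mu, sigma / n ** 0.5) # mean speed of n vehicles

p_single = single.cdf(40) - single.cdf(35)
p_mean = sample_mean.cdf(40) - sample_mean.cdf(35)

print(round(p_single, 4), round(p_mean, 4))
```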

---

## Sampling Distribution of a Sample Proportion

- When working with categorical variables, we often study the proportion of a data set.
  
  The proportion of a specific characteristic in a data set can be viewed as the mean of the data set by identifying the specific characteristic with 1 and others with `$0$`.

**Example:** Consider the following data set
  .center[.red[1], 0, .red[1], .red[1], 0, 0, .red[1], 0, .red[1], .red[1]]
  Find the proportion of .red[red numbers] and the mean of the data set.

**Solution:** The proportion of red numbers is `$\frac{6}{10}=0.6$`. So is the mean: `$\frac{6\cdot 1 + 4\cdot 0}{10}=0.6$`.

- Consider a population consisting of 1s and 0s. Let `$p$` be the proportion of 1s. Then the population mean is `$p$` and the population standard deviation is
  `$$\sigma=\sqrt{(1-p)^2p+(0-p)^2(1-p)}=\sqrt{p(1-p)}.$$`

---

## Central Limit Theorem for Proportion

- For a sampling distribution of sample proportion, we write `$\hat{P}$` for the random variable of sample proportions.

- **Central Limit Theorem for Proportion:**
  
  For large samples, the distribution of sample proportions `$\hat{P}$` is approximately normal, with the mean `$\mu_{\hat{P}}=p$` and standard deviation `$\sigma_{\hat{P}}=\sqrt{\frac{p(1-p)}{n}}$`, where `$p$` is the population proportion.

---

## Required Sample Size to Apply the Central Limit Theorem for Proportion

- A sample proportion is always between 0 and 1, and about 99.7% of sample proportions lie within 3 standard deviations of the population proportion. So, when using the central limit theorem for proportion, we require the sample size `$n$` to satisfy the following condition: the interval `$\left[p-3\sqrt{\frac{p(1-p)}{n}}, p+3\sqrt{\frac{p(1-p)}{n}}\right]$` lies wholly in the interval `$[0, 1]$`.

- In practice, if `$n$` satisfies the two inequalities `$np\ge 10$` and `$n(1-p)\ge 10$`, then we consider `$n$` large enough to assume that the sampling distribution of the sample proportion is approximately normal.

- When the population proportion `$p$` is unknown, to apply the central limit theorem for proportion, we require the sample size `$n$` to satisfy the same conditions with `$p$` replaced by the sample proportion `$\hat{p}$`. That is, the sample size `$n$` should satisfy `$n\hat{p}\ge 10$` and `$n(1-\hat{p})\ge 10$`.

---

## Example: Sampling Voters

Suppose that in a population of voters in a certain region 53% are in favor of a particular law. Nine hundred randomly selected voters are asked if they favor the law.

Find the probability that the sample proportion computed from a random sample of size 900 will be at least 2 percentage points above the true population proportion.

**Solution:** We first verify that the sampling distribution is approximately normal.

Since `$p=0.53$` and `$n=900$`, `$np=900\cdot 0.53>10$` and `$n(1-p)=900(1-0.53)>10$`. By the central limit theorem, the sampling distribution is approximately normal.

The standard deviation of the sampling distribution is `$\sigma_{\hat{P}}=\sqrt{\frac{0.53(1-0.53)}{900}}\approx 0.017$`.

Then, using the rounded standard deviation 0.017, the probability that the sample proportion is at least 0.55 (that is, at least 2 percentage points above 53%) is
`$$P(\hat{P}>0.55)=1-P(\hat{P}\le 0.55)\approx 1-0.8803=0.1197.$$`
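The steps of this example (check the sample-size condition, compute the standard deviation, then find the tail probability) can be sketched in Python as follows. The variable names are illustrative. The code keeps the unrounded standard deviation, so its result, roughly 0.115, differs slightly from a calculation based on the rounded value 0.017.

```python
from statistics import NormalDist

p, n = 0.53, 900

# Sample-size condition for the normal approximation.
large_enough = n * p >= 10 and n * (1 - p) >= 10

# Standard deviation of the sampling distribution of the sample proportion.
se = (p * (1 - p) / n) ** 0.5

# P(P-hat > 0.55) under the normal approximation.
prob = 1 - NormalDist(p, se).cdf(0.55)

print(large_enough, round(se, 4), round(prob, 4))
```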

---

## Example: Traffic Accidents Caused by Distraction

Suppose that 36% of all car accidents involve injury. Find the probability that the injury rate in a random sample of 250 car accidents is between 30% and 45%.

**Solution:** First we verify that the sample size is large enough to assume the sample proportion is approximately normally distributed by the Central Limit Theorem.

The injury rate of all car accidents is `$p=36\%=0.36$` and the sample size is `$n=250$`. Because `$np=250\cdot 0.36=90>10$` and `$n(1-p)=250\cdot(1-0.36)=160>10$`, the sample size is considered large enough.

Let `$\hat{P}$` be the sample proportion of a random sample. By the Central Limit Theorem, the distribution of `$\hat{P}$` is approximately normal with the mean `$p=0.36$` and standard deviation `$\sigma_{\hat{P}}=\sqrt{\frac{p(1-p)}{n}}\approx 0.03$`.

Using the Excel function, `NORM.DIST(x, mean, SD, TRUE)`, we find the probability of a random sample of 250 car accidents with the injury rate between 30% and 45% is
$$
  P(0.30<\hat{P}<0.45)=P(\hat{P}<0.45)-P(\hat{P}<0.30)
  \approx 0.999-0.023
  =0.976
$$

---

## Practice: Sample mean within an interval

An unknown distribution has a mean of 28 and a standard deviation 6. Samples of size n = 30 are drawn randomly from the population. Find the probability that the sample mean is between 27 and 30.

---

## Practice: Sample mean of GPA

The numerical population of grade point averages at a college has mean 2.61 and standard deviation 0.5. If a random sample of size 100 is taken from the population, what is the probability that the sample mean will be between 2.51 and 2.71?

.footmark[
  Source: [Example 4 in Section 6.2 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s10-02-the-sampling-distribution-of-t.html)
]

---

## Practice: Proportion of red candy

---

## Practice: Proportion of voting

In a mayoral election, based on a poll, a newspaper reported that the current mayor received 45% of the vote. If this is true, what is the probability that a random sample of 100 voters had less than 35% voting for the current mayor?

---

## Practice: Sampling Distribution of Mean with Unknown Population Distribution

A population has mean 73.5 and standard deviation 2.5.

1. Find the mean and standard deviation of `$\bar{X}$` for samples of size 30.
2. Find the probability that the mean of a sample of size 30 will be less than 72.

.footmark[
  Source: [Exercise 3 in Section 6.2 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s10-02-the-sampling-distribution-of-t.html).
]

---

## Practice: Sampling Distribution of Mean with Normal Population Distribution

A normally distributed population has mean 57.7 and standard deviation 12.1.

1. Find the probability that a single randomly selected element X of the population is less than 45.
2. Find the mean and standard deviation of `$\bar{X}$` for samples of size 16.
3. Find the probability that the mean of a sample of size 16 drawn from this population is less than 45.

.footmark[
  Source: [Exercise 6 in Section 6.2 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s10-02-the-sampling-distribution-of-t.html).
]

---

## Practice: Cholesterol Level in Large Eggs

Suppose the mean amount of cholesterol in eggs labeled “large” is 186 milligrams, with standard deviation 7 milligrams. Find the probability that the mean amount of cholesterol in a sample of 144 eggs will be within 2 milligrams of the population mean.

.footmark[
  Source: [Exercise 15 in Section 6.2 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s10-02-the-sampling-distribution-of-t.html).
]

---

## Practice: Color Blindness Rate

Suppose that 8% of all males suffer some form of color blindness. Find the probability that in a random sample of 250 men at least 10% will suffer some form of color blindness.

.footmark[
  Source: [Exercise 13 in Section 6.3 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s10-03-the-sample-proportion.html).
]

---

## Practice: Testing an Airline's Claim

An airline claims that 72% of all its flights to a certain region arrive on time. In a random sample of 30 recent arrivals, 19 were on time. You may assume that the normal distribution applies.

1. Compute the sample proportion.
2. Assuming the airline’s claim is true, find the probability of a sample of size 30 producing a sample proportion as low as was observed in this sample.

.footmark[
  Source: [Exercise 17 in Section 6.3 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s10-03-the-sample-proportion.html).
]

---

## Practice: Minimal Mean Weight of a Particular Fruit

---

## Lab: The `NORM.DIST()` Function

- The `$Z$`-score of a value `$x$` of a random variable can be calculated using the Excel function `STANDARDIZE(x, mean, sd)`.

- Let `$Z$` be a standard normal random variable. The probability `$P(Z<z)$` can be calculated using the Excel function `NORM.S.DIST(z, TRUE)`.

---
class: center middle topic

# Confidence Intervals for Mean

---

## Learning Goals for Confidence Intervals

- Determine whether the study meets the conditions under which inferences on a population parameter may be performed.

- Demonstrate understanding of confidence level `$1-\alpha$`.

- Explain when and why to use the normal distribution or the *t*-distribution for a given study.

- Determine the appropriate degrees of freedom associated with the *t*-distribution.

- Determine the critical values using tables or Excel functions.

- Describe how the following will affect the width of the confidence interval:
  - increasing the sample size;
  - increasing the confidence level.

- Construct and interpret a confidence interval for one population mean.

---

## Point Estimation

- When estimating a population parameter, we may consider the statistic of a random sample as an estimate of the population parameter. But we expect some chance error.

- Estimating an unknown parameter by a single number calculated from a sample is called a **point estimation**. The single number (statistic) from the sample is called a **point estimate**.

- A point estimate gives no indication of how reliable the estimate is or how large the error is.

---

## Example: Estimating Population Proportion by a Sample Proportion

From a box of 20 pencils of two colors, black and blue, 10 pencils were randomly drawn. Six of the 10 pencils are black. What proportion of the pencils in the box are black?

**Solution:** Since the sample proportion is 0.6, one may make a point estimation that 60% of the box, or 12 are black pencils. However, we don't know how close the sample proportion is to the population proportion.

---

## Interval Estimation

- To increase the chance of capturing the true value, we estimate an unknown parameter using intervals that are obtained by adding chance errors to a point estimate.
  
- Estimating an unknown parameter using an interval of values which likely contains the true value of the parameter is called an **interval estimation**. The interval is called an **interval estimate**.

- The reliability of an interval estimate is measured by the probability `$1-\alpha$` that the interval estimate will capture the true value of the parameter. This probability `$1-\alpha$` is called the [**confidence level**](https://saylordotorg.github.io/text_introductory-statistics/s11-estimation.html).

- The 90%, 95% and 99% levels of confidence are frequently used in statistical studies. The 95% level of confidence is usually the standard choice of confidence level for scientific polls published in the media and online.

---

## Example: Average GPA Falling in an Interval

Recall that the **standard error** of a statistic, denoted by SE, is the standard deviation of the sampling distribution.

A random sample of 100 students at a college has an average GPA of 3.0. How likely is it that the interval `$[3.0-2\cdot\text{SE}, 3.0+2\cdot\text{SE}]$` contains the average GPA `$\mu$` of that college?

**Solution:** The probability that the interval `$[3.0-2\cdot\text{SE}, 3.0+2\cdot\text{SE}]$` contains the population mean `$\mu$` equals the probability that the sample mean 3.0 lies in the interval `$[\mu-2\cdot\text{SE}, \mu+2\cdot\text{SE}]$`. By the central limit theorem and the empirical rule, the interval `$[\mu-2\cdot\text{SE}, \mu+2\cdot\text{SE}]$` contains about 95.4% of all sample means.

That means we can be about 95.4% confident that the interval `$[3.0-2\cdot\text{SE}, 3.0+2\cdot\text{SE}]$` contains the average GPA `$\mu$` of that college.

---

## Confidence Interval

- When the sampling distribution of a statistic is approximately symmetric, we take interval estimates in the following form `$[\text{Statistic}- \text{E}, \text{Statistic}+ \text{E}],$` where the value `$\text{E}$` is called the **marginal error** or **margin of error**.

- Given a confidence level `$100(1-\alpha)\%$`, the marginal error `$\text{E}$` is the value such that `$100(1-\alpha)\%$` of the intervals `$[\text{Statistic}- \text{E}, \text{Statistic}+ \text{E}]$` contain the true parameter `$\mu_\text{par}$`. Equivalently, the marginal error `$\text{E}$` is the value such that `$100(1-\alpha)\%$` of statistics are in the interval `$[\mu_\text{par}- \text{E}, \mu_\text{par}+ \text{E}]$`.

- Denote by `$X$` the random variable for the sample statistic. Then `$\text{E}$` is determined by the following probability equation
  `$$P(\mu_\text{par}-\text{E}< X < \mu_\text{par}+\text{E})=1-\alpha.$$`
  
  If the distribution of `$X$` is symmetric, then the marginal error `$E$` is the value such that
  `$$P(X-\mu_\text{par}<\text{E})=1-\alpha/2.$$`

---

## Visualization: Confidence Interval for Mean

---

## Confidence Intervals for Mean with Known Population SD

- Suppose the population standard deviation `$\sigma$` is given. By the central limit theorem, if `$n>30$` or the population distribution is approximately normal, then the sampling distribution is approximately normal with the standard error `$\sigma/\sqrt{n}$`.  
  
  At the confidence level `$1-\alpha$`, the marginal error `$E$` for a population mean `$\mu$`  is
  `$$E=z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$` and the confidence interval is
  `$$\left[\bar{x}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \bar{x}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right],$$`
  where [the **critical value** `$z_{\alpha/2}$` satisfies that `$P(Z<z_{\alpha/2})=1-\alpha/2$`](https://saylordotorg.github.io/text_introductory-statistics/s11-01-large-sample-estimation-of-a-p.html) for the standard normal variable `$Z$`.

- In Excel, `$z_{\alpha/2}$`=`NORM.S.INV(1-α/2)`. Symmetrically, `$z_{\alpha/2}$`=`-NORM.S.INV(α/2)`.
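The critical value depends only on the confidence level. The sketch below assumes Python's `statistics.NormalDist.inv_cdf` as an analogue of `NORM.S.INV`; the helper name `z_critical` and the two confidence levels are hypothetical choices for illustration.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Critical value z_{alpha/2} with P(Z < z_{alpha/2}) = 1 - alpha/2."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

z_90 = z_critical(0.90)
z_98 = z_critical(0.98)

print(round(z_90, 4), round(z_98, 4))
```

A higher confidence level gives a larger critical value and therefore a wider interval.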

---

## Example: Find Critical Values

A sample of size 15 is drawn from a normally distributed population with the standard deviation 6. Find the critical value `$z_{\alpha/2}$` needed in the construction of a confidence interval:

1. when the level of confidence is 90%;
2. when the level of confidence is 98%.

**Solution:** One can find the critical value `$z_{\alpha/2}$` by using the normal distribution table. Here, we will use the Excel function `NORM.S.INV(prob)`.

1. By our definition, `$1-\alpha=0.9$`. Then `$\alpha=1-0.9=0.1$`. Using the Excel function `NORM.S.INV(1-0.1/2)`, we get the critical value
   `$$z_{\alpha/2}=1.6449.$$`

2. By our definition, `$1-\alpha=0.98$`. Then `$\alpha=1-0.98=0.02$`. Using the Excel function `NORM.S.INV(1-0.02/2)`, we get the critical value
   `$$z_{\alpha/2}=2.3263.$$`

.footmark[
  [Checkout this normal distribution interactive app](https://istats.shinyapps.io/NormalDist/)
]

---

## Example: Mean GPA with Known Population SD

A random sample of 50 students from a college gives a mean GPA 2.51. Suppose the standard deviation of GPA of all students at the college is 0.43. Construct a 99% confidence interval for the mean GPA of all students at the college.

**Solution:** We first gather information from the question:

- The sample size is `$n=50$`,
- The sample mean is `$\bar{x}=2.51$`,
- The population standard deviation is `$\sigma=0.43$`, and
- The confidence level is `$1-\alpha=0.99$` which implies that `$\alpha/2=0.005$`.

Now let's find the critical value and the standard error.

- The critical value `$z_{0.005}$`=`-NORM.S.INV(0.005)` `$\approx 2.576$`
- The standard error is `$\sigma_{\bar{x}}=\sigma/\sqrt{n}=0.43/\sqrt{50}\approx 0.06.$`

Then the marginal error is `$\text{E}=z_{0.005}\cdot\sigma_{\bar{x}}\approx 2.576\cdot 0.0608\approx 0.16.$` One may conclude with 99% confidence that the true mean GPA of all students is contained in the confidence interval `$[2.51-0.16, 2.51+0.16]=[2.35, 2.67]$`.
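
The arithmetic of this example can be reproduced with a short Python sketch. It assumes the known-`$\sigma$` formula `$\bar{x}\pm z_{\alpha/2}\,\sigma/\sqrt{n}$` from the earlier slide; the function name `z_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Confidence interval for a mean when the population SD sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value z_{alpha/2}
    margin = z * sigma / sqrt(n)                        # marginal error E
    return xbar - margin, xbar + margin

low, high = z_interval(2.51, 0.43, 50, 0.99)
print(round(low, 2), round(high, 2))  # 2.35 2.67
```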

---

## Student's `$t$`-Distribution

- When the population standard deviation is unknown, we may replace `$\sigma$` by the sample standard deviation `$s$` and use `$s/\sqrt{n}$` as an estimate of the standard error for the sampling distribution of the sample mean.

- When we use the estimated standard error `$s/\sqrt{n}$` to build a confidence interval, the .red[normal distribution may NOT] be appropriate for calculating the critical value.

- Indeed, if the random variable `$\bar{x}$` is approximately normal, then the random variable `$t=\dfrac{\bar{x}-\mu}{s/\sqrt{n}}$` has a **[Student's `$t$`-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution) with `$n-1$` degrees of freedom**.

???

- This result was discovered by William Gosset, an employee of the Guinness brewing company, who published his result using the name Student.

- Unlike in the case of a sample proportion, the sample standard deviation `$s$` is not determined by the sample mean `$\bar{x}$`.

---

## Properties of Student's `$t$`-Distribution

- The `$t$`-distributions are a family of curves, called **`$t$`-curves**, parameterized by the degrees of freedom.

- The `$t$`-distribution has the following important properties.
  1. Similar to the standard normal curve, it is .blue[symmetric about 0] and the .blue[total area] under a `$t$`-curve .blue[is 1].
  2. The `$t$`-distribution has slightly more variation (i.e. `$t$`-curves are .blue[slightly “fatter”]) than the standard normal distribution.
  3. When the .blue[degree of freedom increases], the `$t$`-distribution becomes .blue[closer to the standard normal distribution].

- In practice, when the sample size is large enough (`$n>30$`), people use the normal distribution as an approximation for the Student `$t$`-distribution.

---

## Visualization: `$t$`-distributions

---

## Confidence Intervals for a Mean with .red[Unknown] Population SD

- Suppose the sampling distribution is approximately normal. At the confidence level `$1-\alpha$`, the margin of error is
  `$$E=t_{\alpha/2}\frac{s}{\sqrt{n}},$$`
  and the confidence interval for a population mean `$\mu$` is
  `$$\left[\bar{x}-t_{\alpha/2}\frac{s}{\sqrt{n}}, \bar{x}+t_{\alpha/2}\frac{s}{\sqrt{n}}\right],$$`
  where `$t_{\alpha/2}$` is the critical value such that `$P(T<t_{\alpha/2})=1-\alpha/2$` for a Student `$t$`-distribution with `$n-1$` degrees of freedom.

- In Excel, the critical value `$t_{\alpha/2}$` can be calculated by `T.INV(1-α/2, n-1)` or `T.INV.2T(α, n-1)`, where `$n$` is the sample size. Note that `T.INV.2T` takes the total area of both tails, which is `$\alpha$`.

---

## Example: Critical Values for `$t$`-Distributions

A sample of size 15 is drawn from a normally distributed population. Find the critical value `$t_{\alpha/2}$` needed to construct a confidence interval:

1. when the level of confidence is 99%;
2. when the level of confidence is 95%.

**Solution:** To find the critical value `$t_{\alpha/2}$`, we may use the Excel function `T.INV(left tail area, df)` or `T.INV.2T(tail areas, df)`.

1. Since `$1-\alpha=0.99$`, we have `$\alpha/2=(1-0.99)/2=0.005$` and
   `$t_{0.005}$` =`T.INV.2T(0.01, 14)`=`-T.INV(0.005, 14)`=`T.INV(1-0.005, 14)`=2.9768.

2. Since `$1-\alpha=0.95$`, we have `$\alpha/2=(1-0.95)/2=0.025$` and
   `$t_{0.025}$` =`T.INV.2T(0.05, 14)`=`-T.INV(0.025, 14)`=`T.INV(1-0.025, 14)`=2.1448.
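
Python's standard library has no built-in `$t$`-distribution, so the sketch below approximates the critical value numerically. It is one possible approach under stated assumptions, not what Excel does internally: the left-tail area is integrated with Simpson's rule and the equation `$P(T<t)=1-\alpha/2$` is solved by bisection. All function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T < x): Simpson's rule on [0, |x|] combined with symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(confidence, df):
    """Approximate t_{alpha/2} by solving P(T < t) = 1 - alpha/2 with bisection."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:   # widen the bracket until it contains the root
        hi *= 2
    for _ in range(60):             # halve the bracket repeatedly
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(0.99, 14), 4))
print(round(t_critical(0.95, 14), 4))
```

With `df = 14`, the two printed values should agree with 2.9768 and 2.1448 up to rounding.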

---

## Example: Confidence Interval with Unknown Population SD

A sample of size 16 is randomly drawn from a normally distributed population. The sample has a mean 79 and standard deviation 7. Construct a confidence interval for that population mean at the 90% level of confidence.

**Solution:** Since the population is normally distributed, and the population standard deviation is unknown, we apply the formula `$\text{E}=t_{\alpha/2}\cdot\dfrac{s}{\sqrt{n}}$` for marginal error.

Since the sample size is 16, the degree of freedom is .red[df=15].

At 90% confidence level, `$\alpha=1-0.9=0.1$`.

From the table or using the function `T.INV.2T(0.1, 15)`, we find that `$t_{0.05}\approx 1.753$`.

Then the marginal error is `$\text{E}=1.753\cdot 7/\sqrt{16}\approx 3$`. Thus `$\bar{x}-\text{E}=79-3=76$` and `$\bar{x}+\text{E}=79+3=82$`.

With 90% confidence, we may conclude that the true population mean is contained in the interval `$[76, 82]$`.
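
As a quick check of the computation, here is a minimal Python sketch. It assumes the critical value 1.753 is read from a table or from Excel and simply passed in; the helper name `t_interval` is made up for illustration.

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit):
    """Interval xbar +/- t_crit * s / sqrt(n); t_crit is supplied by the caller."""
    margin = t_crit * s / sqrt(n)
    return xbar - margin, xbar + margin

low, high = t_interval(79, 7, 16, 1.753)
print(round(low, 1), round(high, 1))  # 75.9 82.1
```

Without rounding the marginal error to 3, the endpoints are about 75.9 and 82.1, which agrees with `$[76, 82]$` to the nearest whole number.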

---

## Example: Average Working Hours in Grocery Stores

The data below show the numbers of hours worked by 40 randomly selected employees from several grocery stores in the county.

Construct a 99% confidence interval for the mean number of hours worked.

**Solution:** Since the sample size is 40 (>30), by the central limit theorem, the sample mean is approximately normally distributed.
Applying the Excel functions `AVERAGE()` and `STDEV.S()` to the data, we find `$\bar{x}\approx 29.6$` and `$s\approx 5.3$`.

Since `$df=40-1=39$` and `$\alpha=1-0.99=0.01$`, the critical value is `$t_{0.005}\approx 2.708$` and the marginal error is `$\text{E}=2.708\cdot 5.3/\sqrt{40}\approx 2.3$`. Thus `$\bar{x}-\text{E}=29.6-2.3=27.3$` and `$\bar{x}+\text{E}=29.6+2.3=31.9$`.

With 99% confidence, one may conclude that the true mean number of hours worked is contained in the interval `$[27.3, 31.9]$`.

---

## Choose Between Normal Distribution and `$t$`-Distribution

- Population is .red[approximately normally] distributed.
  - the population standard deviation `$\sigma$` is .blue[known]: use the .blue[normal distribution].
  - the population standard deviation `$\sigma$` is .yellow[*unknown*]: use the `$t$`-.yellow[*distribution*].

- Population distribution unknown, but .gold[sample size is large enough], i.e. `$n>30$`.
  - the population standard deviation `$\sigma$` is .gray[known]: use .gray[normal distribution].
  - the population standard deviation `$\sigma$` is .purple[*unknown*]: either one can be used but the `$t$`-.purple[*distribution*] is more accurate.

- **.red[Warning:]** If the population distribution is unknown and the .red[sample size is small, neither] the `$t$`-distribution nor the normal distribution is reliable.

- For small samples, there is a method called "[The Shapiro–Wilk test](http://www.sthda.com/english/wiki/normality-test-in-r#normality-test)" which can be used to check whether it is reasonable to assume that the data come from an approximately normal population.

- Even when `$n>30$`, we should visually inspect the data for normality (using a histogram, for example).

---

## Practice: Conceptual Questions on Confidence Intervals

Decide whether the following statements are true or false. Explain your reasoning.

- The statement, "the 95% confidence interval for the population mean is (350, 400)" means that 95% of the population values are between 350 and 400.
- For a given standard error, lower confidence levels produce wider confidence intervals.
- If you increase sample size, the width of confidence intervals will increase.
- If you take large random samples over and over again from the same population, and make 95% confidence intervals for the population average, about 95% of the intervals should contain the population average.

.footmark[
  Source: [Conceptual Questions on Confidence Intervals](http://www2.stat.duke.edu/~jerry/sta101/confidenceintervalsans.html)
]

---

## Practice: Confidence Interval sigma known SAT Scores

---

## Practice: Find the Critical Value

---

## Practice: Find the Marginal Error

---

## Practice: How Much Alcohol Do College Students Drink

A statistics student is curious about drinking habits of students at his college. He wants to estimate the mean number of alcoholic drinks consumed each week by students at his college. He plans to use a 90% confidence interval. He surveys a random sample of 71 students. The sample mean is 3.93 alcoholic drinks per week. The sample standard deviation is 3.78 drinks.

.footmark[
  Source: [Estimating a Population Mean (2 of 3)](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/estimating-a-population-mean-2-of-3/)
]

---

## Practice: Estimating Average Distance from Home to Workplace

Four hundred randomly selected working adults in a certain state, including those who worked at home, were asked the distance from their home to their workplace. The average distance was 8.84 miles with standard deviation 2.70 miles.

Construct a 98% confidence interval for the mean distance from home to work for all residents of this state.

.footmark[
  Source: [Exercise 8 in Section 7.1 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s11-01-large-sample-estimation-of-a-p.html)
]

---

## Practice: Estimate Mean Lifetime

City planners wish to estimate the mean lifetime of the most commonly planted trees in urban settings. A sample of 16 recently felled trees yielded mean age 32.7 years with standard deviation 3.1 years. Assuming the lifetimes of all such trees are normally distributed, construct a 99.8% confidence interval for the mean lifetime of all such trees.

.footmark[
Source: [Exercise 7 in Section 7.2 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s11-02-small-sample-estimation-of-a-p.html)
]

---

## Practice: Confidence Interval from a small data set

---

## Lab: Excel Functions for `$t$`-Distributions

Suppose a Student's `$t$`-distribution has the degree of freedom `$\text{df}=n-1$`.

- Find a probability for a given `$t$`-value.

  - The area of the left tail of the `$t$`-value may be calculated by the function `T.DIST(t,df,true)`.

  - The area of the right tail of the `$t$`-value may be calculated by the function `T.DIST.RT(t,df)`.

  - The area of two tails of the `$t$`-value (here `$t>0$`) may be calculated by the function `T.DIST.2T(t,df)`.

- Find the critical value for a given probability `$p$`.

  - When the area of the left tail is given, the function `T.INV(p,df)` may be used.

  - When the area of both tails is given, the function `T.INV.2T(p,df)` may be used. This function is convenient for constructing confidence intervals.

---
class: center middle topic

# Confidence Intervals for Proportions

---

## Learning Goals for Confidence Intervals for Proportions

- Construct and interpret a confidence interval for one population proportion.

- Describe how the following will affect the width of the confidence interval:

  - increasing the sample size;

  - increasing the confidence level.

---

## Confidence Intervals for a Proportion

- Recall that the standard error of sample proportions is `$\sigma_{\hat{P}}=\sqrt{\frac{p(1-p)}{n}}$`, where `$n$` is the sample size and `$p$` is the population proportion. Since `$p$` is unknown when we are estimating it, we only have a point estimate
  `$$\hat{\sigma}_{\hat{p}}=\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$$`
  for the standard error of sample proportions, that is,
  `$$\sigma_{\hat{p}}\approx\hat{\sigma}_{\hat{p}}=\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}.$$`

- Based on the central limit theorem, when `$n$` is large enough, at the `$100(1-\alpha)\%$` level, the margin of error for `$p$` is defined as
  `$$E=z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$` and the
  confidence interval for `$p$` is defined by
  `$$\left[\hat{p}-z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right],$$`
  where [the critical value `$z_{\alpha/2}$` satisfies `$P(Z<-z_{\alpha/2})=\alpha/2$`](https://saylordotorg.github.io/text_introductory-statistics/s11-01-large-sample-estimation-of-a-p.html) for the standard normal variable `$Z$`. In Excel, `$z_{\alpha/2}$`=`-NORM.S.INV(α/2)`=`NORM.S.INV(1-α/2)`.

- The sample size `$n$` is considered large enough if `$n\hat{p}\ge 10$` and `$n(1-\hat{p})\ge 10$`.

- The above defined confidence interval is known as the [normal approximation (or Wald's) confidence interval](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval).
  It is popular in introductory statistics books. However, it is unreliable when the sample size is small or the sample proportion is close to 0 or 1.
  Indeed, if the sample proportion is 0 or 1, the confidence interval defined here will have zero length.

???
By the central limit theorem, the random variable `$\hat{p}$` is approximately normally distributed. The chance that `$p\in \left[\hat{p}-z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$` is approximately the same as the chance that `$\hat{p}\in \left[p-z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}, p+z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\right]$`. This shows that `$z_{\alpha/2}$` satisfies
`$$P\left(-z_{\alpha/2}<\dfrac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}}<z_{\alpha/2}\right)=1-\alpha.$$`

---

## Example: Estimating the Proportion of Students Taking Busses (1/2)

In a random sample of 100 students in college, 65 said that they come to college by bus.

1. Give a point estimate of the proportion of all students who come to college by bus.

2. Construct a 99% confidence interval for that proportion.

**Solution:** A good point estimate would be a sample proportion. Here the sample proportion is `$\hat{p}=65/100=0.65$`.

Since `$n\hat{p}=100\cdot 0.65=65>10$` and `$n(1-\hat{p})=100\cdot 0.35=35>10$`, the sample is large enough, and the estimated standard error is
`$$\hat{\sigma}_{\hat{P}}=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=\sqrt{\frac{0.65(1-0.65)}{100}}\approx0.048.$$`

---

## Example: Estimating the Proportion of Students Taking Busses (2/2)

**Solution: (Continued)** At 99% level of confidence, the value `$\alpha=1-0.99=0.01$` and `$\alpha/2=0.01/2=0.005$`. The critical value `$z_{\alpha/2}$` is determined by the equation `$P(Z<-z_{0.005})=0.005$` (or equivalently `$P(Z<z_{0.005})=1-0.005$`). Using the Excel function `-NORM.S.INV(α/2)` (or `NORM.S.INV(1-α/2)`), we find the critical value `$z_{0.005}\approx 2.576$`.

Thus the marginal error is
`$$E=z_{\alpha/2}\cdot \hat{\sigma}_{\hat{P}}\approx 2.576\cdot 0.0477\approx 0.123,$$`
and the confidence interval at 99% level is
`$$[\hat{p}-E, \hat{p}+E]\approx [0.65-0.123, 0.65+0.123]=[0.527, 0.773].$$`

Conclusion: we are 99% confident that the true proportion of all students at the college who take the bus lies in the interval `$[0.527, 0.773]$`.
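
The whole calculation can be sketched in Python as follows. This is an illustration of the normal approximation (Wald) interval only; the function name `proportion_interval` is hypothetical, and the size check mirrors the rule `$n\hat{p}\ge 10$` and `$n(1-\hat{p})\ge 10$`.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence):
    """Normal approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    # Large-sample check used in these slides.
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value z_{alpha/2}
    margin = z * sqrt(p_hat * (1 - p_hat) / n)          # marginal error E
    return p_hat - margin, p_hat + margin

low, high = proportion_interval(65, 100, 0.99)
print(round(low, 3), round(high, 3))  # 0.527 0.773
```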

---

## Example: Estimating a Population Proportion (1/2)

Foothill College’s athletic department wants to calculate the proportion of students who have attended a women’s basketball game at the college. They use student email addresses, randomly choose 220 students, and email them. Of the 145 who responded, 22 had attended a women’s basketball game.

Calculate and interpret the approximate 90% confidence interval for the proportion of all Foothill College students who have attended a women’s basketball game.

**Solution:** Although 220 students were surveyed, only 145 responded. So the sample consists of those 145 students. The sample proportion is `$\hat{p}=\frac{22}{145}\approx 0.152$`.

The estimated standard error is
`$$\hat{\sigma}_{\hat{P}}=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=\sqrt{\frac{0.152\cdot(1-0.152)}{145}}\approx 0.03.$$`

---

## Example: Estimating a Population Proportion (2/2)

**Solution: (Continued)**
At the 90% confidence level, `$\alpha=1-0.9=0.1$` and the critical value is `$z_{\alpha/2}$`=`NORM.S.INV(1-0.1/2)` `$\approx 1.645$`.

Therefore, the marginal error is
`$$E=z_{\alpha/2}\cdot\hat{\sigma}_{\hat{P}}\approx 1.645\cdot0.03=0.049,$$`
and the confidence interval is
`$$[\hat{p}-E, \hat{p}+E]\approx[0.152-0.049, 0.152+0.049]=[0.103, 0.201].$$`

Conclusion: we are 90% confident that the proportion of all Foothill College students who have attended a women’s basketball game is between 0.103 and 0.201.

.footmark[
  Source: [Estimating a Population Proportion in the book Concepts in Statistics](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/estimating-a-population-proportion-1-of-3/)
]

---

## Factors Affecting the Width of Confidence Intervals

- The width of a confidence interval, which equals twice the marginal error, gives a measure of the precision of the estimation.

- Recall, for population proportion and mean,
  `$$\text{Marginal Error} = \text{Critical Value}\cdot \frac{\text{(estimated) Population SD}}{\sqrt{\text{Sample Size}}}$$`

- The formula tells us the precision of a confidence interval is affected by the confidence level, the variability, and the sample size.

  - Larger confidence levels give larger critical values and larger errors.

  - Populations (and samples) with more variability give larger errors.

  - Larger sample sizes give smaller errors.
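
These three effects can be seen numerically with a small Python sketch. The standard deviation of 10 and the grid of sample sizes are made-up illustration values, and `margin_of_error` is a hypothetical helper based on the `$z$` critical value.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sd, n, confidence):
    """Critical value times SD divided by the square root of the sample size."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sd / sqrt(n)

# Hypothetical SD of 10: compare several confidence levels and sample sizes.
for confidence in (0.90, 0.95, 0.99):
    for n in (25, 100, 400):
        print(confidence, n, round(margin_of_error(10, n, confidence), 2))
```

Reading down the output, the error grows with the confidence level and shrinks as `$n$` grows; quadrupling the sample size halves the error.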

---

## Sample Size Determination

- In practice, we may desire a marginal error of `$E$`. With a fixed confidence level `$100(1-\alpha)\%$`, the larger the sample size the smaller the marginal error.

- When estimating population proportion, if we can produce a reasonable guess `$\hat{p}$` for population proportion, then an appropriate minimum sample size for the study is determined by
  `$$n=\left(\frac{z_{\alpha/2}}{{E}}\right)^2\cdot \hat{p}(1-\hat{p}).$$`

- When estimating population mean, if we can produce a reasonable guess `$\sigma$` for the population standard deviation, then an appropriate minimum sample size is given by
  `$$n=\left(\dfrac{z_{\alpha/2}\cdot \sigma}{{E}}\right)^2.$$`

---

## Example: Minimum Sample Size - Error in Proportion

Suppose you want to estimate the proportion of students at QCC who live in Queens. By surveying your classmates, you find around 70% live in Queens. Use this as a guess to determine how many students would need to be included in a random sample if you wanted the margin of error for a 95% confidence interval to be less than or equal to 2%.

**Solution:** We may use `$\hat{p}=0.7$` as a reasonable guess for the population proportion.

At the 95% level, the critical value is `$z_{0.025}=$` `NORM.S.INV(1-0.025)` `$\approx 1.96$`.

The marginal error is `$E=0.02$`.

Then the appropriate minimal sample size is determined by
`$$n=\left(\frac{z_{\alpha/2}}{{E}}\right)^2\cdot \hat{p}(1-\hat{p})=(1.96/0.02)^2\cdot 0.7\cdot(1-0.7)=2016.84.$$`

Since the sample size has to be an integer, to get an error no more than 2% at the level 95%, the sample size should be at least 2017.

---

## Example: Minimum Sample Size - Error in Mean

Find the minimum sample size necessary to construct a 99% confidence interval for the population mean with a margin of error `$E =0.2$`. Assume that the estimated population standard deviation is `$\sigma=1.3$`.

**Solution:** At the 99% level, we have `$\alpha/2=(1-0.99)/2=0.005$`.

The critical value is `$z_{0.005}=$` `NORM.S.INV(1-0.005)` `$\approx 2.576$`.

The desired marginal error is `${E}=0.2$`.

The estimated population standard deviation is `$\sigma=1.3$`.

Then the minimal sample size is approximately
`$$n=\left(\dfrac{z_{\alpha/2}\cdot \sigma}{{E}}\right)^2\approx (2.576\cdot 1.3/0.2)^2 \approx 280.4.$$`

To get an error no more than 0.2 at the level 99%, the sample size should be at least 281.
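
Both sample-size formulas are easy to sketch in Python. The function names below are hypothetical, and rounding up with `ceil` reflects the requirement that the sample size be a whole number.

```python
from math import ceil
from statistics import NormalDist

def z_critical(confidence):
    """Critical value z_{alpha/2} for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def sample_size_for_proportion(p_guess, margin, confidence):
    """Smallest n with z * sqrt(p(1-p)/n) <= margin, using a guess for p."""
    return ceil((z_critical(confidence) / margin) ** 2 * p_guess * (1 - p_guess))

def sample_size_for_mean(sigma_guess, margin, confidence):
    """Smallest n with z * sigma / sqrt(n) <= margin, using a guess for sigma."""
    return ceil((z_critical(confidence) * sigma_guess / margin) ** 2)

print(sample_size_for_proportion(0.7, 0.02, 0.95))  # 2017
print(sample_size_for_mean(1.3, 0.2, 0.99))         # 281
```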

---

## Practice: Conceptual Questions on Confidence Intervals

Decide whether the following statements are true or false. Explain your reasoning.

.footmark[
  Source: [Conceptual Questions on Confidence Intervals](http://www2.stat.duke.edu/~jerry/sta101/confidenceintervalsans.html)
]

---

## Practice: Find Confidence Interval of Proportion of Kids

---

## Practice: Confidence Intervals for a Population Proportion

To understand the reason for returned goods, the manager of a store examines the records on 40 products that were returned in the last year. Reasons were coded by 1 for “defective,” 2 for “unsatisfactory,” and 0 for all other reasons, with the results shown in the table.
<table class="table" style="margin-left: auto; margin-right: auto;">
<tbody>
  <tr>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(42, 120, 142, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(187, 223, 39, 1) !important;">2</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(42, 120, 142, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">2</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(42, 120, 142, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
  </tr>
  <tr>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(187, 223, 39, 1) !important;">1</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(42, 120, 142, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(187, 223, 39, 1) !important;">2</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">2</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(187, 223, 39, 1) !important;">2</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
   <td style="text-align:center;"> <span style="     color: rgba(46, 180, 124, 1) !important;">0</span> </td>
  </tr>
</tbody>
</table>

1. Give a point estimate of the proportion of all returns that are because of something wrong with the product, that is, either defective or performed unsatisfactorily.

2. Construct an 80% confidence interval for the proportion of all returns that are because of something wrong with the product.

---

## Practice: Find Confidence Interval of Proportion Given Table

---

## Practice: Minimum Sample Size - Mean

A software engineer wishes to estimate, to within 5 seconds, the mean time that a new application takes to start up, with 95% confidence. Estimate the minimum size sample required if the standard deviation of start up times for similar software is 12 seconds.

.footmark[
 Source: [Exercise 7 in Section 7.4 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s11-04-sample-size-considerations.html)
]

---

## Practice: Minimum Sample Size - Proportion

The administration at a college wishes to estimate, to within two percentage points, the proportion of all its entering freshmen who graduate within four years, with 90% confidence. Estimate the minimum size sample required.

.footmark[
 Source: [Exercise 13 in Section 7.4 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s11-04-sample-size-considerations.html)
]

---

## Lab: Excel Functions for Normal Distributions

- Let `$Z$` be a standard normal random variable. In Excel, `$P(Z<z)$` is given by `NORM.S.DIST(z,TRUE)`.

- When a cumulative probability `$p=P(X<x)$` of a normal random variable `$X$` is given, we can find `$x$` using `NORM.INV(p,mean,sd)`.

- When a cumulative probability `$p=P(Z<z)$` of a standard normal random variable `$Z$` is given, we can find `$z$` using `NORM.S.INV(p)`.
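
For comparison, rough Python counterparts of these three Excel functions are sketched below, assuming the standard library's `statistics.NormalDist`. The mean 100 and standard deviation 15 are made-up values used only to illustrate `NORM.INV`.

```python
from statistics import NormalDist

Z = NormalDist()                   # standard normal: mean 0, sd 1
left_area = Z.cdf(1.96)            # like NORM.S.DIST(1.96, TRUE)
z_value = Z.inv_cdf(0.975)         # like NORM.S.INV(0.975)

X = NormalDist(mu=100, sigma=15)   # hypothetical mean and sd
x_value = X.inv_cdf(0.975)         # like NORM.INV(0.975, 100, 15)

print(round(left_area, 4), round(z_value, 4), round(x_value, 2))
```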

---

# Concepts of Hypothesis Testing

---

## Learning Goals for Hypothesis Tests

- Choose appropriate null and alternative hypotheses.

- Determine whether the test should be one-sided or two-sided.

- Calculate `$Z$`-test statistics and `$T$`-test statistics.

- Calculate the `$P$`-value or the rejection region.

- Determine whether to reject or fail to reject the null hypothesis.

- Interpret the results of a test of significance in context.

---

## The Basic Idea of Hypothesis Testing

- The testing procedure starts with an initial assumption that a statement about a population parameter is true.

- We test this initial assumption using a random sample. If the initial assumption is really the truth, then the test statistic from a random sample shouldn't be too far away from the center of the sampling distribution. Conversely, if the test statistic is .green[too far away] from the center, then we should .green[not believe in] the initial assumption.

- To determine how far is too far away, we need to specify a threshold in advance: a small probability, or equivalently a critical value.

- If the test statistic is at least as extreme as the critical value, then the result is significant enough to allow us to reject the initial assumption. Otherwise, we cannot draw a definite conclusion.

- The probability chosen in advance measures the chance that the initial assumption is wrongly rejected when it is actually true.

---

## Two Hypotheses

- A statistical **hypothesis** is a statement about a population parameter.

- A **hypothesis test** is a process that uses sample statistics to test a **hypothesis**.

- To test a population parameter, we choose a pair of hypotheses, the null hypothesis and the alternative hypothesis which are contradictory to each other.

- The **null hypothesis**, denoted by `$H_0$`, is the statement about the population parameter that is assumed to be true.

- The **alternative hypothesis**, denoted `$H_a$`, is a statement about the population parameter that is contradictory to the null hypothesis.

---

## Example: Identify the Null and the Alternative Hypotheses

1. Test a statement that the population mean is 1.
2. Test a statement that the population mean is more than 3.
3. Test a statement that the population mean is no more than 3.

**Solution:** Keep in mind that the null hypothesis should always contain the equal sign. The alternative hypothesis is contrary to the null hypothesis.

1. We may set the null hypothesis as `$H_0$`: `$\mu = 1$`. Unless the given information indicates otherwise, we may set the alternative hypothesis as `$H_a$`: `$\mu\ne 1$`.
2. We may set the null hypothesis as `$H_0$`: `$\mu = 3$` and the alternative hypothesis as `$H_a$`: `$\mu>3$`.
3. We may set the null hypothesis as `$H_0$`: `$\mu \leq 3$` and the alternative hypothesis as `$H_a$`: `$\mu>3$`.

---

## The Logic of Hypothesis Testing

The logic of hypothesis testing and two types of error can be summarized in the following table.

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;">  </th>
   <th style="text-align:center;"> $H_0$ is true </th>
   <th style="text-align:center;"> $H_0$ is false </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Reject $H_0$ </td>
   <td style="text-align:center;"> .red[Type I Error] </td>
   <td style="text-align:center;"> .green[Correct decision] </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Fail to Reject $H_0$ </td>
   <td style="text-align:center;"> .green[Correct decision] </td>
   <td style="text-align:center;"> .red[Type II Error] </td>
  </tr>
</tbody>
</table>

The interpretation of hypothesis testing is summarized in the following table.

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;">  </th>
   <th style="text-align:left;"> Testing statement is $H_0$ </th>
   <th style="text-align:left;"> Testing statement is $H_a$ </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Reject $H_0$ </td>
   <td style="text-align:left;"> .green[There is enough evidence to reject the statement] </td>
   <td style="text-align:left;"> .green[There is enough evidence to support the statement] </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fail to Reject $H_0$ </td>
   <td style="text-align:left;"> .blue[There is not enough evidence to reject the statement] </td>
   <td style="text-align:left;"> .blue[There is not enough evidence to support the statement] </td>
  </tr>
</tbody>
</table>

---

## Types of Errors in Hypothesis Testing

- Rejecting the null hypothesis when it is indeed true is called a **type I error**. The maximum allowable probability of making a type I error is called the **level of significance**, denoted by `$\alpha$`.

- Failing to reject the null hypothesis when it is false is called a **type II error**. The probability of a type II error is usually denoted by `$\beta$`. The **power of a hypothesis test**, which equals `$1-\beta$`, is the probability of rejecting the null hypothesis when it is false.

.footmark[
  Source: [An illustration of errors](https://effectsizefaq.com/2010/05/31/i-always-get-confused-about-type-i-and-ii-errors-can-you-show-me-something-to-help-me-remember-the-difference/).
 See also [the interactive demonstration of errors and the power](https://istats.shinyapps.io/power/).
]
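
As a rough numerical illustration of `$\alpha$`, `$\beta$` and power, the sketch below considers a right-tailed `$Z$`-test for a mean. All numbers (`mu0`, `mu_true`, `sigma`, `n`) are hypothetical and chosen only for illustration, and the function name `power_right_tailed` is not part of the course materials.

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed(mu0, mu_true, sigma, n, alpha):
    """Power of a right-tailed Z-test of H0: mu = mu0 when the true mean is mu_true."""
    se = sigma / sqrt(n)
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # standard critical value
    x_crit = mu0 + z_alpha * se                 # reject H0 when x-bar >= x_crit
    # beta = P(fail to reject H0) computed under the true mean
    beta = NormalDist(mu_true, se).cdf(x_crit)
    return 1 - beta

# Hypothetical values, for illustration only.
power = power_right_tailed(mu0=25, mu_true=26, sigma=4, n=100, alpha=0.05)
beta = 1 - power
print(f"beta = {beta:.3f}, power = {power:.3f}")
```

Under these assumptions, a larger sample size or a true mean farther from `mu0` gives a smaller `$\beta$` and therefore a larger power.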

???
`$$\alpha=P(\text{Type I error})= P(\text{reject a true }H_0).$$`

---

## Types of Tests

- If `$H_a$` has the form `$\mu\neq \mu_0$` the test is called a **two-tailed test**.

- If `$H_a$` has the form `$\mu<\mu_0$` the test is called a **left-tailed test**.

- If `$H_a$` has the form `$\mu>\mu_0$` the test is called a **right-tailed test**.

- Each of the last two forms is also called a **one-tailed test**.

---

## Rejection Region and Critical Value

- The sample statistic used to test the assumption is called a **test statistic**.

- A **rejection region** is the set of values of the test statistic for which the null hypothesis is rejected, because such values would be unlikely if the null hypothesis were true.

.center[
|Sign in `$H_a$` | `$\ne$` | `$<$` | `$>$`  |
|---|---|---|---|
|Rejection region  | Both sides | Left side | Right side |
]

- A **critical value** is a value that separates the rejection region from its complement. The calculation depends on the sampling distribution of the test statistic.

.footmark[
 See the interactive demonstration for [rejection regions for hypothesis tests](https://hselab.shinyapps.io/critvalues/).
]

- If the test statistic falls in the rejection region, then we reject the null hypothesis.

---

## Visualization: Rejection Region and Critical Value

.center[
  ![:resize Illustration of Rejection Region and Critical Value, 60%](data:image/png;base64,#Figures/Fig-Rejection-Regions.png)
]

.footmark[
  Source: [Figure 8.2 in Section 8.1 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s12-01-the-elements-of-hypothesis-tes.html).
]

---

## Computational Remarks

- Based on the central limit theorem, when testing a hypothesis about a mean or a proportion, we will use either a (standard) normal distribution or a Student’s t-distribution.

- The critical value(s) at the significance level `$\alpha$` can be calculated in two steps:
  1. Find the standard critical value.
  2. Apply the (inverse) standardization formula
    `$$\text{Critical value}= \text{Mean}\pm\text{Std. critical value}\cdot\text{SE}.$$`

- Alternatively, to make a decision using a rejection region, it is usually more convenient to compare the standard critical value with the **standardized test statistic**:
  `$$\text{Std. test statistic}=\frac{\text{Test statistic} - \text{Mean}}{\text{SE}}.$$`

---

## Standardized Rejection Region

- The standardized test statistic has a standard normal distribution

| Symbol in `$H_a$` | Type of Test   | Rejection Region |
|:---------:|:-------:|:-------:|
| `$<$`   | Left-tailed test  | `$(-\infty, -z_\alpha]$`    |
| `$>$`  | Right-tailed test | `$[z_\alpha, \infty)$`    |
| `$\neq$`   | Two-tailed test   | `$(-\infty, -z_{\alpha/2}]\cup [z_{\alpha/2}, \infty)$` |

- The standardized test statistic has a Student's `$t$`-distribution

| Symbol in `$H_a$` | Type of Test  | Rejection Region |
|:---------:|:-------:|:-------:|
| `$<$`  | Left-tailed test  | `$(-\infty, -t_\alpha]$`   |
| `$>$`  | Right-tailed test | `$[t_\alpha, \infty)$`  |
| `$\neq$`  | Two-tailed test   | `$(-\infty, -t_{\alpha/2}]\cup [t_{\alpha/2}, \infty)$` |

- Recall that `$z_{c}$`=`NORM.S.INV(1-c)` (or `$t_c$`=`T.INV(1-c, df)`) is the value such that `$P(Z<z_c)=1-c$` (respectively `$P(T<t_c)=1-c$`).
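
The first table can be reproduced numerically. The following is a minimal sketch in Python, where `NormalDist().inv_cdf` plays the role of `NORM.S.INV`; the function name `z_rejection_region` and the tail labels `'left'`, `'right'`, `'two'` are illustrative choices, not part of the course materials.

```python
from math import inf
from statistics import NormalDist

def z_rejection_region(alpha, tail):
    """Return the standardized rejection region as a list of (low, high) intervals.

    tail is 'left', 'right' or 'two', matching the sign <, > or != in Ha.
    """
    z = NormalDist().inv_cdf          # counterpart of NORM.S.INV
    if tail == "left":
        return [(-inf, -z(1 - alpha))]
    if tail == "right":
        return [(z(1 - alpha), inf)]
    if tail == "two":
        c = z(1 - alpha / 2)
        return [(-inf, -c), (c, inf)]
    raise ValueError("tail must be 'left', 'right' or 'two'")

for tail in ("left", "right", "two"):
    print(tail, z_rejection_region(0.05, tail))
```

A `$t$`-based version would need the inverse of the Student's `$t$`-distribution, which the Python standard library does not provide.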

---

## Example: Make a Decision Using Rejection Region

Suppose the population standard deviation is `$\sigma=4.3$`. At the significance level `$\alpha=0.02$`, construct a standardized rejection region for the following test for the population mean
.center[
Test `$H_0: \mu=21.6$` vs. `$H_a: \mu<21.6$`.
]

Make a decision if a random sample has the size `$n=70$` and mean `$\bar{x}=20.5$`.

**Solution:** Due to the form of `$H_a: \mu< 21.6$`, the rejection region should contain the left tail.

Then the standard critical value is `$-z_{0.02}$`=`-NORM.S.INV(1-0.02)` `$\approx$` -2.054. So the standardized rejection region is `$(-\infty, -2.054]$`.

The standardized test statistic
$$
z=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}
=\frac{20.5-21.6}{4.3/\sqrt{70}}\approx -2.14
$$
is in the rejection region, so we reject the null hypothesis.
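
The arithmetic in this example can be checked with a short Python sketch. It uses only the numbers given above; the variable names are illustrative. It also converts the standardized cutoff back to the `$\bar{x}$` scale with the inverse standardization formula.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar, alpha = 21.6, 4.3, 70, 20.5, 0.02

se = sigma / sqrt(n)
z_crit = -NormalDist().inv_cdf(1 - alpha)   # left-tailed test: -z_alpha
z_stat = (xbar - mu0) / se                  # standardized test statistic
x_crit = mu0 + z_crit * se                  # the same cutoff on the x-bar scale

reject = z_stat <= z_crit
print(f"z_crit = {z_crit:.3f}, z = {z_stat:.2f}, x_crit = {x_crit:.2f}, reject H0: {reject}")
```

Comparing `z_stat` with `z_crit` and comparing `xbar` with `x_crit` always lead to the same decision, since one comparison is a rescaling of the other.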

---

## Observed Significance

- To make a decision, one may also compare probabilities. The **observed significance** (**`$P$`-value**) of a test statistic is the probability of obtaining a sample statistic at least as extreme as the observed test statistic, assuming that the null hypothesis is true.

- `$P$`-Value as Tail area

.center[
  |Sign in `$H_a$` | `$\ne$` | `$<$` | `$>$`  |
  |---|---|---|---|
  | `$P$`-value  | Double of the tail area | Left tail area | Right tail area |
  ]

- Make a decision by comparing the `$P$`-value with the significance level `$\alpha$`:

  - reject `$H_0$` if `$p\le\alpha$`, and

  - do not reject `$H_0$` if `$p>\alpha$`.
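
The tail-area table and the decision rule can be sketched in Python for a standardized `$Z$` statistic. The helper names `z_p_value` and `decide` are illustrative, and the input `$z=-2.14$` with `$\alpha=0.02$` is taken from the left-tailed rejection region example.

```python
from statistics import NormalDist

def z_p_value(z, tail):
    """P-value of a standardized statistic z under the standard normal curve."""
    left = NormalDist().cdf(z)            # area to the left of z
    if tail == "left":
        return left
    if tail == "right":
        return 1 - left
    if tail == "two":
        return 2 * min(left, 1 - left)    # double the smaller tail area
    raise ValueError("tail must be 'left', 'right' or 'two'")

def decide(p, alpha):
    """Apply the rule: reject H0 exactly when p <= alpha."""
    return "reject H0" if p <= alpha else "do not reject H0"

p = z_p_value(-2.14, "left")
print(round(p, 4), decide(p, 0.02))
```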

---

## Example: Make a Decision Using the `$P$`-value

Given the following testing hypotheses
.center[
`$H_0: p=0.5$` vs. `$H_a: p\ne 0.5$`,
]
find the `$P$`-value for a random sample of size `$n=360$` with sample proportion `$\hat{p}=0.56$` and make a decision.

**Solution:** Because `$H_a$` is `$\color{purple}{p\ne p_0}$` and `$\color{grey}{\hat{p}=0.56>p_0}$`, the `$P$`-value is the .purple[double] of the .grey[right tail] area, that is, the `$P$`-value equals `$2P(\hat{p}>0.56)$`.

We first find the standard error of the null distribution:
`$$\text{SE}=\sqrt{p_0(1-p_0)/n}=\sqrt{0.5\cdot0.5/360}\approx 0.03.$$`

The `$P$`-value is approximately 0.0455, which can be calculated by the Excel function `2*(1-NORM.DIST(0.56,0.5,0.03,TRUE))`.

Since the `$P$`-value is smaller than `$\alpha$`, we reject the null hypothesis `$H_0$`.
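
The `$P$`-value above depends on how the standard error is rounded. The sketch below, which uses only the numbers from this solution, compares the result for the SE rounded to two decimal places with the result for the unrounded SE; the helper name `two_tailed_p` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0, n, p_hat = 0.5, 360, 0.56

se_exact = sqrt(p0 * (1 - p0) / n)
se_rounded = round(se_exact, 2)          # the two-decimal value used in the solution

def two_tailed_p(se):
    """Double the right-tail area beyond p_hat under a normal curve N(p0, se)."""
    return 2 * (1 - NormalDist(p0, se).cdf(p_hat))

print(f"SE = {se_exact:.4f}")
print(f"P-value with SE rounded to 2 decimals: {two_tailed_p(se_rounded):.4f}")
print(f"P-value with unrounded SE:             {two_tailed_p(se_exact):.4f}")
```

The rounded SE reproduces the value quoted in the solution, while the unrounded SE gives a noticeably smaller `$P$`-value, so it is safer to keep several decimal places in intermediate steps.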

---

## Practice: Conceptual Understanding on Hypothesis Testing

Decide whether the following statements are true or false. Explain your reasoning.

- In case of a left-tailed test, we reject the null hypothesis if the sample statistic is significantly smaller than the hypothesized population parameter.

- A `$P$`-value of 0.08 is more evidence against the null hypothesis than a `$P$`-value of 0.04.

- The statement, "the `$P$`-value is 0.03", is equivalent to the statement, "there is a 3% probability that the null hypothesis is true".

- Even though you rejected the null hypothesis,  it may still be true.

- Failing to reject null hypothesis means the null hypothesis is true.

- That the `$P$`-value of a sample statistic is `$p=0$` means the null hypothesis cannot be true.

.footmark[
Questions are partially taken from [Conceptual questions on hypothesis testing](http://www2.stat.duke.edu/~jerry/sta101/tests.html)
]

---

## Practice: Identify Hypotheses and Determine the Type of Test

---

## Practice: Find the Rejection Region

Suppose the standardized test statistic has a `$Z$`-distribution. Find the standardized rejection region for each of the following testing scenarios.

- `$H_{0}: \mu=27 \text{ vs. } H_{a} : \mu<27$` with `$\alpha=0.01$`
- `$H_{0}: \mu=52 \text{ vs. } H_{a} : \mu \neq 52$` with `$\alpha=0.05$`
- `$H_{0}: \mu=-105 \text{ vs. } H_{a} : \mu>-105$` with `$\alpha=0.02$`

---

## Practice: Find the `$P$`-value

Suppose we’re conducting a hypothesis test for a population mean. Find the `$P$`-value for each of the following testing scenarios with the given sample size `$n$` and test statistic `$t$`.

- `$H_{0}: \mu=25 \text { vs. } H_{a} : \mu<25$`, `$n=30$`, `$t=-2.43$`.
- `$H_{0}: \mu=35 \text { vs. } H_{a} : \mu>35$`, `$n=50$`, `$t=2.13$`.
- `$H_{0}: \mu=-7.9 \text { vs. } H_{a} : \mu\ne-7.9$`, `$n=40$`, `$t=-1.99$`.

---

## Practice: Make a Decision Based on the `$P$`-value

---

## Practice: Interpret a Decision

---

## Lab: Excel Functions for Normal Distributions

- Let `$Z$` be a standard normal random variable. In Excel, `$P(Z<z)$` is given by `NORM.S.DIST(z,TRUE)`.

- When a cumulative probability `$p=P(X<x)$` of a normal random variable `$X$` is given, we can find `$x$` using `NORM.INV(p,mean,sd)`.

- When a cumulative probability `$p=P(Z<z)$` of a standard normal random variable `$Z$` is given, we can find `$z$` using `NORM.S.INV(p)`.
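
For readers who work outside Excel, the same three calculations can be sketched with Python's standard library class `statistics.NormalDist`. The inputs below (1.96, 0.975, mean 100, standard deviation 15) are made-up values for illustration only.

```python
from statistics import NormalDist

Z = NormalDist()                 # standard normal: mean 0, sd 1

# Counterpart of NORM.S.DIST(z, TRUE): cumulative probability P(Z < z)
p = Z.cdf(1.96)

# Counterpart of NORM.S.INV(p): the z with P(Z < z) = p
z = Z.inv_cdf(0.975)

# Counterpart of NORM.INV(p, mean, sd): the x with P(X < x) = p
x = NormalDist(mu=100, sigma=15).inv_cdf(0.975)   # made-up mean and sd

print(round(p, 4), round(z, 4), round(x, 2))
```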

---

## Lab: Excel Functions for `$T$`-Distributions

Suppose a Student's `$T$`-distribution has `$\text{df}=n-1$` degrees of freedom.

- To find a probability for a given `$T$`-value

  - The area of the left tail of the `$T$`-value may be calculated by the function `T.DIST(t,df,TRUE)`.

  - The area of the right tail of the `$T$`-value may be calculated by the function `T.DIST.RT(t, df)`.

  - The area of two tails of the `$T$`-value (`$t>0$`) may be calculated by the function `T.DIST.2T(t,df)`.

- To find the critical value for a given probability `$p$`
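
Python's standard library has no Student's `$t$`-distribution, so the sketch below approximates these Excel functions numerically: the density is integrated with Simpson's rule, and the inverse is found by bisection. This is an illustration under those assumptions, not how Excel computes the values, and all function names are hypothetical. The checks at the end reuse inputs that appear in worked examples of this deck.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """Approximate P(T < t), like T.DIST(t, df, TRUE)."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # approximates P(0 < T < |t|)
    return 0.5 + area if t >= 0 else 0.5 - area   # symmetry about 0

def t_right_tail(t, df):
    """Approximate right-tail area, like T.DIST.RT(t, df)."""
    return 1 - t_cdf(t, df)

def t_two_tails(t, df):
    """Approximate two-tail area, like T.DIST.2T(t, df)."""
    return 2 * t_right_tail(abs(t), df)

def t_inv(p, df, lo=-50.0, hi=50.0):
    """Approximate the t with P(T < t) = p, like T.INV(p, df), by bisection."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_inv(0.95, 24), 2))          # compare with T.INV(1-0.05, 24)
print(round(t_two_tails(2.342, 9), 4))    # compare with T.DIST.2T(2.342, 9)
print(round(t_right_tail(2.013, 39), 4))  # compare with T.DIST.RT(2.013, 39)
```

The search interval in `t_inv` and the use of `math.gamma` limit this sketch to moderate degrees of freedom and probabilities that are not extremely close to 0 or 1.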

---
class: center middle topic

# Hypothesis Testing for One Mean or One Proportion

---

## Learning Goals for Hypothesis Tests

- Perform an appropriate hypothesis test for a statement about mean using data from a random sample.

- Perform an appropriate hypothesis test for a statement about proportion using data from a random sample.

---

## Hypothesis Testing Procedure

1. Check if the sample size is large enough and determine if a `$Z$`-test or `$T$`-test can be performed. For proportion, `$Z$`-test may be used. For mean, if `$\sigma$` is known, the `$Z$`-test may be used. If `$\sigma$` is unknown, the `$T$`-test may be used.

2. State the null and alternative hypotheses. The null hypothesis always contains the equal sign (and possibly a less than or greater than symbol, depending on `$H_a$`).

3. Set a significance level `$\alpha$`. Commonly used levels are `$\alpha=0.01$`, `$\alpha=0.05$` and `$\alpha=0.1$`.

4. Calculate the standardized test statistic: the `$Z$`-test statistic or the `$T$`-test statistic.

5. Calculate the `$P$`-value, or construct the rejection region. (.green[Drawing a picture is recommended.])  
  .center[
  |Sign in `$H_a$` | `$\ne$` | `$<$` | `$>$`  |
  |---|---|---|---|
  | Test | Two-tailed | Left-tailed | Right-tailed|
  ]

6. Make a test decision about the null hypothesis `$H_0$`. We reject `$H_0$` if the test statistic falls in the rejection region or the `$P$`-value is less than or equal to the significance level `$\alpha$`.

7. State an overall conclusion.

---

## Some Remarks

- The term test statistic often refers to the standardized test statistic, which expresses the difference between the sample statistic and the hypothesized parameter in units of the standard error.

- The `$P$`-value approach is slightly more popular in hypothesis testing because it gives a more detailed description of the data and makes it easier to decide at different significance levels.

- A hypothesis test decision may be interpreted using the confidence interval. The rejection region of a hypothesis test can be obtained as the complement of a confidence interval.

- A hypothesis testing procedure is comparable to a criminal trial: a defendant is considered not guilty as long as his or her guilt is not proven. See Wiki page on [Statistical Hypothesis Testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing) for more detail on the comparison.

---

## Example: Test a Mean with Known SD Using Rejection Region (1/2)

Residents on a certain street claim that the mean speed of automobiles driving through the street is greater than the speed limit of 25 miles per hour. A random sample of 100 automobiles has a mean speed of 26 miles per hour. Assume the population standard deviation is 4 miles per hour. Is there enough evidence to support the residents' claim at the significance level `$\alpha = 0.05$`?

**Solution:** The sample size is `$n=100>30$`, so the sampling distribution of sample means is approximately normal by the central limit theorem.

To test the residents' claim, we set `$H_0:\mu=25$` and `$H_a: \mu >25$`.

Because `$H_a$` contains the `$>$` sign and `$\sigma$` is known, we use a right-tailed `$Z$`-test.

Since the population standard deviation `$\sigma=4$` is known, we use the standard normal distribution to find the critical value.

With the given significance level `$\alpha=0.05$`, the critical value is `$z_{0.05}\approx 1.64$`, given by the Excel function `NORM.S.INV(1-0.05)`.

---

## Example: Test a Mean with Known SD Using Rejection Region (2/2)

**Solution: (Continued)** The rejection region is the interval `$[1.64, \infty)$`.

.center[
  ![:resize Right-Tail Test for a Mean, 50%](data:image/png;base64,#Figures/right-tail-mean-01.png)
]

The `$Z$`-test statistic is `$z=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}=\frac{26-25}{4/\sqrt{100}}=2.5$`. It is in the rejection region, so we reject `$H_0:\mu=25$` and support `$H_a:\mu>25$`.

At the significance level `$\alpha=0.05$`, there is enough evidence to support the residents' claim that the average speed of automobiles is above the speed limit.
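
As a check, the sketch below reruns this example in Python with both the rejection region approach and the `$P$`-value approach. It uses only the numbers stated in the problem; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar, alpha = 25, 4, 100, 26, 0.05

z_stat = (xbar - mu0) / (sigma / sqrt(n))     # standardized test statistic
z_crit = NormalDist().inv_cdf(1 - alpha)      # right-tailed test: z_alpha
p_value = 1 - NormalDist().cdf(z_stat)        # right-tail area beyond z_stat

print(f"z = {z_stat:.2f}, critical value = {z_crit:.3f}, P-value = {p_value:.4f}")
print("reject H0 (rejection region):", z_stat >= z_crit)
print("reject H0 (P-value):         ", p_value <= alpha)
```

The two approaches agree because the test statistic lies beyond the critical value exactly when the tail area beyond it is at most `$\alpha$`.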

---

## Example: Test a Mean with Unknown SD Using Rejection Region (1/2)

A car manufacturer claims that a new fuel injection design increases the mean mileage on a certain model of car above its current 28.5 miles per gallon level. Twenty-five of the new designs were checked and the mean recorded as 30.0 miles per gallon with a standard deviation of 3.8 miles per gallon. Assume that mean mileages are approximately normally distributed. Evaluate this claim at the 5% level of significance.

**Solution:** Since the population is approximately normally distributed, the sampling distribution of sample means is also approximately normal, even though the sample size `$n=25$` is small.

To test the manufacturer's claim, we set `$H_0:\mu=\mu_0=28.5$` and `$H_a: \mu >28.5$`.

Because the alternative hypothesis claims "greater" and `$\sigma$` is unknown, we use a right-tailed `$T$`-test.

Since the population standard deviation is unknown, we use the `$T$`-distribution to test the claim.

The number of degrees of freedom is `$\text{df}=25-1=24$`. With the given significance level `$\alpha=0.05$`, the critical value is `$t_{0.05}\approx 1.71$`, given by the Excel function `T.INV(1-0.05, 24)`.

---

## Example: Test a Mean with Unknown SD Using Rejection Region (2/2)

**Solution: (Continued)** The rejection region is `$[1.71, \infty)$`.

.center[
  ![:resize Right-Tail Test for a Mean, 40%](data:image/png;base64,#Figures/T-right-tail-01.png)
]

The `$T$`-test statistic is
$$
t=\dfrac{\bar{x}-\mu_0}{s/\sqrt{n}}=\dfrac{30-28.5}{3.8/\sqrt{25}}\approx 1.97.
$$

Because `$t=1.97>t_{0.05}=1.71$`, that is, the `$T$`-test statistic is in the rejection region, we reject `$H_0$` and support the alternative hypothesis `$H_a:\mu>28.5$`.

At the significance level 5%, there is enough evidence to support the claim that new designs increase the mean mileage.

---

## Example: Test a Mean with Unknown SD Using `$P$`-value

**Example:** A certain manufacturer claims that the average number of candies in a certain sized bag that they produce is 20. To test the claim, you collected a random sample of 10 bags and found that the mean is 18 and the standard deviation is 2.7. Assume the numbers of candies are normally distributed. At the significance level `$\alpha=0.05$`, does your analysis support the manufacturer's claim?

**Solution:** Since the population is normally distributed, the sampling distribution for the sample mean is approximately normal.

Set `$H_0: \mu=20$` and `$H_a: \mu\neq 20$`.

Since `$H_a$` has the `$\ne$` sign and the population standard deviation is unknown, we use a two-tailed `$T$`-test. We will find the `$P$`-value.

The `$T$`-test statistic is `$t=\frac{18-20}{2.7/\sqrt{10}}\approx-2.342.$` Using Excel, we find that the `$P$`-value is `$p\approx$` `T.DIST.2T(2.342,9)`=0.0439.

Since the `$P$`-value is smaller than the significance level, we reject `$H_0$`, which means there is enough evidence to reject the manufacturer's claim at the significance level 0.05.

---

## Example: Test a Mean Using `$P$`-value from a Data Set (1/2)

An instructor would like to know if the students enrolled in a math course in the current semester performed better than students in the last semester. The mean final exam score from last semester is 75.5. The final exam scores of 40 randomly selected students were obtained.

</div>

Do the data provide evidence that the students in this semester performed significantly better on the final than last semester?

**Solution:** The sample size is `$n=40$` which is large enough so that the sampling distribution for the sample mean is approximately normal. We will take the `$P$`-value approach.

Set `$H_0: \mu=75.5$` and `$H_a: \mu>75.5$`.

Using Excel functions `AVERAGE()` and `STDEV.S()`, we find the sample mean is `$\bar{x}\approx 78.17$` and sample standard deviation is `$s\approx 8.39$`.

---

## Example: Test a Mean Using `$P$`-value from a Data Set (2/2)

**Solution: (Continued)** The `$T$`-test statistic is calculated by
`$$t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{(78.17-75.5)}{8.39/\sqrt{40}}\approx 2.013.$$`

Because `$H_a$` contains the `$>$` sign and `$\sigma$` is unknown, we use the right-tailed `$T$`-test.

The number of degrees of freedom is `$\text{df}=40-1=39$`. The `$P$`-value is the right tail area under the `$T$`-curve, that is, `T.DIST.RT(2.013, 39)`=0.0255.

Since the `$P$`-value is less than 5%, at the 5% level of significance, we may reject `$H_0$`. So at 5% level of significance,  there is enough evidence to support the claim that the students in this semester performed significantly better on the final than last semester.

However, using the 2% level of significance, with the given data, we fail to reject `$H_0$`. Then, at the 2% level of significance, there is not enough evidence to support the claim.

---

## Example: Fairness of a Coin

Suppose you want to determine if a coin is fair. You toss the coin 50 times and observe 16 heads and 34 tails. If the coin is fair, the probability of getting 16 heads or less is about 0.008 = 0.8%. At the significance level 0.01, do you think that the coin is fair?

**Solution:** Since `$n\hat{p}=16$` and `$n(1-\hat{p})=34$`, a `$Z$`-test is valid.

To test if the coin is fair, we set the null hypothesis as `$H_0$`: `$p=0.5$`. The experiment suggests that we should set the alternative hypothesis as `$H_a$`: `$p<0.5$`.

The test statistic is `$\hat{p}=\frac{16}{50}=0.32$` and the standardization is
`$$z=\dfrac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}=\dfrac{0.32-0.5}{\sqrt{0.5(1-0.5)/50}}=-2.55.$$`

From `$H_a$`, we know that the test is left-tailed. The `$P$`-value is then `$P=0.008$`.

Because the significance level is `$\alpha=0.01$` and `$P=0.008<0.01=\alpha$`, we reject the null hypothesis `$H_0$`. We conclude that, at the significance level 0.01, there is enough evidence to claim that the coin is unfair.
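
The probability quoted in the problem can be compared with the normal approximation behind the `$Z$` statistic. The sketch below computes the exact binomial probability of at most 16 heads and the left-tail area of `$z$`; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, heads, p0 = 50, 16, 0.5

# Exact probability of at most 16 heads in 50 tosses of a fair coin.
exact = sum(comb(n, k) for k in range(heads + 1)) * p0 ** n

# Normal approximation through the standardized test statistic.
p_hat = heads / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
approx = NormalDist().cdf(z)              # left-tail area

print(f"exact = {exact:.4f}, z = {z:.2f}, normal approximation = {approx:.4f}")
```

Both values are below `$\alpha=0.01$`, so either way of computing the `$P$`-value leads to rejecting `$H_0$` here.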

???
Draw a normal curve to show the rejection region.

---

## Example: Proportion of Newborns (1/2)

Globally the long-term proportion of newborns who are male is 51.46%. A researcher believes that the proportion of boys at birth changes under severe economic conditions. To test this belief, randomly selected birth records of 5,000 babies born during a period of economic recession were examined. It was found in the sample that 52.55% of the newborns were boys. Determine whether there is sufficient evidence, at the 10% level of significance, to support the researcher’s belief.

**Solution:** Since `$n\hat{p}\approx 2628$` and `$n(1-\hat{p})\approx 2372$`, a `$Z$`-test is valid. We will use the `$P$`-value to test the hypothesis.

To test the researcher's claim, we set the null hypothesis as `$H_0$`: `$p=0.5146$`. The experiment suggests that we should set the alternative hypothesis as `$H_a$`: `$p\neq 0.5146$`.

The standard test statistic is
$$
z=\dfrac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}=\dfrac{0.5255-0.5146}{\sqrt{0.5146\cdot(1-0.5146)/5000}}\approx 1.5422.
$$

---

## Example: Proportion of Newborns (2/2)

**Solution: (Continued)** From `$H_a$`, we know that the test is two-tailed. The `$P$`-value is then
`$$P=2\cdot(1-P(Z<1.5422))\approx 0.123.$$`

Since the significance level is `$\alpha=0.1$` and the `$P$`-value `$P=0.123>0.1=\alpha$`, we fail to reject the null hypothesis `$H_0$`.

At the significance level 0.1, there is not enough evidence to support the researcher's belief that the proportion of newborns who are male changes.
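
The two-tailed calculation for this example can be checked with the following sketch, which uses only the numbers from the problem statement; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0, p_hat, n, alpha = 0.5146, 0.5255, 5000, 0.10

z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)      # standardized test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-tailed: double the tail area

print(f"z = {z:.4f}, P-value = {p_value:.3f}, reject H0: {p_value <= alpha}")
```

With these inputs the `$P$`-value comes out at about 0.123, which is above the 10% level, so `$H_0$` is not rejected.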

---

## A Remark on the SE for Sample Proportion in Hypothesis Testing

In some books, the standard error of the sampling distribution of sample proportions, assuming that `$p=p_0$`, is calculated using the approximation
$$
\sigma_{\hat{p}}=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.
$$

A possible explanation is that using the above value for the SE makes a two-tailed test consistent with the confidence interval approach to hypothesis testing.
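
To see how much the choice of SE matters in practice, the sketch below compares the two formulas using the numbers from the newborn example; the variable names are illustrative.

```python
from math import sqrt

p0, p_hat, n = 0.5146, 0.5255, 5000      # values from the newborn example

se_null = sqrt(p0 * (1 - p0) / n)            # SE computed from p0 (under H0)
se_sample = sqrt(p_hat * (1 - p_hat) / n)    # SE estimated from the sample proportion

z_null = (p_hat - p0) / se_null
z_sample = (p_hat - p0) / se_sample

print(f"SE from p0:    {se_null:.5f}, z = {z_null:.4f}")
print(f"SE from p-hat: {se_sample:.5f}, z = {z_sample:.4f}")
```

For a large sample with `$\hat{p}$` close to `$p_0$`, as here, the two standard errors and the resulting `$z$` values are nearly identical; the difference can be larger for small samples.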

---

## Practice: Testing the Mean Lift Weight (with Known SD)

A college football coach thought that his players could bench press a mean weight of 275 pounds. It is known that the standard deviation is 55 pounds. Three of his players thought that the mean weight was more than that amount. They asked 30 of their teammates for their estimated maximum lift on the bench press exercise. The mean of their maximum lifts is 286.2 pounds.

Conduct a hypothesis test using a 2.5% level of significance to determine if the bench press mean is more than 275 pounds.

.footmark[
  Source: [Module 9 in Introductory Statistics (Lumen)](https://courses.lumenlearning.com/introstats1/chapter/additional-information-and-full-hypothesis-test-examples/).
]

---

## Practice: Testing the Mean Age of Students (with Unknown SD)

A college report says that the mean age of students is 23.4 years. An instructor thinks that the mean age is less than 23.4. He randomly surveyed 50 students and found that the sample mean is 21.5 and the standard deviation is 1.9. At the significance level `$\alpha=0.025$`, is there enough evidence to support the instructor's claim?

---

## Practice: Testing the Average Household Size

The average household size in a certain region several years ago was 3.14 persons. A sociologist wishes to test, at the 5% level of significance, whether it is different now. Perform the test using the information collected by the sociologist: in a random sample of 75 households, the average size was 2.98 persons, with sample standard deviation 0.82 person.

.footmark[
  [Exercise 11 in Section 8.2 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s12-02-large-sample-tests-for-a-popul.html).
]

---

## Practice: Testing the Mean Placement Test Score

The mean score on a 25-point placement exam in mathematics used for the past two years at a large state university is 14.3. The placement coordinator wishes to test whether the mean score on a revised version of the exam differs from 14.3. She gives the revised exam to 30 entering freshmen early in the summer; the mean score is 14.6 with standard deviation 2.4.

1. Perform the test at the 10% level of significance using the critical value approach.
2. Compute the observed significance of the test.
3. Perform the test at the 10% level of significance using the `$P$`-value approach.

.footmark[
  [Exercise 9 in Section 8.3 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s12-03-the-observed-significance-of-a.html).
]

---

## Practice: Testing the Mean Recovery Time

The average number of days to complete recovery from a particular type of knee operation is 123.7 days. From his experience a physician suspects that use of a topical pain medication might be lengthening the recovery time. He randomly selects the records of seven knee surgery patients who used the topical medication. The times to total recovery were:

Assuming a normal distribution of recovery times, perform the relevant test of hypotheses at the 10% level of significance.

Would the decision be the same at the 5% level of significance?

.footmark[
  [Exercise 15 in Section 8.4 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s12-04-small-sample-tests-for-a-popul.html).
]

---

## Lab: Excel Functions for Normal Distributions

- Let `$Z$` be a standard normal random variable. In Excel, `$P(Z<z)$` is given by `NORM.S.DIST(z,TRUE)`.

- When a cumulative probability `$p=P(X<x)$` of a normal random variable `$X$` is given, we can find `$x$` using `NORM.INV(p,mean,sd)`.

- When a cumulative probability `$p=P(Z<z)$` of a standard normal random variable `$Z$` is given, we can find `$z$` using `NORM.S.INV(p)`.

---

## Lab: Excel Functions for `$T$`-Distributions

Suppose a Student's `$T$`-distribution has `$\text{df}=n-1$` degrees of freedom.

- To find a probability for a given `$T$`-value

  - The area of the left tail of the `$T$`-value may be calculated by the function `T.DIST(t,df,TRUE)`.

  - The area of the right tail of the `$T$`-value may be calculated by the function `T.DIST.RT(t, df)`.

  - The area of two tails of the `$T$`-value (`$t>0$`) may be calculated by the function `T.DIST.2T(t,df)`.

- To find the critical value for a given probability `$p$`