Sampling Distributions class: center, middle, inverse, title-slide .title[ # Sampling Distributions ] .subtitle[ ## MA336 Statistics ] .author[ ### Fei Ye Department of Mathematics and Computer Science ] .date[ ### July 2022 ] --- ## Learning Goals for Sampling Distribution - Demonstrate understanding of the sampling distribution of a statistic. - Explain how the central limit theorem applies in inference. - Determine whether a sampling distribution is approximately a normal distribution. - Calculate key characteristics (mean, standard error) of the sampling distribution of a statistic. - Estimate the probability of an event using the sampling distribution. --- ## Sampling Distribution - When using sample statistics to estimate population parameter, there will be a chance error `$$\text{Population Parameter}=\text{Sample Statistic}+\text{Chance Error}.$$` - To understand the chance error, we need to know how sample statistics distribute. Consider samples of the same size `\(n\)` randomly chosen from the population with replacement. - The probability distribution of a sample statistic is called a **sampling distribution**. --- ## Visualization: Sampling Distribution from a Discrete Random Variable <iframe src="https://istats.shinyapps.io/SampDist_discrete/" width="100%" height="550px" data-external="1"></iframe> .footmark[ Source: [https://istats.shinyapps.io/SampDist_discrete/](https://istats.shinyapps.io/SampDist_discrete/) ] --- ## Visualization: Sampling Distribution from a Continuous Random Variable <iframe src="https://istats.shinyapps.io/sampdist_cont/" width="100%" height="550px" data-external="1"></iframe> .footmark[ Source: [https://istats.shinyapps.io/sampdist_cont/](https://istats.shinyapps.io/sampdist_cont/) ] --- ## Sample Size Affects Standard Error - The sampling distribution varies as the sample size changes. In general, A larger sample size will result a smaller standard deviation of the sampling distribution. - The standard deviation of a sampling distribution is also called the **standard error**. --- ## Central Limit Theorem for Mean - **The Central Limit Theorem:** As the sample size `\(n\)` increases, the sampling distribution of the sample mean, from a population with the mean `\(\mu\)` and the standard deviation `\(\sigma\)`, will approach to a normal distribution with the mean `\(\mu_{\bar{X}}=\mu\)` and the standard deviation `\(\sigma_{\bar{X}}=\dfrac{\sigma}{\sqrt{n}}\)`. - **Remark:** In terms of standardization, the central limit theorem says that the random variable `\(\bar{Z}=\dfrac{\bar{x}-\mu}{\sigma/\sqrt{n}}\)` has an approximately standard normal distribution. --- ## Visualization: Central Limit Theorem for Mean <iframe src="https://seeing-theory.brown.edu/probability-distributions/index.html#section3" width="100%" height="550px" data-external="1"></iframe> .footmark[ Source: [https://seeing-theory.brown.edu/probability-distributions/index.html#section3](https://seeing-theory.brown.edu/probability-distributions/index.html#section3) ] --- ## Required Sample Size to Apply the Central Limit Theorem for Mean - For most distributions (not highly skewed), when sample size `\(n>30\)`, the sampling distribution of the sample mean `\(\bar{X}\)` can be approximated reasonably well by a normal distribution. The larger the sample size, the better the approximation will be. - When the population is normally distributed, the sampling distribution of the sample means will be normally distributed for any sample size. - If the population distribution is highly skewed, relying on CLT can be risky. ??? - [See the discussion on intuitive explanation.](https://stats.stackexchange.com/questions/3734/what-intuitive-explanation-is-there-for-the-central-limit-theorem/3904#3904) --- ## Example: Sampling Distribution of a Small Data Set (1/2) Randomly draw samples of size 2 with replacement from the numbers 1, 3, 4. - List all possible samples and calculate the mean of each sample. - Find the mean, and standard deviation of the sample means. - Find the mean, and standard deviation of the population. -- **Solution:** Using the Excel function `AVERAGE()`, we may find means of samples and means of sample means. Using the Excel function `STDEV.P()`, we may find the standard deviation of the population and the standard deviation of sample means. .center[ .left-column[ | `\(\color{red}{\mu}\)` | `\(\color{red}{\sigma}\)` | `\(\color{blue}{\mu_{\bar{X}}}\)` | `\(\color{blue}{\sigma_{\bar{X}}}\)` | | --------- | ----- | ----- | ----- | | .red[2.7] | .red[1.25] |.blue[2.7]|.blue[0.88]| ] .right-col[ | sample | (1,1) | (1,3) | (1,4) | (3,1) | (3,3) | (3,4) | (4,1) | (4,3) | (4,4) | | --------- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | | `\(\bar{X}\)` | 1 | 2 | 2.5 | 2 | 3 | 3.5 | 2.5 | 3.5 | 4 | ] ] It can be verified that `\(\mu_{\bar{X}}=\mu\)` and `\(\sigma/\sqrt{n}=1.25/\sqrt{2}\approx 0.88=\sigma_{\bar{X}}\)`. --- ## Example: Sampling Distribution of a Small Data Set (2/2) **Solution: (Continued)** The following are the distribution of the population and the distribution of sample means. .pull-left[ .center[ <img src="data:image/png;base64,#MA336-Week-9-Sampling-Distributions_files/figure-html/unnamed-chunk-2-4.png" width="1500" /> ] ] .pull-right[ .center[ <img src="data:image/png;base64,#MA336-Week-9-Sampling-Distributions_files/figure-html/unnamed-chunk-3-1.png" width="1500" /> ] ] --- ## Example: Mean Length of Time on Hold **Example:** Suppose the mean length of time that a caller is placed on hold when telephoning a customer service center is 23.8 seconds, with standard deviation 4.6 seconds. Find the probability that the mean length of time on hold in a random sample of 1,000 calls will be within 0.5 second of the population mean. -- **Solution:** Since the sample size `\(n=1000>30\)` is large enough, by the Central Limit Theorem, we know that the mean length of time is approximately normally distributed. The mean of the sampling distribution is `\(\mu_{\bar{X}}=\mu=23.8\)`. The standard deviation of the sampling distribution is `\(\mu_{\bar{X}}=\dfrac{\sigma}{\sqrt{n}}=\dfrac{4.6}{\sqrt{1000}}\)`. The probability is $$ P(23.8-0.5<\bar{X}<23.8+0.5) = P(\bar{X}<24.3)-P(\bar{X}<23.3) \approx 0.9994 $$ which can be obtained by the following Excel formula: `NORM.DIST(24.3, 23.8, 4.6/SQRT(1000),TRUE)-NORM.DIST(23.3, 23.8, 4.6/SQRT(1000),TRUE)` --- ## Example: Normal Distribution vs Sampling of Normal Distribution Suppose speeds of vehicles on a particular stretch of roadway are normally distributed with mean 36.6 mph and standard deviation 1.7 mph. - Find the probability that the speed `\(X\)` of a randomly selected vehicle is between 35 and 40 mph. - Find the probability that the mean speed `\(\bar{X}\)` of 10 randomly selected vehicles is between 35 and 40 mph. **Solution:** Since the population is normally distributed `\(\mu=36.6\)` and `\(\sigma=1.7\)`, the sampling distribution of the sample mean is also normal distributed but with `\(\mu_{\bar{x}}=\mu=36.6\)` and `\(\sigma_{\bar{X}}=\sigma/\sqrt{n}=1.7/\sqrt{10}\)`. The probability that the speed of a vehicle is between 35 and 40 is `\(P(35< X< 40)=P(X< 40)-P(X<35)\approx 0.8039\)` which can be obtained by `NORM.DIST(40, 36.6, 1.7, TRUE)-NORM.DIST(35, 36.6, 1.7, TRUE)`. The probability getting a sample of size 10 with the mean between 35 and 40 is `\(P(35<\bar{X}< 40)=P(\bar{X}< 40)-P(\bar{X}<35)\approx 0.9985\)` which can be obtained by `NORM.DIST(40, 36.6, 1.7/SQRT(10), TRUE)-NORM.DIST(35, 36.6, 1.7/SQRT(10), TRUE)` --- ## Sampling Distribution of a Sample Proportion - When working with categorical variables, we often study the proportion of a data set. The proportion of a specific characteristic in a data set can be viewed as the mean of the data set by identifying the specific characteristic with 1 and others with `\(0\)`. **Example:** Consider the following data set .center[.red[1], 0, .red[1], .red[1], 0, 0, .red[1], 0, .red[1], .red[1]] Find proportion of .red[red numbers] and the mean of the data set. **Solution:** The proportion of red numbers is `\(\frac{6}{10}=0.6\)`. So is the mean: `\(\frac{6\cdot 1 + 4\cdot 0}{10}=0.6\)`. - Consider a population consisting of 1s and 0s. Let `\(p\)` be the proportion of 1s. Then standard deviation is `$$\sigma=\sqrt{(1-p)^2p+(0-p)^2(1-p)}=\sqrt{p(1-p)}.$$` --- ## Central Limit Theorem for Proportion - For a sampling distribution of sample proportion, we write `\(\hat{P}\)` for the random variable of sample proportions. - **Central Limit Theorem for Proportion:** For large samples, the distribution of sample proportions `\(\hat{P}\)` is approximately normal, with the mean `\(\mu_{\hat{P}}=p\)` and standard deviation `\(\sigma_{\hat{P}}=\sqrt{\frac{p(1-p)}{n}}\)`, where `\(p\)` is the population proportion. --- ## Required Sample Size to Apply the Central Limit Theorem for Proportion - As a sample proportion is always between 0 and 1, and 99.7% of sample proportions lie within 3 standard deviation away from the population proportion, when using the central limit theorem for proportion, we require the sample size `\(n\)` satisfying the following condition: the interval `\(\left[p-3\sqrt{\frac{p(1-p)}{n}}, p+3\sqrt{\frac{p(1-p)}{n}}\right]\)` lies wholly in the interval `\([0, 1]\)`. - In practice, if `\(n\)` satisfies the following two inequalities: `\(np\ge 10\)` and `\(n(1-p)\ge 10\)`, then we consider `\(n\)` is large enough for assuming that the sampling distribution of the sample proportion is approximately normal. - When the population proportion `\(p\)` is unknown, to apply the central limit theorem for proportion, we require the sample size `\(n\)` satisfying the same conditions with `\(p\)` replaced by the sample proportion `\(\hat{p}\)`. That is, the sample size `\(n\)` should satisfies `\(n\hat{p}\ge 10\)` and `\(n(1-\hat{p})\ge 10\)`. --- ## Example: Sampling Voters Suppose that in a population of voters in a certain region 53% are in favor of a particular law. Nine hundred randomly selected voters are asked if they favor the law. Find the probability that the sample proportion computed from a random sample of size 900 will be at least 2% above true population proportion. -- **Solution:** We first verify that the sampling distribution is approximately normal. Since `\(p=0.53\)` and `\(n=900\)`, `\(np=900\cdot 0.53>10\)` and `\(n(1-p)=900(1-0.53)>10\)`. By the central limit theorem, the sampling distribution is approximately normal. The standard deviation of the sampling distribution is `\(\sigma_{\hat{P}}=\sqrt{\frac{0.53(1-0.53)}{900}}\approx 0.017\)`. Then the probability that the random sample has a proportion at least 2% above 53% is `$$P(\hat{P}>0.55)=1-P(\hat{P}\le 0.55)\approx 0.1197$$` which can be obtained by `1-NORM.DIST(0.55, 0.53, SQRT(0.53*(1-0.53)/900),TRUE)`. --- ## Example: Traffic Accidents Caused by Distraction Suppose that in 36% of all car accidents involve injury. Find the probability that the injury rate in a random sample of 250 car accidents is between 30% and 45%. -- **Solution:** Firs we verify that the sample size is large enough to assume the sample proportion is approximately normally distributed by the Central Limit Theorem. The injury rate of all car accidents is `\(p=10\%=0.3\)` and the sample size is `\(250\)`. Because `\(np=250\cdot 0.36=90>10\)` and `\(n(1-p)=250\cdot(1-0.36)=160>10\)`, the sample size is considered large enough. Let `\(\hat{P}\)` be the sample proportion of a random sample. By the Central Limit Theorem, the distribution of `\(\hat{P}\)` is approximately normally with the mean `\(p=0.36\)` and standard deviation `\(\sigma_{\hat{P}}=\sqrt{\frac{p(1-p)}{n}}\approx 0.03\)` Using the Excel function, `NORM.DIST(x, mean, SD, TRUE)`, we find the probability of a random sample of 250 car accidents with the injury rate between 30% and 45% is $$ P(0.30<\hat{P}<0.45)=P(\hat{P}<0.45)-P(\hat{P}<0.30) \approx 0.999-0.023 =0.976 $$ --- ## Practice: Sample mean within an interval An unknown distribution has a mean of 28 and a standard deviation 6. Samples of size n = 30 are drawn randomly from the population. Find the probability that the sample mean is between 27 and 30. --- ## Practice: Sample mean of GPA The numerical population of grade point averages at a college has mean 2.61 and standard deviation 0.5. If a random sample of size 100 is taken from the population, what is the probability that the sample mean will be between 2.51 and 2.71? .footmark[ Source: [Example 4 in Section 6.2 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s10-02-the-sampling-distribution-of-t.html) ] --- ## Practice: Proportion of red candy <iframe src="https://www.myopenmath.com/embedq2.php?id=178683&seed=2020&showansafter" width="100%" height="560px" data-external="1"></iframe> --- ## Practice: Proportion of voting In a mayoral election, based on a poll, a newspaper reported that the current mayor received 45% of the vote. If this is true, what is the probability that a random sample of 100 voters had less than 35% voting for the current mayor? --- class: center, middle # More Practice on Sampling Distributions --- ## Practice: Sampling Distribution of Mean with Unknown Population Distribution A population has mean 73.5 and standard deviation 2.5. 1. Find the mean and standard deviation of `\(\bar{X}\)` for samples of size 30. 2. Find the probability that the mean of a sample of size 30 will be less than 72. .footmark[ Source: [Exercise 3 in Section 6.2 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s10-02-the-sampling-distribution-of-t.html). ] --- ## Practice: Sampling Distribution of Mean with Normal Population Distribution A normally distributed population has mean 57.7 and standard deviation 12.1. 1. Find the probability that a single randomly selected element X of the population is less than 45. 2. Find the mean and standard deviation of `\(\bar{X}\)` for samples of size 16. 3. Find the probability that the mean of a sample of size 16 drawn from this population is less than 45. .footmark[ Source: [Exercise 6 in Section 6.2 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s10-02-the-sampling-distribution-of-t.html). ] --- ## Practice: Cholesterol Level in Large Eggs Suppose the mean amount of cholesterol in eggs labeled “large” is 186 milligrams, with standard deviation 7 milligrams. Find the probability that the mean amount of cholesterol in a sample of 144 eggs will be within 2 milligrams of the population mean. .footmark[ Source: [Exercise 15 in Section 6.2 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s10-02-the-sampling-distribution-of-t.html). ] --- ## Practice: Color Blindness Rate Suppose that 8% of all males suffer some form of color blindness. Find the probability that in a random sample of 250 men at least 10% will suffer some form of color blindness. .footmark[ Source: [Exercise 13 in Section 6.3 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s10-03-the-sample-proportion.html). ] --- ## Practice: Testing an Airline's Claim An airline claims that 72% of all its flights to a certain region arrive on time. In a random sample of 30 recent arrivals, 19 were on time. You may assume that the normal distribution applies. 1. Compute the sample proportion. 2. Assuming the airline’s claim is true, find the probability of a sample of size 30 producing a sample proportion so low as was observed in this sample. .footmark[ Source: [Exercise 17 in Section 6.3 in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s10-03-the-sample-proportion.html). ] --- ## Practice: Minimal Mean Weight of a Particular Fruit <iframe src="https://www.myopenmath.com/embedq2.php?id=33379&seed=2020&showansafter" width="100%" height="560px" data-external="1"></iframe> <!-- ## Quiz 7 1. State the Central Limit Theorem for Proportion. 2. Determine whether the given sampling distribution is approximately normal. **Please explain!** 1. The sampling distribution of means of samples of size 50. 2. The sampling distribution of proportions of samples of size 50. --> --- class: center middle # Lab Instructions in Excel --- ## The `NORM.DIST()` Function - Let `\(X\)` be a normal random variable with mean `\(\mu\)` and standard deviation `\(\sigma\)`, that is `\(X\sim \mathcal{N}(\mu, \sigma^2)\)`. In Excel, `\(P(X<x)\)` is given by `NORM.DIST(x, mean, sd, TRUE)`. - Recall the mean of a data set can obtained by the Excel function `AVERAGE()`. - Given the population mean `\(\mu\)` and standard deviation `\(\sigma\)`, if the sample size `\(n\)` is bigger than 30 and the sample mean is `\(\bar{x}\)`. The probability of getting another sample of the same size but smaller mean can be obtained by the following Excel function: `NORM.DIST(` `\(\bar{x},\mu,\sigma\)` `/sqrt(n),TRUE)`.