MA336 Statistics

MA336 Statistics

Fei Ye

Department of Mathematics and Computer Science

December 2020

1 / 377

Statistical Studies

3 / 377

Learning Goals for Statistical Studies

  • Distinguish between a population and a sample.

  • Determine whether a study is an observational study or an experiment.

  • Determine the goal of a statistical study and what types of conclusions are appropriate.

  • Recognize typical forms of sampling biases such as convenience sample and voluntary response.

  • Explain why randomization should be used and describe how to implement a randomized design: Simple random sample, Stratified random sample, Cluster random sample, Systematic random sample.

  • Determine whether the conclusion of an experiment design is appropriate.

4 / 377

The Big Picture

A picture showing how statistics works

5 / 377

Basic statistical concepts (1/3)

  • Data consists of information from observations, counts, measurements, responses, or experiments.

  • A population is the collection of all objects that are of interest.

  • A parameter is a number that is a property of the population.

  • A sample is a subset of a population.

  • A statistic is a number, such as a percentage, that represents a property of a sample.

  • In statistics, a variable is a characteristic, or attribute of interest that we gather about individuals or objects. There are two types of variables according to their values.

    • Categorical variables (or qualitative variables) represent attributes, labels, or nonnumerical entries, such as names and colors.

    • Quantitative variables represent numerical measurements or counts, such as weights and number of students in each class.

6 / 377

Basic statistical concepts (2/3)

  • Example: Determine whether each group is a population or a sample.

    1. The grades of all students in a math class.
    2. The 10 students in a math class who earned an "A".
  • Answer:

    1. Population,
    2. Sample.
  • Example: Identify the statistical concepts in the following study: To learn the percentage of students who go to school by public transportation, 500 students at a college were surveyed. 50% said they go to school by public transportation.

  • Answer:

    • Population: all students at the college
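    • Sample: the 500 students surveyed
    • Parameter: the unknown percentage of all students at the college who go to school by public transportation
    • Statistic: 50%, the percentage of surveyed students who go to school by public transportation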
7 / 377

Basic statistical concepts (3/3)

  • Example: Identify the type of each variable.

| Variable | Type |
|---|---|
| Age | Quantitative |
| Hair color | Qualitative |
| GPA | Quantitative |
| Education attainment (AS, BS, MS, etc.) | Qualitative |
8 / 377

Practice: basic statistical concepts

Identify the population, sample, the variable of study, the type of the variable, the population parameter and the sample statistics.

An administrator wishes to estimate the passing rate of a certain course. She takes a random sample of 50 students and obtains their letter grades of that course. Among the 50 students, 32 students earned a grade C or better.

9 / 377

Types of statistical studies (1/2)

  • A statistical study can usually be categorized as an observational study or an experiment, according to how the study is carried out.

    • An observational study observes individuals and measures variables of interest. The main purpose of an observational study is to describe a group of individuals or to investigate an association between two variables.

    • An experiment intentionally manipulates one variable in an attempt to cause an effect on another variable. The primary goal of an experiment is to provide evidence for a cause-and-effect relationship between two variables.

10 / 377

Types of statistical studies (2/2)

  • Example: Which type of study will answer each question?

    1. What proportion of all college students in the United States have taken classes at a community college?

    2. Does the use of computer-aided instruction in college math classes improve test scores?

  • Answer: 1. Observational study; 2. Experiment.

See Types of Statistical Studies (2 of 4) in the textbook Concepts in Statistics for more examples.

11 / 377

Practice: type of statistical study

Identify the type of statistical study:

  1. A study took a random sample of adults and asked them about their bedtime habits. The data showed that people who drank a cup of tea before bedtime were more likely to go to sleep earlier than those who didn't drink tea.

    A. Observational
    B. Experimental

  2. Another study took a group of adults and randomly divided them into two groups. One group was told to drink tea every night for a week, while the other group was told not to drink tea that week. Researchers then compared when each group fell asleep.

    A. Observational
    B. Experimental

Source: Khan Academy

12 / 377

Questions about population (1/2)

| Type of Research Question | Examples |
|---|---|
| Make an estimate about the population (often an estimate about an average value or a proportion with a given characteristic) | What is the average number of hours that community college students work each week? What proportion of all U.S. college students are enrolled at a community college? |
| Test a claim about the population (often a claim about an average value or a proportion with a given characteristic) | Is the average course load for a community college student greater than 12 units? Do the majority of community college students qualify for federal student loans? |
13 / 377

Questions about population (2/2)

| Type of Research Question | Examples |
|---|---|
| Compare two populations (often a comparison of population averages or proportions with a given characteristic) | In community colleges, do female students have a higher GPA than male students? Are college athletes more likely than non-athletes to receive academic advising? |
| Investigate a relationship between two variables in the population | Is there a relationship between the number of hours high school students spend each week on Facebook and their GPA? Is academic counseling associated with quicker completion of a college degree? |
14 / 377

Question on cause-and-effect (1/2)

  • A research question that focuses on a cause-and-effect relationship is common in disciplines that use experiments, such as medicine or psychology.

    • Does cell phone usage increase the risk of developing a brain tumor?
    • Does drinking red wine lower the risk of a heart attack?
  • In a study of a relationship between two variables, one variable is the explanatory variable, and the other is the response variable.

  • To establish a cause-and-effect relationship, we want to make sure the explanatory variable is the only thing that impacts the response variable.

  • We therefore get rid of all other factors that might affect the response. These factors are called confounding variables. For example, taking a medicine could be a confounding variable in the second question above.

15 / 377

Question on cause-and-effect (2/2)

Example: Determine whether each question is a cause-and-effect question. What are the explanatory and response variables?

  1. Does use of computer-aided instruction in college math classes improve test scores?
  2. Does tutoring correlate with improved performance on exams?

Answer:

  1. This question investigates a cause-and-effect relationship. The explanatory variable is computer-aided instruction and the response variable is the test scores.

  2. This question investigates a correlation between variables in a population and is not a cause-and-effect question. The explanatory variable is tutoring and the response variable is the performance on exams.

16 / 377

Appropriate conclusions of a study (1 of 2)

  • In general, we should not make cause-and-effect statements from observational studies unless the impact of confounding variables can be significantly decreased.

  • Example: A researcher studies the medical records of 500 randomly selected patients. Based on the information in the records, he divides the patients into two groups: those given the recommendation to take an aspirin every day and those with no such recommendation. He reports the percentage of each group that developed heart disease.

Determine whether the study supports the conclusion that taking aspirin lowers the risk of heart attacks.

  • Answer: The conclusion claims a cause-and-effect relationship. To support such a claim, we need an experiment. However, this is an observational study: the researcher did not control which patients took aspirin, so the conclusion is not appropriate.
17 / 377

Practice: cause-and-effect

Does higher education attainment lead to higher salary?

  1. Determine whether the question is a cause-and-effect question.
  2. What are the explanatory and response variables?
  3. If a student wants to study this question, what type of statistical study can be used? What kind of conclusion can be drawn?
18 / 377

Sampling plans

To make accurate inferences, the sample must be representative of the population.

  • A sampling plan describes exactly how we will choose the sample.

  • A sampling plan is biased if it systematically favors certain outcomes.

  • In random sampling, every individual or object has an equal chance of being selected.

19 / 377

Methods of random sampling (1/2)

  • Simple random sample: every group of the same size has an equal chance of being selected. Tables of random numbers, calculators, and software are often used to generate random numbers. Random Table

  • Stratified random sample: The population is first split into groups. Then subjects from each group are selected randomly. Stratified Sample

20 / 377

Show how to generate a random number using the Excel function RANDBETWEEN(). In the latest version of Excel, a new function RANDARRAY() is available.

Methods of random sampling (2/2)

  • Cluster sample: The population is first split into groups. Then some groups are selected randomly. Cluster Sample

  • Systematic sample: First, a starting number is chosen randomly. Then take every \(n\)-th piece of the data. Systematic Sample
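
The four sampling methods can be sketched in Python as a rough illustration. This is only a sketch under assumptions: the student ID numbers, the four sections of 25 students, and all function names are hypothetical and not part of the course materials.

```python
import random

def simple_random_sample(population, n, rng):
    """Every group of n individuals is equally likely to be chosen."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Randomly select n_per_stratum individuals from every group."""
    return [x for group in strata for x in rng.sample(group, n_per_stratum)]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole groups and keep everyone in them."""
    chosen = rng.sample(clusters, n_clusters)
    return [x for group in chosen for x in group]

def systematic_sample(population, step, rng):
    """Pick a random start among the first `step` items, then take every step-th item."""
    start = rng.randrange(step)
    return population[start::step]

rng = random.Random(336)          # fixed seed so the run is repeatable
students = list(range(1, 101))    # hypothetical ID numbers 1..100
sections = [students[i:i + 25] for i in range(0, 100, 25)]  # four hypothetical groups

srs = simple_random_sample(students, 10, rng)
strat = stratified_sample(sections, 3, rng)
clus = cluster_sample(sections, 2, rng)
syst = systematic_sample(students, 10, rng)

print("simple random:", sorted(srs))
print("stratified:   ", sorted(strat))
print("cluster:      ", len(clus), "students from 2 whole sections")
print("systematic:   ", syst)
```

Note the difference between the two group-based methods: the stratified sample takes a few students from every section, while the cluster sample takes every student from a few sections.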

21 / 377

Practice: sampling methods

Determine the type of sampling method.

  1. A market researcher polls every tenth person who walks into a store.

  2. 100 students whose student ID numbers match 100 numbers generated by a computer randomization program.

  3. The first 30 people who walk into a sporting event are polled on their television preferences.

22 / 377

Bad sampling

  • Biased sampling

    • Online polls. These are examples of a voluntary response sample.
    • Mall surveys. These are an example of a convenience sample.

    See Sampling (1 of 2) in the textbook for examples

  • Undercoverage

    • It occurs when some groups in the population are left out of the process of choosing a sample. For example, randomly surveying only math students to estimate the average GPA of a college.
23 / 377

Appropriate sampling design

  • Example: Suppose that you want to estimate the proportion of students at your college that use the library.

    Which sampling plan will produce the most reliable results?

    1. Select 100 students at random from students in the library.

    2. Select 200 students at random from students who use the Tutoring Center.

    3. Select 300 students who have checked out a book from the library.

    4. Select 50 students at random from the college.

  • Answer: The 4th sampling plan is the most reliable. The first three plans suffer from undercoverage of the college's students.

In general, the larger the sample size, the more accurate the conclusion. However, we have to avoid bad sampling.

24 / 377

Elements of experimental design (1/2)

  • Control reduces the effects of extraneous variables (i.e., variables other than the independent variable and the dependent variable). These extraneous variables are called lurking variables.

  • Three control strategies are control groups, placebos, and blinding.

    • A control group is a baseline group that receives no treatment or a neutral treatment.

    • A neutral treatment that has no "real" effect on the dependent variable is called a placebo, and a participant's positive response to a placebo is called the placebo effect.

    • Blinding is the practice of not telling participants whether they are receiving a placebo. Double-blinding is the practice of not telling either the participants or the researchers which group is receiving a treatment or a placebo.

25 / 377

Elements of experimental design (2/2)

  • Randomization helps ensure that conclusions drawn from the experiment are statistically valid.

    • With random assignment, we can be fairly confident that any differences we observe in the responses of the treatment groups are due to the explanatory variable.
  • Replication reduces variability in experimental results and increases their significance.

    • Although randomization helps to ensure that treatment groups are as similar as possible, the results of a single experiment, applied to a small number of objects or subjects, should not be accepted without question.
    • Any good experiment should be reproducible, and in particular, replication should yield similar results.
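
Random assignment to a treatment group and a control group can be sketched in Python. This is a minimal illustration, assuming 20 hypothetical participants labeled P01 to P20; the labels and the seed are made up for the example.

```python
import random

rng = random.Random(2020)  # fixed seed so the run is repeatable
participants = [f"P{i:02d}" for i in range(1, 21)]  # 20 hypothetical participants

# Shuffle a copy of the list, then split it in half.
shuffled = participants[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
treatment = shuffled[:half]   # receives the treatment
control = shuffled[half:]     # receives a placebo or no treatment

print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because chance alone decides who lands in which group, other characteristics of the participants tend to balance out between the groups.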
26 / 377

Confounding variable vs lurking variable

  • A confounding variable has at least a partial effect on the response variable.

  • Example: In a study of the relation between a type of fertilizer and tomato size, the amount of sunshine is a confounding variable. It contributes to the growth of the tomatoes.

  • A lurking variable has an effect on both the explanatory and the response variables.

  • Example: People find that there is a positive association between the number of firefighters and the amount of damage. However, both are affected by the size of the fire.

Practice: experimental design

There is an ongoing debate about how many spaces should be placed after a period in typed documents. Alana read about a study where 100 participants all read the same document typed in Courier New font. Half of the participants were randomly assigned the document with one space after each period, and the other half were given the document with two spaces after each period.

Participants who read the document with two spaces after each period were able to finish reading significantly faster than those with one space after each period. Alana concluded that using two spaces after each period will help people read all documents faster.

Is this study appropriate? Why?

Source: Khan Academy

27 / 377

Lab: Random numbers by Excel (1/3)

  • Example: Randomly generate a number between 0 and 1.

    • Step 1: Choose a cell, say A1

    • Step 2: Click the Insert Function button \(f_x\).

    • Step 3: In the popup window, search "random" and select RAND.

    • Step 4: Click OK, you will get a randomly generated number.

    Alternatively, you may also manually enter the function: =rand() in the cell and hit enter.

28 / 377

Lab: Random numbers by Excel (2/3)

  • Example: Generate 10 random integers of 2 digits.

    • Step 1: Generate a random integer, say in the cell A1, using the Excel function randbetween(bottom,top).

    • Step 2: Move the mouse cursor to the lower right corner of the cell A1. A solid plus + will appear.

    • Step 3: Hold the left mouse button and drag the cell horizontally or vertically to get 10 numbers.

Using RANDBETWEEN, you will find that some numbers may be repeated. In recent versions of Excel, RANDARRAY generates a whole array of random numbers at once, and it can be combined with other functions to avoid repetition.

29 / 377

Lab: Random numbers by Excel (3/3)

  • Example: Generate 10 random integers of 2 digits without repetition.

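    • In a cell with 10 empty cells below it, say A1, apply the Excel function =RANDARRAY(rows,columns,min,max,integer). In this case, set rows=10, columns=1, min=10, max=99, and integer to TRUE. Note that RANDARRAY draws each value independently, so repeated values are still possible. To rule them out, shuffle all two-digit integers and keep the first 10; one possible formula is =INDEX(SORTBY(SEQUENCE(90,1,10),RANDARRAY(90)),SEQUENCE(10)).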
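
Outside Excel, the two tasks (two-digit random integers with and without repetition) can be sketched with Python's standard `random` module. This is only an illustration; the seed is an arbitrary choice made for repeatability.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Independent draws, like filling 10 cells with a random-integer formula:
# the same value may appear more than once.
with_repeats = [rng.randint(10, 99) for _ in range(10)]

# Sampling from the list of all two-digit integers never repeats a value.
without_repeats = rng.sample(range(10, 100), 10)

print("may repeat:  ", with_repeats)
print("never repeat:", without_repeats)
```

The second approach is sampling without replacement: each integer from 10 to 99 can be chosen at most once.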
30 / 377

Table in Excel

  • Insert a table

    • Step 1: In the menu bar, select Insert.

    • Step 2: Look for Table and click it.

    • Step 3: In the popup window, you may enter the locations of two diagonally opposite cells. To do so, press Shift and select the two diagonally opposite cells of your table.

    • Step 4: Click OK. You will see the table.

  • Remark: Tables are normally used for more than one variable, that is, characteristics or attributes being studied, such as attendance rate and grade. In a table, a column usually holds the entries of a data set for a certain variable. Rows correspond to individual entries.

31 / 377

Insert or delete cells

  • Insert or delete cells, rows or columns

    • Step 1: Highlight by left clicking the cell(s), row, or column that you want to insert or delete.

    • Step 2: Right-click the highlighted cell, row or column

    • Step 3: In the popup window, select insert or delete and follow the instruction.

32 / 377

Install the Analysis ToolPak

  • We will use the Analysis ToolPak frequently for analyzing data.

  • To install the Analysis ToolPak add-in:

    • Step 1: In the Excel menu bar, select File.

    • Step 2: Choose and click Options.

    • Step 3: In the popup window, choose and click Add-ins.

    • Step 4: In the new display, look for Manage: Excel Add-ins and click Go next to it.

    • Step 5: In the new popup window, select Analysis ToolPak and then click the OK button.

33 / 377

Summarizing Data Graphically

34 / 377

Learning Goals for Summarizing Data Graphically

  • Create and interpret graphs (dot plots, pie charts, histograms, or boxplots) as a means of summarizing and communicating data meaningfully.

  • Calculate and explain the purpose of measures of location (mean, median), variability (standard deviation, interquartile range).

  • Explain the impact of outliers on summary statistics such as mean, median and standard deviation.

35 / 377

Distribution of Quantitative Data

  • In data analysis, one goal is to describe patterns (known as the distribution) of the variable in the data set and create a useful summary about the set.

  • To describe patterns in data, we use descriptions of shape, center, and spread. We also describe exceptions to the pattern. We call these exceptions outliers.

Concepts used in the description of a distribution

36 / 377

Dot Plots

  • A dot plot includes all values from the data set, with one dot for each occurrence of an observed value from the set.

Example: The data set contains the petal lengths of 15 iris flowers. Create a dot plot to describe the distribution of petal lengths.

1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2

Solution: For each number in the data set, we draw a dot. We stack dots of the same value from the bottom up.
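
The stacking procedure can be sketched as a text dot plot in Python. This is a rough illustration only: the helper name `dot_plot_rows` and the choice of `*` as the dot are assumptions made for the example.

```python
from collections import Counter

petal_lengths = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,
                 1.5, 1.6, 1.4, 1.1, 1.2]

counts = Counter(petal_lengths)  # value -> number of dots

# Axis positions from the minimum to the maximum in steps of 0.1,
# including values that never occur, so gaps stay visible.
low, high = min(petal_lengths), max(petal_lengths)
steps = round((high - low) / 0.1)
axis = [round(low + 0.1 * i, 1) for i in range(steps + 1)]

def dot_plot_rows(axis, counts):
    """Return text rows: stacked dots from the top level down, then the axis."""
    rows = []
    for level in range(max(counts.values()), 0, -1):
        cells = [" * " if counts.get(v, 0) >= level else "   " for v in axis]
        rows.append(" ".join(cells))
    rows.append(" ".join(f"{v:.1f}" for v in axis))
    return rows

for row in dot_plot_rows(axis, counts):
    print(row)
```

The tallest stack sits over 1.4, which occurs six times in the data set.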

37 / 377

Practice: Heights Of Cherry Trees

The data set contains the heights of 20 Black Cherry Trees. Create a dot plot to describe the distribution of the heights.

64, 69, 71, 72, 74, 74, 75, 76, 76, 77, 78, 80, 80, 80, 80, 81, 82, 85, 86, 87

38 / 377

Pie Charts (1/2)

  • A pie chart is a circle divided into sectors that represent categories; the area of each sector is proportional to the frequency of its category.
    • The frequency of a category is the number of occurrences of elements in the category.
    • The proportion of a frequency to the size of the population or the sample is also called the relative frequency.

Example: The counts of majors of 100 students in a sample are shown in the table. Use a pie chart to organize the data.

| Major | Frequency (Counts) |
|---|---|
| Art | 30 |
| Engineering | 50 |
| Science | 20 |
39 / 377

Pie Charts (2/2)

Solution:

  • The relative frequency (percent) of each major is shown in the following table.

| Major | Frequency | Relative Frequency |
|---|---|---|
| Art | 30 | 30% |
| Engineering | 50 | 50% |
| Science | 20 | 20% |

  • The following shows the pie chart.

    A pie chart
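
The relative frequencies behind the pie chart can be computed with a short Python sketch. The sector angles are an added illustration: since sector area is proportional to frequency, each sector's central angle is 360 degrees times its relative frequency. Variable names are assumptions made for the example.

```python
major_counts = {"Art": 30, "Engineering": 50, "Science": 20}

total = sum(major_counts.values())  # sample size
relative = {major: count / total for major, count in major_counts.items()}
angles = {major: 360 * share for major, share in relative.items()}

for major in major_counts:
    print(f"{major:12s} {relative[major]:6.0%}  {angles[major]:6.1f} degrees")
```

The relative frequencies always add up to 1 (100%), and the angles add up to 360 degrees.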

40 / 377

Practice: Passengers on Titanic

The following data table summarizes the passengers on the Titanic. Use a pie chart to describe the data table.

| Class | Passengers |
|---|---|
| 1st | 325 |
| 2nd | 285 |
| 3rd | 706 |
| Crew | 885 |
41 / 377

Histograms

  • A histogram divides values of a variable into equal-sized intervals called bins (classes in some books) and uses rectangular bars to show the frequency (count) of observations in each interval.

  • A frequency distribution is a table that contains bins together with frequencies and/or relative frequencies. Relative frequencies are proportions (percentages) defined by the formula \(\text{Relative frequency}=\frac{\text{Class frequency}}{\text{Sample size}}\).

  • Each bin has a lower bin limit, which is the left endpoint of the interval, and an upper bin limit, which is the right endpoint of the interval.

  • The bin width is the distance between the lower (or upper) bin limits of two consecutive bins.

  • The difference between the maximum and the minimum data entries is called the range.

  • The midpoint of a bin is half of the sum of the lower and upper limits of the bin.

42 / 377

Dot plots work well with small data sets because each distinct value acts as a bin that contains all entries with that value.

Example: Histogram of mpg (1 of 2)

The following data set shows the mpg (miles per gallon) of 30 cars. Construct a frequency table and frequency histogram for the data set using a bin width of 4.

21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7

Solution:

  • Find the maximum, minimum and range of the data set. In this example, the minimum is 10.4 , the maximum is 33.9, and the range is 33.9-10.4=23.5

  • Determine the number of bins by rounding up \(\frac{\text{range}}{\text{bin width}}\). In this example, the number of bins is \(\lceil \frac{23.5}{4}\rceil=6\).

  • Choose a starting point as the first lower bin limit. A convenient starting point is a value less than the minimum that has more accuracy than the data set. For example, in this data set, we may start with 10.35, then add the bin width to get all lower bin limits: 10.35, 14.35, 18.35, 22.35, 26.35, and 30.35.

43 / 377

Example: Histogram of mpg (2 of 2)

Solution:(continued)

  • The upper bin limit can be taken as the next lower bin limit. In this example, the upper bin limits can be taken as 14.35, 18.35, 22.35, 26.35, 30.35 and 34.35.

  • Record counts in bins and create the frequency distribution table.

  • Graph the histogram using the frequency distribution table.

| Bin | Frequency |
|---|---|
| 10.35-14.35 | 4 |
| 14.35-18.35 | 9 |
| 18.35-22.35 | 8 |
| 22.35-26.35 | 4 |
| 26.35-30.35 | 1 |
| 30.35-34.35 | 4 |
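
The steps of this solution can be checked with a short Python sketch. It follows the same choices as the example (bin width 4, first lower bin limit 10.35); the variable names are assumptions made for the illustration.

```python
import math

mpg = [21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2,
       17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9,
       21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7]

bin_width = 4
start = 10.35  # first lower bin limit, slightly below the minimum

# Number of bins: round up range / bin width.
num_bins = math.ceil((max(mpg) - min(mpg)) / bin_width)
lower_limits = [start + bin_width * i for i in range(num_bins)]

# Count how many values fall into each bin.
frequencies = [0] * num_bins
for x in mpg:
    index = int((x - start) // bin_width)
    frequencies[index] += 1

for low, freq in zip(lower_limits, frequencies):
    print(f"{low:.2f}-{low + bin_width:.2f}  {freq}")
```

Because the bin limits carry one more decimal place than the data, no value can land exactly on a limit, so every value belongs to exactly one bin.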

44 / 377

Some Remarks on Histogram

  • Avoid histograms with bin widths that are too large or too small. See Histogram 2 of 4 in Concepts in Statistics for an interactive demonstration.

  • When the bin width is not given, we may first determine the number of bins. There are different approaches. For example, the Rice rule takes the number of bins to be \(k = \lceil 2n^{1/3}\rceil\), that is, \(2n^{1/3}\) rounded up, where \(n\) is the number of data entries.

  • If the number of bins is \(k\), then we choose a number with the same or one more decimal place that is greater than \(\frac{\text{range}}{k}\), but no more than \(\frac{\text{range}}{k-1}\) as the bin width.

  • The area of a bar represents the relative frequency for the bin. There should be no space between any two bars.

  • Bar charts are usually used to compare data sets from different categories. A histogram should not be confused with a bar chart.

  • See the Statistics How To page for more discussion on choosing bin width.
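
The Rice rule and the bin-width bounds from the remarks above can be sketched in Python. This is only an illustration: the function names are assumptions, and the integer check `k**3 >= 8*n` is added to avoid floating-point rounding problems with cube roots.

```python
def rice_bins(n):
    """Smallest integer k with k >= 2 * n**(1/3), i.e. with k**3 >= 8 * n."""
    k = round(2 * n ** (1 / 3))
    # Correct any floating-point error using exact integer arithmetic.
    while k ** 3 < 8 * n:
        k += 1
    while k > 1 and (k - 1) ** 3 >= 8 * n:
        k -= 1
    return k

def bin_width_bounds(data_range, k):
    """Bin width should be greater than range/k but no more than range/(k-1)."""
    return data_range / k, data_range / (k - 1)

k = rice_bins(30)                        # 30 data entries, as in the mpg example
low, high = bin_width_bounds(23.5, k)    # range of the mpg data is 23.5
print(k, round(low, 3), round(high, 3))
```

For the 30 mpg values, the rule suggests 7 bins, so a bin width such as 3.5 falls within the bounds.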

45 / 377

Show a bar chart to students in Excel.

Practice: Petal lengths of irises

The following data set shows the petal lengths of 20 irises. Construct a frequency table and frequency histogram for the data set using 6 bins.

1.4, 5.4, 1.2, 4.5, 6.1, 1.5, 4.7, 1.4, 5.6, 5.2, 1.3, 6.3, 5.1, 5.6, 5, 6.7, 1.4, 1.6, 1.5, 1.5

46 / 377

Common Descriptions of Shape Distribution

  • Right skewed (reverse \(J\)-shaped): A right-skewed distribution has a lot of data at lower variable values. (Example: the histogram example.)

  • Left skewed (\(J\)-shaped): A left-skewed distribution has a lot of data at higher variable values with smaller amounts of data at lower variable values.

  • Symmetric with a central peak (bell-shaped): A central peak with a tail in both directions. A bell-shaped distribution has a lot of data in the center with smaller amounts of data tapering off in each direction. (Example: the petal length example.)

  • Uniform: A rectangular shape, the same amount of data for each variable value.

  • For examples of left-skewed and uniform distributions, please see the example in Dotplots 2 of 2 in Concepts in Statistics.

47 / 377

Practice: Shapes of Distributions

Statistics are used to compare and sometimes identify authors. The following lists show simple random samples that compare the letter counts for three authors.

Terry: 7, 9, 3, 3, 3, 4, 1, 3, 2, 2

Davis: 3, 3, 3, 4, 1, 4, 3, 2, 3, 1

Maris: 2, 3, 4, 4, 4, 6, 6, 6, 8, 3

Create a dot plot for each sample and describe the shape of the distribution of each sample.

48 / 377

Mean and Median for Distributions in Different Shapes

49 / 377

Practice: Choose Appropriate Measure of Center

A student survey was conducted at a major university. The following histogram shows the distribution of alcoholic beverages consumed in a typical week.

  1. What is the typical number of drinks a student has during a week?
  2. Do the data suggest that drinking is a problem in this university?

    The red line is over the median and the blue line is over the mean.
50 / 377

Lab: Create Frequency Tables (1 of 2)

In Excel, to create a frequency table for a data array, we need a bin array, which is used to split the data set into smaller intervals. The values in a bin array in Excel are the (upper) boundaries of the intervals. With a data array and a bin array, we can use the Excel function FREQUENCY(data_array, bins_array) to create a frequency table.

Suppose the data set is in column A and the bin array is in column B. Here is how to create a frequency table using the function FREQUENCY (data_array, bins_array):

  1. In column C, to the right of the smallest value of the bin array, enter =FREQUENCY(
  2. select the data values
  3. in the formula bar, enter a comma ,
  4. select the bin array
  5. in the formula bar, enter ).

Press Enter, and you will get a frequency table.
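
To see what counting against upper boundaries means, here is a Python sketch that mimics this behavior as it is commonly described: each value is counted in the first bin whose upper boundary is greater than or equal to it, and one extra count at the end collects values above the largest boundary. The function name `frequency` is an assumption made for the illustration, not an Excel API.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Count values v with previous_boundary < v <= boundary.

    The result has one more entry than `bins`; the last entry counts
    values greater than the largest boundary.
    """
    bins = sorted(bins)
    counts = [0] * (len(bins) + 1)
    for v in data:
        # Index of the first boundary that is >= v.
        counts[bisect_left(bins, v)] += 1
    return counts

mpg = [21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2,
       17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9,
       21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7]
upper_limits = [14.35, 18.35, 22.35, 26.35, 30.35, 34.35]

print(frequency(mpg, upper_limits))
```

With the upper bin limits from the mpg example, the first six counts match the frequency table built by hand, and the extra last count is 0 because no value exceeds 34.35.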

51 / 377

Lab: Creating Charts in Excel

Excel has many built-in chart functions. To create a chart,

  1. Select the data array/table
  2. Under the Insert tab, click on an appropriate chart in the Charts command set.

The appearance of chart can be changed after being created.

52 / 377

Lab: Create Histogram Charts in Excel (1 of 3)

  1. Select the data

  2. On the Insert tab, in the Charts group, from the Insert Statistic Chart dropdown list, select Histogram:

    Note: The histogram contains a special first bin which always contains the smallest number. This is different from many textbooks.

53 / 377

Lab: Create Histogram Charts in Excel (2 of 3)

Formatting a histogram chart is similar to formatting a pie chart. For example, you can change the bin width from Format Axis.

  1. Right-click on the horizontal axis and choose Format Axis in the popup menu:

  2. In the Format Axis pane, on the Axis Options tab, you may try different options for bins.

54 / 377

Lab: Create Histogram Charts in Excel (3 of 3)

Remark:

  • Excel uses a different convention to create histograms. The first bin is a closed interval, and the other bins are open on the left and closed on the right.

  • Select the Overflow bin checkbox and type a number; all values above this number will be added to the last bin.

  • Select the Underflow bin checkbox and type a number; all values below or equal to this number will be added to the first bin.

  • Histograms show the shape and the spread of quantitative data. For categorical data, discrete by its definition, bar charts are usually used to represent category frequencies.
55 / 377

Lab: Create Histogram Charts in Excel using the Analysis ToolPak

Suppose your data set is in Column A in Excel.

  • In the cell B1, put the first lower bin limit, which is a number slightly less than the minimum but has more decimal places than the data set.

  • Create upper bin limits in column C.

  • In the Data menu, look for Data Analysis (if it is missing, go to File > Options > Add-ins > Manage Excel Add-ins and check Analysis ToolPak). In the popup window, find Histogram.

  • In the input range, select your data set. In the bin range, select upper bins.

  • Check Chart Output and hit OK. You will see the frequency table and histogram in Sheet 2.

  • Change the gap between bars. Right click a bar and choose Format Data Series... and change the Gap Width to 2% or 1%.

56 / 377

Lab: How to Create a Dotplot in Excel

  • If you have a raw data set, follow the same procedure as for creating a histogram, but with a bin width equal to the precision of the data. For example, if your data set consists of integers, then choose 1 as the bin width.

  • Change the format of bars in the histogram.

    • Right click a bar and select Format Data Series....

    • Find Fill & Line and select both Picture or texture fill and Stack and Scale with.

    • Click the Online... button, type dot in the Bing search box, and press Enter.

    • Select a picture you like and you will get a dot-plot.

57 / 377

Lab: Practice

Use Excel to complete the following tasks:

  1. Create a random sample of 30 two-digit integers.

  2. Create a histogram with 6 bins for the sample of 30 two-digit integers.

  3. Create a dot plot for the sample of 30 two-digit integers.

Describe the shape of the distribution of the sample of 30 two-digit integers.

58 / 377

Measure of Central Tendency and Spread

59 / 377

Learning Goals for Measure of Central Tendency and Spread

  • Create and interpret graphs (dot plots, pie charts, histograms, or boxplots) as a means of summarizing and communicating data meaningfully.

  • Calculate and explain the purpose of measures of location (mean, median), variability (standard deviation, interquartile range).

  • Explain the impact of outliers on summary statistics such as mean, median and standard deviation.

60 / 377

Measure of Centers

  • Mean: The mean is the average, that is, the sum of all values divided by the number of values.

  • Median: The median is the middle of the data when all the values are listed in order. The median divides the data into two equal-sized groups.

  • Use the mean as a measure of center only for distributions that are reasonably symmetric with a central peak. When outliers are present, the mean is not a good choice.

  • Use the median as a measure of center for all other cases.

  • We need to use a graph to determine the shape of the distribution. So graph the data first.

  • Check the webpage on Skewness of Relative Frequency Histograms to see the positions of mean and median.
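
The effect of an outlier on the two measures of center can be illustrated with Python's standard `statistics` module. The data below are hypothetical weekly drink counts invented for this sketch; they are not taken from any survey in these slides.

```python
from statistics import mean, median

# Hypothetical data: nine typical values and one outlier (24).
with_outlier = [0, 0, 1, 1, 2, 2, 3, 3, 4, 24]
without_outlier = with_outlier[:-1]

print("with outlier:    mean =", mean(with_outlier), " median =", median(with_outlier))
print("without outlier: mean =", round(mean(without_outlier), 2),
      " median =", median(without_outlier))
```

The single outlier pulls the mean from about 1.78 up to 4, while the median stays at 2. This is why the median is preferred when the distribution is skewed or has outliers.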

61 / 377

Notations and Calculations about Mean

  • Sigma notation: in math, we denote the sum of values \(x_1\), \(x_2\), \(\dots\), \(x_n\) of a variable \(x\) by \(\sum_{i=1}^n x_i\) or simply by \(\sum x\).

  • The population mean is \(\mu= \frac{\sum x}{N}\), where \(N\) is the population size, i.e., the number of elements in the population.

    The notation \(\mu\) reads as mu.

  • The sample mean is \(\bar{x}=\frac{\sum{x}}{n}\), where \(n\) is the sample size. The notation \(\bar{x}\) reads as \(x\)-bar.

62 / 377

Example: Mean city mpg

Find the mean city mpg for a sample of 10 cars.

18, 21, 20, 21, 16, 18, 18, 18, 16, 20

Solution: The mean is

$$\bar{x}=\frac{18+21+20+21+16+18+18+18+16+20}{10}=18.6.$$

The mean mpg of the 10 cars is 18.6 mpg.

63 / 377

Weighted Mean

  • The weighted mean of a set of numbers \(\{x_1, \dots, x_n\}\) with weights \(w_1\), \(w_2\), ..., \(w_n\) is defined as $$\frac{\sum w_ix_i}{\sum w_i}.$$

  • The mean of a frequency table is the weighted mean \(\bar{x}=\frac{\sum f x}{n}\), where \(x\) is a value with frequency \(f\) and \(n=\sum f\) is the sample size.

64 / 377

Example: Course overall grade

In a course, the overall grade is determined in the following way: the homework average counts for 10%, the quiz average counts for 10%, the test average counts for 50%, and the final exam counts for 30%. What is the overall grade of a student who earned 92 on homework, 95 on quizzes, 90 on tests and 93 on the final?

Solution: The overall grade is the weighted mean

$$\frac{\sum w_ix_i}{\sum w_i}=\frac{0.1\cdot 92+0.1\cdot 95+0.5\cdot 90+0.3\cdot 93}{0.1+0.1+0.5+0.3}=91.6.$$
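
As a rough cross-check of the mean and weighted mean examples, the following Python sketch recomputes the sample mean of the city mpg data and the weighted course grade. Python is used here only as an alternative to the Excel workflow of these slides, and the helper name `weighted_mean` is made up for illustration.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Sample of 10 city mpg values from the mean example.
city_mpg = [18, 21, 20, 21, 16, 18, 18, 18, 16, 20]
x_bar = mean(city_mpg)

# Scores and weights in the order: homework, quizzes, tests, final.
scores = [92, 95, 90, 93]
weights = [0.1, 0.1, 0.5, 0.3]
overall = weighted_mean(scores, weights)

print(round(x_bar, 1), round(overall, 1))
```

Because the weights already add up to 1, dividing by their sum does not change the result here; the division matters when weights are frequencies or points rather than percentages.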

65 / 377

Show how to use Excel

Practice: Mean petal width

Find the average petal width for a sample of 10 iris flowers.

1.1, 0.2, 0.2, 1.2, 1.3, 0.2, 1.5, 1.9, 1.5, 1.8

66 / 377

Practice: Calculate a mean using the weighted mean formula

Find the mean from the dot plot of sepal length for a sample of 10 iris flowers.

67 / 377

Practice: Estimate the mean from a histogram

Estimate the average highway mpg using the histogram of a sample of 20 cars.

68 / 377

Practice: Weighted mean - calculate final grade

69 / 377

Median, Quartiles, Interquartile Range and Outliers

  • The three quartiles, Q1, Q2, and Q3 are numbers in an ordered data set that divide the data set into four equal parts. The second quartile is known as the median.

  • Interquartile Range (IQR for short) is the measure of variation when using the median to measure center. It is defined as the difference of the third and the first quartiles: IQR=Q3-Q1.

  • When the center and the spread are measured by the median and the IQR, a value in the data is considered an outlier if the value is

    • greater than Q3 + 1.5 \(\cdot\) IQR or
    • less than Q1 − 1.5 \(\cdot\) IQR.

    Note: An outlier in this definition is also called a mild outlier. An outlier that is less than Q1 − 3 \(\cdot\) IQR or greater than Q3 + 3 \(\cdot\) IQR is also called an extreme outlier.

  • The minimum, Q1, Q2, Q3 and maximum are known as the "five-number summary" of the data set.

  • The difference of maximum and minimum is called the range.

70 / 377

Example: Median, IQR and Outliers

Find the median, quartiles, IQR and outliers (if they exist) of the sample height of 15 trees.

70, 65, 63, 72, 81, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75

Solution:

  • Sort the data set from small to large.

    63, 65, 66, 69, 70, 72, 75, 75, 75, 76, 76, 79, 80, 81, 83

  • Find the median, i.e. Q2. The sample size is 15. The middle of the ordered data set is the \(\lceil 15/2 \rceil=8\)-th number, which is 75.
  • Find Q1 and Q3. Q1 is the median of the values below the median position. Q3 is the median of the values above the median position. In this example, Q1 is the 4-th number, 69. Q3 is the 4-th from the last, that is 79.
  • IQR=Q3-Q1=79-69=10.
  • Since Q1-1.5IQR=69-1.5 \(\cdot\) 10=54 and Q3+1.5IQR=79+1.5 \(\cdot\) 10=94, there is no outlier in this sample.
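
The steps of this solution can be sketched in Python as follows. The function name `five_number_summary` is hypothetical, and the quartiles are computed with the median-of-halves method used in the solution. Other software may interpolate quartiles differently and return slightly different values.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves method."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]  # values before the median position
    # For odd n the middle value itself is left out of both halves.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


heights = [70, 65, 63, 72, 81, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75]
minimum, q1, q2, q3, maximum = five_number_summary(heights)

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr   # values below this are outliers
high_fence = q3 + 1.5 * iqr  # values above this are outliers
outliers = [x for x in heights if x < low_fence or x > high_fence]

print(q1, q2, q3, iqr, low_fence, high_fence, outliers)
```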
71 / 377

Practice: Range and IQR

72 / 377

Box Plot

  • A box plot shows a "five-number summary" of the data set. It contains a box, two whiskers and dots (for outliers).

  • To create the boxplot for a distribution,

    • Draw a box from Q1 to Q3.

    • Draw a vertical line in the box at the median.

    • Extend a whisker from Q1 to the smallest value that is not an outlier and from Q3 to the largest value that is not an outlier.

    • Indicate outliers with a solid dot.

73 / 377

Example: Box plot - ages of best oscar winners

Create the boxplot for the ages of 32 best actor oscar winners (1970–2001).

31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42, 43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76

Solution: We may use Excel to find the five-number summary.

  • Q2=42.5, Q1=37.5, Q3=49.5, IQR=12, 1.5IQR=18, Q1-1.5IQR= 19.5, Q3+1.5IQR=67.5

  • The smallest number that is not an outlier is 31. The largest number that is not an outlier is 61. These two numbers bound the whiskers.

  • There is an outlier 76.

  • The boxplot is shown below.

74 / 377

Practice: the five number summary and the boxplot

75 / 377

Measure of Variation about Population Mean

  • The deviation of an entry \(x\) in a population data set is the difference \(x-\mu\), where \(\mu\) is the mean of the population.

  • The population variance of a population of \(N\) entries is defined as $$\text{VAR.P}=\sigma^2=\frac{\sum(x-\mu)^2}{N}.$$

  • The population standard deviation is $$\text{STDEV.P}=\sigma=\sqrt{\frac{\sum(x-\mu)^2}{N}}.$$

76 / 377

Measure of Variation about Sample Mean

  • The deviation of an entry \(x\) in a sample data set is the difference \(x-\bar{x}\), where \(\bar{x}\) is the mean of the sample.

  • The sample variance and sample standard deviation are defined similarly: $$\text{VAR.S}=s^2=\frac{\sum(x-\bar{x})^2}{n-1},\qquad \text{STDEV.S}=s=\sqrt{\frac{\sum(x-\bar{x})^2}{n-1}},$$ where \(n\) is the sample size.

  • Rounding rule: for the mean, variance and standard deviation, we keep one more decimal place than the data set has.

Note: To measure the spread, one may also use the mean absolute deviation $$MAD=\dfrac{\sum |x-\bar{x}|}{n}.$$ However, the standard deviation has better properties in applications.

77 / 377

Show how to use Excel to find SD

Example: Standard deviation - ages of oscar winners

Find the mean and standard deviation of the ages of a sample of 32 Best Actor Oscar winners (1970–2001).

31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42, 43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76

Solution: We use the Excel functions AVERAGE() and STDEV.S() to find the mean and sample standard deviation respectively. The mean is 44.7. The sample standard deviation is 10.3.
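
For readers who want to see the formulas at work outside Excel, here is a minimal Python sketch that computes the same quantities directly from the definitions. It assumes that `statistics.stdev` and `statistics.pstdev` divide by \(n-1\) and \(n\) respectively, which matches the definitions of the sample and population standard deviations given earlier.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

ages = [31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42,
        43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76]

n = len(ages)
x_bar = mean(ages)

# Sum of squared deviations from the mean.
ss = sum((x - x_bar) ** 2 for x in ages)

s_by_formula = sqrt(ss / (n - 1))   # sample standard deviation
sigma_by_formula = sqrt(ss / n)     # population standard deviation
mad = sum(abs(x - x_bar) for x in ages) / n  # mean absolute deviation

print(round(x_bar, 1), round(s_by_formula, 1))
```

The library functions `stdev(ages)` and `pstdev(ages)` should agree with the two values computed by hand.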

78 / 377

Practice: Standard deviation

A sample of GPAs from ten students randomly chosen from a college is recorded as follows: 1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33

Find the standard deviation of this sample.

79 / 377

Mean and Standard Deviation under Linear Transformation

  • When we increase every value in a data set by a fixed number \(c\), the standard deviation does not change. However, the mean increases by \(c\).

  • When we multiply every value in a data set by a factor \(k\), the mean is multiplied by \(k\) and the standard deviation is multiplied by \(|k|\).
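
Both rules can be checked numerically. The sketch below reuses the city mpg sample from the mean example. The shift `c` and the factor `k` are arbitrary values chosen for illustration, and `k` is negative on purpose to show that the standard deviation scales by \(|k|\) and stays nonnegative.

```python
from statistics import mean, stdev

data = [18, 21, 20, 21, 16, 18, 18, 18, 16, 20]
c = 5    # hypothetical shift
k = -2   # hypothetical scale factor

shifted = [x + c for x in data]
scaled = [k * x for x in data]

# Adding c moves the mean by c and leaves the spread unchanged.
print(mean(shifted) - mean(data), stdev(shifted) - stdev(data))

# Multiplying by k scales the mean by k and the spread by |k|.
print(mean(scaled) / mean(data), stdev(scaled) / stdev(data))
```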

80 / 377

Effect of Changes of Data on Statistical Measures

81 / 377

Practice: Standard deviation under a transformation

A sample of the highest temperatures on 10 days has a standard deviation of \(5^\circ\mathrm{C}\).

  1. If we want to know the standard deviation in Fahrenheit, do we need to recalculate it from the sample?

  2. What is the standard deviation in Fahrenheit?

82 / 377

The Empirical Rule

If a data set has an approximately bell-shaped distribution, then

  1. approximately 68% of the data lie within one standard deviation of the mean.

  2. approximately 95% of the data lie within two standard deviations of the mean.

  3. approximately 99.7% of the data lie within three standard deviations of the mean.

Empirical Rule

83 / 377

Chebyshev’s Theorem

For any numerical data set, at least \(1-1/k^2\) of the data lie within \(k\) standard deviations of the mean, where \(k\) is any positive whole number that is at least 2.

Empirical Rule

84 / 377

Example: Applications of the Empirical Rule

A population data set with a bell-shaped distribution has mean \(\mu = 6\) and standard deviation \(\sigma = 2\). Find the approximate proportion of observations in the data set that lie:

  1. between 4 and 8;
  2. below 4.

Solution: By the Empirical Rule, about 68% of the data lie between 6-2=4 and 6+2=8. Since the distribution is symmetric, about 34% of the data lie between 4 and 6, and about 34% lie between 6 and 8. Therefore about 50%-34%=16% of the data lie below 4.
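
The Empirical Rule is a rounded version of what an exactly normal distribution gives. As a sanity check, the sketch below assumes the population is exactly normal with \(\mu=6\) and \(\sigma=2\) (an idealization that the example does not state) and compares the normal-model proportions with the rule-of-thumb values.

```python
from statistics import NormalDist

# Idealized model: an exactly normal population with mean 6 and SD 2.
dist = NormalDist(mu=6, sigma=2)

between_4_and_8 = dist.cdf(8) - dist.cdf(4)  # within one SD of the mean
below_4 = dist.cdf(4)                        # more than one SD below the mean

# Rule-of-thumb value: half of the 32% that lies outside one SD.
rule_below_4 = (1 - 0.68) / 2

print(round(between_4_and_8, 4), round(below_4, 4), round(rule_below_4, 2))
```

The two approaches agree to about two decimal places, which is the level of accuracy the Empirical Rule is meant to provide.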

85 / 377

Example: Applications of Chebyshev's Theorem

A sample data set has mean \(\bar{x}=6\) and standard deviation \(s = 2\). Find the minimum proportion of observations in the data set that must lie between 2 and 10.

Solution: By Chebyshev's theorem with \(k=2\), at least \(1-1/2^2=75\%\) of the data lie between \(\bar{x}-2s=2\) and \(\bar{x}+2s=10\).
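
The computation in this example can be written as a small Python sketch. The function name `chebyshev_lower_bound` is made up for illustration. The code first recovers \(k\) from the interval endpoints and then evaluates \(1-1/k^2\).

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


x_bar, s = 6, 2
low, high = 2, 10

# The interval must be symmetric about the mean for the theorem to apply.
assert x_bar - low == high - x_bar

k = (high - x_bar) / s
bound = chebyshev_lower_bound(k)
print(k, bound)
```

Unlike the Empirical Rule, this bound does not assume a bell-shaped distribution, which is why it is weaker (75% instead of about 95% for \(k=2\)).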

86 / 377

Practice: The empirical rule

87 / 377

Practice: Chebyshev’s Theorem

A sample data set has mean \(\bar{x}=10\) and standard deviation \(s = 3\). Find the minimum proportion of observations in the data set that must lie between 1 and 19.

88 / 377

Practice: Change of Measures on Transformation of Data

A teacher decides to curve the final exam by adding 10 points to each student's score. Which of the following statistics will NOT change:
A. median, B. mean, C. interquartile range, D. standard deviation?
Please explain your conclusion.

89 / 377

Practice: Understand Standard Deviation From Graphs

Which distribution of data has the SMALLEST standard deviation? Please explain your conclusion.

Distributions with different standard deviation

90 / 377

Lab: How to Find the Mean, Median, Quartiles and Standard Deviation

  • To find the mean, you may use the function AVERAGE().

  • To find the median, you may use the function MEDIAN().

  • To find quartiles, you may use the function QUARTILE.EXC().

  • To find the population standard deviation, you may use the function STDEV.P().

  • To find the sample standard deviation, you may use the function STDEV.S().
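
For comparison, Python's standard `statistics` module has functions that play similar roles. The pairing with the Excel names below is an assumption based on the definitions, not something stated in these slides. In particular, quartile conventions differ between tools, so `quantiles(..., method="exclusive")` is offered only as the closest analogue of QUARTILE.EXC(). The data are the tree heights from the median and IQR example.

```python
from statistics import mean, median, pstdev, quantiles, stdev

heights = [70, 65, 63, 72, 81, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75]

summary = {
    "AVERAGE": mean(heights),
    "MEDIAN": median(heights),
    # Cut points Q1, Q2, Q3 with the "exclusive" interpolation rule.
    "QUARTILE.EXC": quantiles(heights, n=4, method="exclusive"),
    "STDEV.P": pstdev(heights),   # divides by n
    "STDEV.S": stdev(heights),    # divides by n - 1
}

for name, value in summary.items():
    print(name, value)
```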

91 / 377

Lab: How to Create a Boxplot in Excel

  • Select your data—either a single data series, or multiple data series.

  • Click Insert > Insert Statistic Chart > Box and Whisker to create a boxplot.

For more information, see Create a box and whisker chart in Excel 365

92 / 377

Lab: Practice - Car Speeds

Consider the following sample that consists of speeds of 20 cars.

# NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA

Use Excel to answer the following questions

  1. Find the mean, median, quartiles and standard deviation of the sample.
  2. Create a boxplot to describe the data in the sample.
93 / 377

Linear Relationship

94 / 377

Learning Goals for Linear Regressions

  • Summarize and interpret the relationship between two quantitative variables.

  • Demonstrate understanding of concepts pertaining to linear regression.

  • Use regression equations to make predictions and understand its limits.

95 / 377

Scatterplots (1/5)

  • Correlation refers to a relationship between two quantitative variables:

    • the independent (or explanatory) variable, usually denoted by \(x\).

    • the dependent (or response) variable, usually denoted by \(y\).

  • Example: In a study of education attainment and annual salary, the years of education is the explanatory variable and the annual salary is the response variable.

  • To describe the relationship between two quantitative variables, statisticians use a scatterplot.

  • In a scatterplot, we describe the overall pattern with descriptions of direction, form, and strength.

96 / 377

Scatterplots (2/5)

  • Positive relationship: the response variable (y) increases when the explanatory variable (x) increases.

  • Negative relationship: the response variable (y) decreases when the explanatory variable (x) increases.

97 / 377

Scatterplots (3/5)

  • Linear form

  • Curvilinear form

  • No obvious relationship

98 / 377

Scatterplots (4/5)

  • The strength of the relationship is a description of how closely the data follow the form of the relationship.

A picture shows a strong relationship

A picture shows a weaker relationship

99 / 377

Scatterplots (5/5)

  • Outliers are points that deviate from the pattern of the relationship.

A picture shows an outlier to a relationship

100 / 377

Practice: Match Scatterplots

A: X = month (January = 1), Y = rainfall (inches) in Napa, CA in 2010 (Note: Napa has rain in the winter months and months with little to no rainfall in summer.)

B: X = month (January = 1), Y = average temperature in Boston MA in 2010 (Note: Boston has cold winters and hot summers.)

C: X = year (in five-year increments from 1970), Y = Medicare costs (in $) (Note: the yearly increase in Medicare costs has gotten bigger and bigger over time.)

D: X = average temperature in Boston MA (°F), Y = average temperature in Boston MA (°C) each month in 2010

E: X = chest girth (cm), Y = shoulder girth (cm) for a sample of men

F: X = engine displacement (liters), Y = city miles per gallon for a sample of cars (Note: engine displacement is roughly a measure of engine size. Large engines use more gas.)

101 / 377

The Correlation Coefficient - Definition (1/2)

  • The correlation coefficient \(r\) is a numeric measure of the strength and direction of a linear relationship between two quantitative variables: $$r=\frac{\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{n-1},$$ where \(n\) is the sample size, \(x\) is a data value for the explanatory variable, \(\bar{x}\) is the mean of the \(x\)-values, \(s_x\) is the standard deviation of the \(x\)-values, and similarly for the notations involving \(y\).

  • The expression \(z=\frac{x-\bar{x}}{s_x}\) is known as the standardized variable (or \(z\)-score) which

    • doesn't depend on the unit of the variable \(x\),
    • has mean \(0\) and standard deviation 1.
  • In Excel, the correlation coefficient can be calculated using the function CORREL().

  • Scatterplots with different correlation coefficients

102 / 377

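$$r=\frac{\vec{x}\cdot\vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}=\frac{\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{\sqrt{\sum\left(\frac{x-\bar{x}}{s_x}\right)^2}\sqrt{\sum\left(\frac{y-\bar{y}}{s_y}\right)^2}}=\frac{\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{n-1}.$$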

The Correlation Coefficient - Definition (2/2)

  • Rounding Rule: Round to the nearest thousandth for \(r\), \(m\) and \(b\).

  • Geometric explanation of the definition of \(r\).

\(r=\) 0.816

\(r=0.420\)

  • Remark:
    • \(r>0\) if all points \((x-\bar{x}, y-\bar{y})\) are in the 1st and the 3rd quadrants.
    • \(r<0\) if all points \((x-\bar{x}, y-\bar{y})\) are in the 2nd and the 4th quadrants.
103 / 377

The Correlation Coefficient - Properties

104 / 377

Guess the Correlation Coefficient

105 / 377

The Correlation Coefficient - Example (1/2)

Describe the relationship between Midterm 1 and Final for a sample of 11 students.

Midterm1 Final
72 72
93 88
81 82
82 82
94 88
80 77
73 78
71 77
81 76
81 76
63 68

Solution: First we create a scatterplot.

Using the Excel function CORREL(x, y), we find the correlation coefficient is \(r=0.905\) .

The \(r\)-value shows a strong positive linear relationship.

106 / 377

The Correlation Coefficient - Example (2/2)

  • \(r\) can also be calculated by hand using the formula \(r=\dfrac{\sum z_xz_y}{n-1}\), where \(z_x=\frac{x-\bar{x}}{s_x}\) and \(z_y=\frac{y-\bar{y}}{s_y}\).
Midterm1 Final z_x z_y z_x*z_y
72 72 -0.78006 -1.06926 0.834087814
93 88 1.50088 1.544483 2.318083715
81 82 0.197484 0.56433 0.111446332
82 82 0.306101 0.56433 0.172741815
94 88 1.609497 1.544483 2.485839773
80 77 0.088868 -0.25246 -0.02243591
73 78 -0.67145 -0.0891 0.059829084
71 77 -0.88868 -0.25246 0.224359064
81 76 0.197484 -0.41582 -0.08211835
81 76 0.197484 -0.41582 -0.08211835
63 68 -1.75761 -1.72269 3.027820885
79.18182 78.54545 <- mean sum -> 9.047535876
9.206717 6.121497 <- stdev.s correl -> 0.904753588
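
The hand calculation in this table can be reproduced with a short Python sketch: standardize each variable, multiply the \(z\)-scores pairwise, and divide the sum by \(n-1\). The helper name `z_scores` is made up for illustration.

```python
from statistics import mean, stdev

midterm1 = [72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63]
final = [72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68]


def z_scores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


zx = z_scores(midterm1)
zy = z_scores(final)
n = len(midterm1)

# r is the sum of the products of z-scores divided by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
print(round(r, 3))
```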
107 / 377

Practice: Years And Winning Times

The tables show a sample of 23 records on years and winning times for the 1,500 meter race in Olympic Games.

  • Draw a scatter plot for the data table.
  • Is it appropriate to study the relationship using a linear model?
  • Find and interpret the correlation coefficient.
Year Time
1900 246.0
1904 245.4
1908 243.4
1912 236.8
1920 241.8
1924 233.6
1928 233.2
1932 231.2
1936 227.8
1948 229.8
1952 225.1
1956 221.2
Year Time
1960 215.60
1964 218.10
1968 214.90
1972 216.30
1976 219.20
1980 218.40
1984 212.53
1988 215.96
1992 220.12
1996 215.78
2000 212.32

Source: suny-wmopen-concepts-statistics

108 / 377

Correlation vs. Causation

  • Correlation is described by data from observational studies. Observational studies cannot prove cause and effect; establishing causation requires controlled experiments and rigorous inference.

  • Correlation may be used to make a prediction which is probabilistic.

  • In a linear relationship, an \(r\)-value that is close to 1 or -1 is insufficient to claim that the explanatory variable causes changes in the response variable. The correct interpretation is that there is a statistical relationship between the variables.

  • A lurking variable is a variable that is not measured in the study, but affects the interpretation of the relationship between the explanatory and response variables.

109 / 377

Example: Correlation vs. Causation (1/2)

The scatterplot below shows the relationship between the number of firefighters sent to fires (x) and the amount of damage caused by fires (y) in a certain city.

Can we conclude that the increase in firefighters causes the increase in damage?

110 / 377

Example: Correlation vs. Causation (2/2)

Solution:

  1. Correlation: The more firefighters are sent, the greater the damage tends to be. However, the firefighters do not cause the fire.

  2. Prediction: You could predict the amount of damage by looking at the number of firefighters present.

  3. Causation: The firefighters are unlikely to be the cause of the damage.

  4. Lurking variable: The seriousness of the fire is a lurking variable.

111 / 377

Lab: Scatter Plots and Correlation Coefficient

  • To create a scatter plot, first select the data sets, and then look for Insert Scatter(X, Y) in the menu Insert-> Charts.

  • The correlation coefficient \(r\) can be calculated by the Excel function correl().

112 / 377

Linear Regression

113 / 377

Learning Goals for Linear Regressions

  • Summarize and interpret the relationship between two quantitative variables.

  • Demonstrate understanding of concepts pertaining to linear regression.

  • Use regression equations to make predictions and understand its limits.

114 / 377

The Regression Line (1/2)

  • The line that best summarizes a linear relationship is the least squares regression line. The regression line is the line with the smallest sum of squares of the errors (SSE).

  • We use the least-squares regression line to predict the value \(\hat{y}\) for a value of the explanatory variable \(x\).

  • The regression line is unique and passes through \((\bar{x}, \bar{y})\). The equation is given by $$\hat{y}=m(x-\bar{x})+\bar{y}=m x+b,$$ where the slope is $$m=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sum(x-\bar{x})^2}=r\frac{s_y}{s_x}$$ and the \(y\)-intercept is \(b=\bar{y}-m\bar{x}.\)

115 / 377

The Regression Line (2/2)

  • The error of a prediction is $$\text{Error}=\text{Observed}-\text{Predicted}=y-\hat{y}.$$

  • A prediction beyond the range of the data is called extrapolation.

116 / 377

Example: Old Faithful Geyser (1/2)

The following sample is taken from data about the Old Faithful geyser.

  1. Study the linear relationship.
  2. Find the regression line, and the predicted value and the error if the eruption time is 1.8 minutes.
eruptions waiting eruptions waiting
3.917 84 1.75 62
4.200 78 4.80 84
1.750 47 1.60 52
4.700 83 4.25 79
2.167 52 1.80 51

117 / 377

Example: Old Faithful Geyser (2/2)

Solution: The Scatterplot shows a linear relationship.

To find the regression line, we use the Excel function SLOPE() to find the slope \(m\). In this example, \(m= 10.836\).

We use the Excel function INTERCEPT() to find the \(y\)-intercept \(b\). In this example, \(b= 33.68\).

The equation of the line is \(\hat{y}=10.836x + 33.68\).

When \(x=1.8\), we have \(\hat{y}=10.836\cdot 1.8 + 33.68= 53.1848\).

The error is \(y-\hat{y}=51-53.1848= -2.1848\). That means the prediction overestimates the waiting time by about 2.18 minutes.
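
The same numbers can be obtained from the slope and intercept formulas of the regression line. The following Python sketch applies those formulas to the Old Faithful sample, with eruption time as \(x\) and waiting time as \(y\). The function name `predict` is made up for illustration.

```python
from statistics import mean

eruptions = [3.917, 4.200, 1.750, 4.700, 2.167, 1.75, 4.80, 1.60, 4.25, 1.80]
waiting = [84, 78, 47, 83, 52, 62, 84, 52, 79, 51]

x_bar, y_bar = mean(eruptions), mean(waiting)

# Slope m = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2).
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(eruptions, waiting))
sxx = sum((x - x_bar) ** 2 for x in eruptions)
m = sxy / sxx

# The line passes through (x_bar, y_bar), so b = y_bar - m * x_bar.
b = y_bar - m * x_bar


def predict(x):
    """Predicted waiting time for an eruption lasting x minutes."""
    return m * x + b


# Error = observed - predicted for the observation (1.8, 51).
residual = 51 - predict(1.8)
print(round(m, 3), round(b, 2), round(residual, 2))
```

A negative residual means the observed value lies below the line, so the line overestimates that observation.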

118 / 377

Practice: Years and Winning Times

The tables show a sample of 23 records on years and winning times for the 1,500 meter race in Olympic Games.

  • Is it appropriate to study the relationship using a linear model?
  • Find an equation of the regression line.
  • Make a prediction of the winning time for the year 1998.
  • What is the residual for the year 1992?
  • Find and interpret the coefficient of determination.
Year Time
1900 246.0
1904 245.4
1908 243.4
1912 236.8
1920 241.8
1924 233.6
1928 233.2
1932 231.2
1936 227.8
1948 229.8
1952 225.1
1956 221.2
Year Time
1960 215.60
1964 218.10
1968 214.90
1972 216.30
1976 219.20
1980 218.40
1984 212.53
1988 215.96
1992 220.12
1996 215.78
2000 212.32

Source: suny-wmopen-concepts-statistics

119 / 377

Assessing the Fit of a Regression Line (1/2)

  • The prediction error is also called a residual. Another way to express the previous equation is $$y=\hat{y}+\text{residual}.$$

  • Residual plots are used to determine if a linear model is appropriate.

  • A random pattern (or no obvious pattern) indicates a good fit of a linear model. See Assessing the Fit of a Line (2 of 4) in Concepts in Statistics for examples.

  • One measure of the fit of a regression line is the proportion of the variation in the response variable that is explained by the least-squares regression line.

    • The total variation is \(SSD=\sum(y-\bar{y})^2\).
    • The explained variation is \(SSR=\sum(\hat{y}-\bar{y})^2\).
    • The coefficient of determination is $$r^2=\dfrac{SSR}{SSD}=\dfrac{\sum(\hat{y}-\bar{y})^2}{\sum(y-\bar{y})^2}.$$
120 / 377

Assessing the Fit of a Regression Line (2/2)

  • Another measure of the fit of the regression line is the residual standard error (or standard error of the regression), $$s_e=\sqrt{\dfrac{SSE}{n-2}},$$ where \(SSE=\sum (y-\hat{y})^2\) is the sum of squared errors. In Excel, it is calculated by the function STEYX().

  • The smaller \(s_e\) is, the more accurate the prediction is.

Remark:

  • The \(r\) in the coefficient of determination is the correlation coefficient. Equivalently, \(r=\pm\sqrt{r^2}\).
  • The smaller the standard error, the larger the coefficient of determination: $$r^2=1-\dfrac{SSE}{SSD}=1-\dfrac{(n-2)s_e^2}{SSD}.$$
121 / 377
  • \(n−2\) is the degrees of freedom. We lose two degrees of freedom because we estimate the slope and the \(y\)-intercept.

  • In a linear regression model \(Y=\beta_0 + \beta_1 X +\epsilon\), even if we know \(\beta_0\) and \(\beta_1\) for the population, we still need to estimate the standard deviation of the error.

Example: Coefficient of Determination

Find the coefficient of determination for the data of midterm1 and final

Midterm1 Final
72 72
93 88
81 82
82 82
94 88
80 77
73 78
71 77
81 76
81 76
63 68

Solution:

The correlation coefficient is \(0.905\).

The coefficient of determination is $$r^2=0.905^2\approx 0.819.$$

122 / 377

Example: Residual Standard Errors

Find the residual standard error of the regression line for the data of midterm1 and final

Midterm1 Final
72 72
93 88
81 82
82 82
94 88
80 77
73 78
71 77
81 76
81 76
63 68

Solution: In Excel we can use STEYX() to find the residual standard error. For the midterm1 and final data, \(s_e\approx 2.748\). For comparison, the residual standard error of the regression in the Old Faithful example is \(s_e\approx 5.258\).
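
To make the two fit measures concrete, here is a minimal Python sketch that computes \(r^2=SSR/SSD\) and \(s_e=\sqrt{SSE/(n-2)}\) from their definitions. The function name `fit_measures` is hypothetical. The sketch is run on both the midterm1 and final data and the Old Faithful sample, so the values for the two data sets can be told apart.

```python
from math import sqrt
from statistics import mean


def fit_measures(xs, ys):
    """Return (r_squared, s_e) for the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - m * x_bar
    y_hat = [m * x + b for x in xs]
    ssd = sum((y - y_bar) ** 2 for y in ys)               # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # squared errors
    # For a least-squares line, SSD = SSR + SSE up to rounding.
    assert abs(ssd - (ssr + sse)) < 1e-6 * ssd
    return ssr / ssd, sqrt(sse / (n - 2))


midterm1 = [72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63]
final = [72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68]
eruptions = [3.917, 4.200, 1.750, 4.700, 2.167, 1.75, 4.80, 1.60, 4.25, 1.80]
waiting = [84, 78, 47, 83, 52, 62, 84, 52, 79, 51]

r2_exam, se_exam = fit_measures(midterm1, final)
r2_geyser, se_geyser = fit_measures(eruptions, waiting)
print(round(r2_exam, 3), round(se_exam, 3), round(se_geyser, 3))
```

The internal check that \(SSD=SSR+SSE\) is the reason the two expressions \(SSR/SSD\) and \(1-SSE/SSD\) for \(r^2\) agree.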

123 / 377

Lab: Slope, \(y\)-intercept, \(r^2\) and \(s_e\)

  • The slope of a linear regression can be calculated by the Excel function SLOPE().

  • The \(y\)-intercept of a linear regression can be calculated by the Excel function INTERCEPT().

  • The coefficient of determination can be calculated by first finding \(r\) and then squaring it to get \(r^2\).

  • The standard error of the regression (residual standard error) can be calculated by the Excel function STEYX().

125 / 377

Two-Way Tables and Relations Between Categorical Variables

126 / 377

Learning Goals for Two-way Tables

  • Summarize and interpret the relationship between two qualitative (categorical) variables using two-way tables.

  • Demonstrate understanding of conditional, joint and marginal probabilities and find them from a two-way frequency table.

  • Create and analyze two-way tables to answer probability questions.

127 / 377

Two-way Frequency Tables (1/2)

  • As we organize and analyze data from two categorical variables, we make use of two-way tables.

  • Information in a two-way frequency table:

    • Values of the two variables are displayed in the left column and the top row.

    • The body of table consists of frequency counts associated to pairs of values of the two variables.

    • The right column and the bottom row, which are called margins of the table, consist of row totals and column totals respectively.

128 / 377

Two-way Frequency Tables (2/2)

  • A number in a margin is called a marginal frequency. The numbers in a margin together form a marginal distribution.

  • A number in the body of the table is called a joint frequency.

129 / 377

Example: Body Image and Gender

The following table summarizes the responses of a random sample of 1,200 U.S. college students as part of a larger survey.

About Right Overweight Underweight Row Totals
Female 560 163 37 760
Male 295 72 73 440
Column Totals 855 235 110 1,200
130 / 377

Two-Way Relative Frequency Tables and Probability

  • A two-way relative frequency table is obtained from a two-way frequency table by converting frequencies in a two-way table to relative frequencies.

  • Marginal probability $$P(X)=\frac{\text{Marginal frequency in}~ X}{\text{Total}}$$

  • Conditional probability $$P(X|Y)=\frac{\text{Joint frequency}}{\text{Marginal Frequency in}~Y} \quad \text{or}\quad P(Y|X)=\frac{\text{Joint frequency}}{\text{Marginal Frequency in}~X}$$

  • Joint probability $$P(X ~\text{and}~ Y)=\frac{\text{Joint frequency}}{\text{Total}}$$

  • Note that \(P(X~\text{and}~Y)=P(X)\cdot P(Y|X)=P(Y)\cdot P(X|Y).\)

131 / 377

Example: Joint and marginal probabilities of body image and gender

The following table shows joint and marginal probabilities of body image and gender.

About Right Overweight Underweight Row Totals
Female \(\frac{560}{1200}=46.67\%\) \(\frac{163}{1200}=13.58\%\) \(\frac{37}{1200}=3.08\%\) \(\frac{760}{1200}=63.33\%\)
Male \(\frac{295}{1200}=24.58\%\) \(\frac{72}{1200}=6.00\%\) \(\frac{73}{1200}=6.08\%\) \(\frac{440}{1200}=36.67\%\)
Column Totals \(\frac{855}{1200}=71.25\%\) \(\frac{235}{1200}=19.58\%\) \(\frac{110}{1200}=9.17\%\) \(\frac{1200}{1200}=100.00\%\)
132 / 377

Example: Conditional probabilities of body image by gender

The following table shows the conditional probabilities of each body image for a randomly selected female or male student.

About Right Overweight Underweight Row Totals
Female \(\frac{560}{760}=73.68\%\) \(\frac{163}{760}=21.45\%\) \(\frac{37}{760}=4.87\%\) \(\frac{760}{760}=100.00\%\)
Male \(\frac{295}{440}=67.05\%\) \(\frac{72}{440}=16.36\%\) \(\frac{73}{440}=16.59\%\) \(\frac{440}{440}=100.00\%\)
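
The joint and conditional percentages for the body image data can be reproduced from the raw counts. The Python sketch below stores the two-way frequency table as a nested dictionary (a layout chosen for illustration) and divides each count by the grand total for joint probabilities, or by its row total for conditional probabilities given gender.

```python
counts = {
    "Female": {"About Right": 560, "Overweight": 163, "Underweight": 37},
    "Male": {"About Right": 295, "Overweight": 72, "Underweight": 73},
}

row_totals = {g: sum(row.values()) for g, row in counts.items()}
total = sum(row_totals.values())

# Joint probability: joint frequency / grand total.
joint = {g: {img: c / total for img, c in row.items()}
         for g, row in counts.items()}

# Conditional probability of body image given gender:
# joint frequency / marginal frequency of that gender.
conditional = {g: {img: c / row_totals[g] for img, c in row.items()}
               for g, row in counts.items()}

for g in counts:
    print(g, {img: f"{p:.2%}" for img, p in conditional[g].items()})
```

Each row of the conditional table sums to 100%, while the joint table sums to 100% only over all six cells.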
133 / 377

Example: Community College Enrollment (1/2)

The following table summarizes the full-time enrollment at a community college.

Arts-Sci Bus-Econ Info Tech Health Science Graphics Design Culinary Arts Row Totals
Female 4,660 435 494 421 105 83 6,198
Male 4,334 490 564 223 97 94 5,802
Column Totals 8,994 925 1,058 644 202 177 12,000

What proportion of the total number of students are male students?

Solution: $$P(\text{Male})=\dfrac{5802}{12000}\approx 0.4835=48.35\%.$$

134 / 377

Example: Community College Enrollment (2/2)

If we select a male student at random, what is the probability that he is in the Info Tech program?
Solution: $$P(\text{Info Tech}|\text{Male})=\dfrac{564}{5802}\approx 0.097=9.7\%.$$ If we select a student at random, what is the probability that the student is both a male and in the Info Tech program?
Solution: $$P(\text{Male and Info Tech})=\dfrac{564}{12000}= 0.047=4.7\%.$$ The probabilities are related:
Solution: $$P(\text{Male and Info Tech})=\dfrac{564}{12000}=\dfrac{5802}{12000}\cdot \dfrac{564}{5802}=P(\text{Male})\cdot P(\text{Info Tech}|\text{Male}).$$
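
The relationship above can be checked numerically. The following Python sketch uses only the three counts quoted in the solution (564 male Info Tech students, 5,802 male students, 12,000 students in total); the variable names are illustrative assumptions, not part of the original example.

```python
from fractions import Fraction

# Counts quoted in the solution above.
male_info_tech = 564
male_total = 5802
grand_total = 12000

p_male = Fraction(male_total, grand_total)                # P(Male)
p_info_given_male = Fraction(male_info_tech, male_total)  # P(Info Tech | Male)
p_male_and_info = Fraction(male_info_tech, grand_total)   # P(Male and Info Tech)

# Multiplication rule: P(Male and Info Tech) = P(Male) * P(Info Tech | Male)
assert p_male_and_info == p_male * p_info_given_male

print(round(float(p_info_given_male), 3))  # 0.097
print(round(float(p_male_and_info), 3))    # 0.047
```

Exact fractions are used so that the multiplication rule holds as an identity rather than only up to rounding.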

135 / 377

Practice: A table relates weights and heights

This table relates the weights and heights of a group of individuals participating in an observational study.

Weight/Height Tall Medium Short
Obese 18 28 14
Normal 20 51 28
Underweight 12 25 9
  1. Find the total for each row and column.
  2. Find the probability that a randomly chosen individual from this group is Short.
  3. Find the probability that a randomly chosen individual from this group is Obese and Short.
  4. Find the probability that a randomly chosen individual from this group is Underweight given that the individual is Tall.

Source: https://courses.lumenlearning.com/introstats1/chapter/contingency-tables/

136 / 377

Test of (No) Association

  • To understand association between categorical variables, we may think conversely. How do we test no association?

  • If the conditional probabilities are nearly equal for all categories, there may be no association between the variables. Conversely, if the conditional probabilities differ enough, we can be confident that there is an association.

  • In general, the bigger the differences in the conditional probabilities, the stronger the association between the variables.

  • Two variables \(X\) and \(Y\) are independent if \(P(X~\text{and}~Y)=P(X)\cdot P(Y)\).

137 / 377

Example: Association between body image and gender (1/2)

Is body image related to gender?

About Right Overweight Underweight Row Totals
Female 560 163 37 760
Male 295 72 73 440
Column Totals 855 235 110 1,200
138 / 377

Example: Association between body image and gender (2/2)

Solution: Using Excel (stacked bar chart), we may compare the conditional body image distributions for females and males side by side.

Stacked bar chart for gender and body image.

Our analysis shows that the conditional distributions of body image for males and females are not the same. The difference is large enough to believe that these two categorical variables are in fact related.

139 / 377

Percentage Reduction of Risk

  • When calculating the probability of a negative outcome, we often refer to the probability as a risk.

  • In general, we are interested in determining how much a new treatment reduces the risk compared to a reference risk.

  • The percentage reduction of risk is

    $$\text{percentage reduction of risk}=\frac{\text{new treatment risk}-\text{reference risk}}{\text{reference risk}}.$$

140 / 377

Example: Risk and the Physicians’ Health Study (1/2)

Researchers in the Physicians’ Health Study (1989) designed a randomized double-blind experiment to determine whether aspirin reduces the risk of heart attack. Here are the final results.

Heart Attack No Heart Attack Row Totals
Aspirin 139 10,898 11,037
Placebo 239 10,795 11,034
Column Totals 378 21,693 22,071

Does aspirin lower the risk of having a heart attack?

141 / 377

Example: Risk and the Physicians’ Health Study (2/2)

Solution: To answer this question, we compare two conditional probabilities:

  • The probability of a heart attack given that aspirin was taken every other day. $$P(\text{heart attack}|\text{aspirin}) = 139 / 11,037 \approx 0.013$$
  • The probability of a heart attack given that a placebo was taken every other day. $$P(\text{heart attack}|\text{placebo}) = 239 / 11,034 \approx 0.022$$

The result shows that taking aspirin reduced the risk from 0.022 to 0.013.

The percentage reduction of risk is $$\frac{0.013-0.022}{0.022}=\frac{-0.009}{0.022}\approx -0.41.$$

Therefore, we conclude that taking aspirin results in a 41% reduction in risk.
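
As a rough check of the arithmetic, the sketch below recomputes both risks and the percentage reduction of risk directly from the table counts. The variable names are assumptions made for illustration.

```python
# Counts from the Physicians' Health Study table above.
aspirin_attacks, aspirin_total = 139, 11037
placebo_attacks, placebo_total = 239, 11034

risk_aspirin = aspirin_attacks / aspirin_total  # new treatment risk
risk_placebo = placebo_attacks / placebo_total  # reference risk

# (new treatment risk - reference risk) / reference risk
relative_change = (risk_aspirin - risk_placebo) / risk_placebo

print(f"{risk_aspirin:.3f} {risk_placebo:.3f} {relative_change:.3f}")
```

With unrounded risks the result is roughly -0.42; the 41% on the slide comes from rounding both risks to three decimal places before dividing. Either way, the negative sign indicates a reduction.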

142 / 377

Hypothetical Two-way Tables

A hypothetical two-way table, also known as a hypothetical 1000 two-way table, is a two-way table constructed from given probability conditions with 1000 as the total frequency. It can be used to answer complex probability questions.

143 / 377

Example: Birth gender prediction (1/2)

A pregnant woman often opts to have an ultrasound to predict the gender of her baby. Assume the following facts are known:

  • Fact 1: 48% of the babies born are female.
  • Fact 2: The proportion of girls correctly identified is 9 out of 10.
  • Fact 3: The proportion of boys correctly identified is 3 out of 4.

Use the above facts to answer the following questions.

  1. If the examination predicts a girl, how likely is it that the baby will be a girl?

  2. If the examination predicts a boy, how likely is it that the baby will be a boy?

144 / 377

Example: Birth gender prediction (2/2)

Solution: Assume that we have ultrasound predictions for 1,000 random babies.

  • Fact 1 means that \(P(\text{Girl})=48\%\).
  • Fact 2 means that \(P(\text{Predicted as girl}|\text{Girl})=9/10\).
  • Fact 3 means that \(P(\text{Predicted as boy}|\text{Boy})=3/4\).

Using those facts, we may create a two-way frequency table.

Girl Boy Row Totals
Predict Girl 0.90(480)= 432 0.25(520) = 130 432+130=562
Predict Boy 480-432=48 520-130=390 48+390=438
Column Totals 480 1000-480=520 1,000

If the examination predicts a girl, the probability that the born baby is a girl is $$P(\text{Girl}|\text{predict girl})=\frac{432}{562} \approx 0.769=76.9\%.$$

If the examination predicts a boy, the probability that the born baby is a boy is $$P(\text{Boy} | \text{predict boy}) = \frac{390}{438} \approx 0.890=89\%.$$
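
The construction of the hypothetical 1000 table can be sketched in a few lines of Python. This is only an illustration of the steps above, using integer arithmetic on the three stated facts; the variable names are assumptions.

```python
total = 1000

girls = total * 48 // 100          # Fact 1: 48% of babies are girls -> 480
boys = total - girls               # 520

girl_pred_girl = girls * 9 // 10   # Fact 2: 9 out of 10 girls identified -> 432
boy_pred_boy = boys * 3 // 4       # Fact 3: 3 out of 4 boys identified -> 390
girl_pred_boy = girls - girl_pred_girl   # girls predicted as boys -> 48
boy_pred_girl = boys - boy_pred_boy      # boys predicted as girls -> 130

# Condition on the prediction (row totals of the two-way table).
p_girl_given_pred_girl = girl_pred_girl / (girl_pred_girl + boy_pred_girl)
p_boy_given_pred_boy = boy_pred_boy / (boy_pred_boy + girl_pred_boy)

print(round(p_girl_given_pred_girl, 3), round(p_boy_given_pred_boy, 3))
```

The two printed values correspond to the conditional probabilities computed from the row totals 562 and 438.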

145 / 377

Practice

The table below is based on a 1988 study of accident records conducted by the Florida State Department of Highway Safety.

Nonfatal Fatal Row Totals
Seat Belt 412,368 510 412,878
No Seat Belt 162,527 1,601 164,128
Column Totals 574,895 2,111 577,006

Does wearing a seat belt lower the risk of an accident resulting in a fatality?

146 / 377

Practice: Drug screening

A large company has instituted a mandatory employee drug screening program. Assume that the drug test used is known to be 99% accurate. That is, if an employee is a drug user, the test will come back positive (“drug detected”) 99% of the time. If an employee is a non-drug user, then the test will come back negative (“no drug detected”) 99% of the time. Assume that 2% of the employees of the company are drug users.

If an employee’s drug test comes back positive, what is the probability that the test is wrong and the employee is in fact a non drug user?

147 / 377

Create Stacked Bar Chart

  • To create a stacked bar chart of a two-way table:

    • First select the data table.

    • Look for and click Insert Column or Bar Chart in the menu Insert-> Charts.

    • In the dropdown menu, choose the third option in 2-D Column (100% Stacked Column) or the third option in 2-D Bar (100% Stacked Bar).

    • To switch row/column, right-click the row axis or the column axis in the output graph, and choose the option Select Data... to make the switch.

148 / 377

Practice: Is there an association between gender and program selection

The following table summarizes results from a study on program selection and gender.

Arts-Sci Bus-Econ Info Tech Health Science Graphics Design Culinary Arts Row Totals
Female 4,660 435 494 421 105 83 6,198
Male 4,334 490 564 223 97 94 5,802
Column Totals 8,994 925 1,058 644 202 177 12,000

Use Excel to answer the following question about the study.

  • Is there an association between gender and program selection? Why or why not?

  • If they are associated, is the association strong or weak?

149 / 377

Basic Concepts of Probability

150 / 377

Learning Goals for Basic Concepts of Probability

  • Construct sample spaces and calculate probabilities for simple and/or compound events.

  • Use appropriate rules of probability for compound events.

151 / 377

Experiments, Sample Spaces, and Events

  • An experiment is a procedure that can be infinitely repeated and has a well-defined set of outcomes.

  • An outcome is the result of a single trial (individual repetition) of an experiment.

  • A chance experiment is an experiment that has more than one possible outcome and whose outcomes cannot be predicted with certainty.

  • The sample space of a random experiment is the set of all possible outcomes.

  • An event is a subset of the sample space.

152 / 377

Complement, Intersection and Union

  • The complement \(E^c\) of event \(E\) is the set of all outcomes in a sample space that are NOT included in event \(E\).

  • The intersection \(A\cap B\) of two events \(A\) and \(B\) is the set of all outcomes in the sample space that are shared by \(A\) and \(B\).

  • The union \(A\cup B\) of two events \(A\) and \(B\) is the set of all outcomes in the sample space that are either in \(A\) or \(B\).

  • Two events \(A\) and \(B\) are mutually exclusive if their intersection \(A\cap B\) is empty.

154 / 377

Venn Diagrams for Complement, Intersection, and Union

155 / 377

Classical Definition of Probability

  • A probability \(P(E)\) is a measure of how likely it is that an outcome in the event \(E\) will occur in a chance experiment.

  • Equally likely means that each outcome of an experiment occurs with equal chance.

  • When the outcomes in the sample space \(S\) of a chance experiment are equally likely, the probability of an event \(E\) is $$P(E)=\dfrac{\text{number of outcomes in }E}{\text{number of outcomes in }S}.$$

  • Chance experiments that involve tossing fair coins, rolling fair dice, or drawing a card from a well-mixed deck of cards have equally likely outcomes.

  • Note that many chance experiments do not have equally likely outcomes. For example, the majors of students in a class are not equally likely outcomes.

156 / 377

Example: Flipping a coin

Imagine flipping one fair coin (which means the chance of a head and the chance of a tail are the same). What is the probability of getting a head?

Solution: There are two possible outcomes: Head or Tail.

So the sample space is the set $$S = \{\text{Head}, \text{Tail}\}.$$

The event \(E\) of getting a head is the subset $$E=\{\text{Head}\}.$$

The probability \(P(E)=\dfrac{1}{2}=0.5\).

157 / 377

Empirical Probability

  • An empirical (or a statistical) probability is the relative frequency of occurrence of outcomes from observations in repeated experiments:

$$ \begin{aligned} P(E)=&\dfrac{\text{number of occurrence of event } E}{\text{total number of observations}}\\[0.5em] =&\dfrac{\text{frequency in }E}{\text{total frequency}}. \end{aligned} $$

158 / 377

Example: Chance of selecting a math major

A statistics class has 5 math majors and 20 other majors. If a student is randomly selected from the class, what's the probability that the selected student is a math major?

Solution: The sample space is the set of all students in the statistics class. The event \(E\) is the set of the 5 math majors. Then the probability is $$P(E)=\dfrac{\text{frequency of math majors}}{\text{total frequency of students}}=\dfrac{5}{25}=0.2.$$

159 / 377

Theoretical Probability

  • Theoretical probability is an expected value that can be calculated by mathematical theory and assumptions.

  • When all outcomes in the sample space are equally likely, the probability of a desired event \(E\), known as a theoretical probability, is calculated by

    $$P(E)=\dfrac{\text{number of desired outcomes for event }E}{\text{number of all possible outcomes}}.$$

  • Tree diagrams are often used for counting all possible outcomes.

160 / 377

Example: Flipping a coin twice

Find the probability of getting two heads when tossing a fair coin twice.

Solution: The first flip has two possible outcomes, and the second flip also has two possible outcomes. By the fundamental counting principle, the sample space \(S\) contains \(2\cdot 2=4\) possible outcomes.

A tree demonstrating outcomes of flipping a coin twice.

The event \(E\) of getting two heads contains only one outcome: head and head. So the probability of getting two heads when flipping a fair coin twice is $$P(E)=\frac{1}{4}.$$
161 / 377

Empirical vs Theoretical: Coin Flip Simulation

The purpose of this activity is to experiment with a simulation of flipping a fair coin and to see whether \(P(H) = 0.5\).

Source: GeoGebra License: CC BY SA

162 / 377

Law of Large Numbers

  • The empirical probability of an event is an "estimate" that is based upon observed data from an experiment.

  • The theoretical probability of an event is an "expected" probability based upon counting rules.

  • Law of Large Numbers: As an experiment is repeated over and over, that is, as the number of trials gets larger and larger, the empirical probability of an event approaches the theoretical probability of the event. (Wiki: Law of large numbers.)

  • By the law of large numbers, we can say that the probability of any event is the long-term relative frequency of that event.
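
The law of large numbers can be illustrated with a small simulation. The sketch below is not the GeoGebra activity referenced in these slides; it is a minimal stand-in that uses Python's standard `random` module, and the function name and the fixed seed are assumptions chosen for reproducibility.

```python
import random

def empirical_probability_of_heads(num_flips, seed=0):
    """Simulate num_flips fair coin flips and return the relative frequency of heads."""
    rng = random.Random(seed)  # seeded generator so runs are repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# The relative frequency tends to settle near the theoretical value 0.5
# as the number of flips grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, empirical_probability_of_heads(n))
```

For small numbers of flips the relative frequency can be far from 0.5; the law of large numbers only describes the long-run behavior.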

163 / 377

Example: Law of large numbers by a coin flipping simulation

A demonstration of the law of large numbers by simulating coin flipping.

164 / 377

Practice: Red light runner

165 / 377

Practice: Rolling two dice

Two fair dice are thrown. Find the probabilities of the following events:

  • the sum of the two numbers is 3.
  • the sum of the two numbers is at most 3.
166 / 377

Fundamental Properties (that define probability)

  • Property 1: For an event \(E\), the probability \(P(E)\) lies between 0 and 1: $$0\leq P(E)\leq 1.$$

  • Property 2: If \(S\) is the sample space, then \(P(S)=1\).

  • Property 3: The probability of an event \(E=\{e_1,e_2, \cdots, e_k\}\) of distinct outcomes is equal to the sum of the probabilities of the individual outcomes: $$P(E)=P(e_1)+P(e_2)+\cdots+P(e_k),$$ where \(P(e_i)\) is the probability of getting the outcome \(e_i\).

Remark: When an event \(E\) consists of infinitely many outcomes, the right hand side of the equality in Property 3 will be an infinite sum.

167 / 377

Two Easy Consequences

  • Easy consequence 1: If events \(A\) and \(B\) are mutually exclusive, then $$P(A\cup B)=P(A)+P(B).$$

  • Easy consequence 2: The probability \(P(E)\) of an event \(E\) and the probability \(P(E^c)\) of the complement event \(E^c\) satisfy the identity $$P(E)+P(E^c)=1.$$

    Equivalently, $$P(E^c)=1-P(E)\quad\text{or}\quad P(E)=1-P(E^c).$$

168 / 377

Example: Six-sided die (1/2)

A six-sided fair die is rolled.

  1. Find the probability of the event \(E\) getting a number less than 3, that is, \(E=\{x \mid x <3\}\).
  2. Find the probability of the complement \(E^c\) of the event \(E\).
  3. Verify that \(P(E)+P(E^c)=1\).
169 / 377

Example: Six-sided die (2/2)

Solution: The sample space of the six-sided die is \(\{1,2,3,4,5,6\}\).

The event \(E\) consists of 2 numbers: \(E=\{1, 2\}\).

The probability is $$P(E)=P(x=1)+P(x=2)=\frac16+\frac16=\frac26=\frac13.$$

The complement \(E^c=\{3, 4, 5, 6\}\) and the probability is $$ \begin{aligned} P(E^c)=&P(x=3)+P(x=4)+P(x=5)+P(x=6)\\ =&\frac16+\frac16+\frac16+\frac16 =\frac46=\frac23. \end{aligned} $$

It is clear that \(P(E)+P(E^c)=\frac13+\frac23=1.\)

170 / 377

Example: Sum of two numbers from rolling two dice (1/2)

Two six-sided fair dice were rolled. Find the probability of getting two numbers whose sum is at least 4.

Solution: Let \(E\) be the event that the sum is at least 4. Then the complement \(E^c\) consists of the pairs of numbers whose sum is at most 3. There are 3 such pairs: $$E^c=\{(1, 1), (1, 2), (2, 1)\}.$$

The sample space contains \(6\cdot 6=36\) possible outcomes.

171 / 377

Example: Sum of two numbers from rolling two dice (2/2)

Solution: (continued)

Therefore, $$ \begin{aligned} P(E^c)=&P((1,1))+P((1,2))+P((2,1))\\ =&\frac16\cdot\frac16+\frac16\cdot\frac16+\frac16\cdot\frac16=\frac{1}{12}. \end{aligned} $$

Applying the complement rule, we find $$P(E)=1-P(E^c)=\frac{11}{12}.$$
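
The complement-rule computation can be confirmed by listing the sample space. The following sketch enumerates all 36 ordered pairs with `itertools.product`; the variable names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die): 6 * 6 = 36 equally likely outcomes.
sample_space = list(product(range(1, 7), repeat=2))

# Complement event: the sum is at most 3.
complement = [pair for pair in sample_space if sum(pair) <= 3]

p_complement = Fraction(len(complement), len(sample_space))
p_event = 1 - p_complement  # complement rule

print(complement)             # [(1, 1), (1, 2), (2, 1)]
print(p_complement, p_event)  # 1/12 11/12
```

Counting the complement is easier here because it contains only 3 outcomes, while the event itself contains 33.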

172 / 377

Practice: Spinner with 12 numbers

173 / 377

Practice: M&M with a specific color

174 / 377

Probability for Chance Experiment with Equally likely Outcomes

When outcomes in the sample space are equally likely,

  • the probability of the intersection of two events is $$P(A\cap B)=\dfrac{\text{number of elements in }A\cap B}{\text{number of elements in the sample space }S}.$$

  • the probability of the union of two events is $$P(A\cup B)=\dfrac{\text{number of elements in }A\cup B}{\text{number of elements in the sample space }S}.$$

175 / 377

The Addition Rule

In general, the probability of the union of two events from a chance experiment is defined by the basic rules and the addition rule.

  • Addition Rule: the probability of the union of two events \(A\) and \(B\) is $$P(A\cup B)=P(A)+P(B)-P(A\cap B).$$
176 / 377

Example: Intersection, Union and Mutually Exclusive (1/2)

A card was randomly drawn from a deck of 52 cards.

  1. What's the probability of getting a heart?
  2. What's the probability of getting a face (king, queen, or jack)?
  3. What's the probability of getting a heart face?
  4. What's the probability of getting a heart or a face?
  5. What's the probability of getting a club and spade?

Standard 52 card deck

177 / 377

Example: Intersection, Union and Mutually Exclusive (2/2)

Solution: The sample space \(S\) consists of 52 cards as shown in the above picture. Among the hearts, there are 3 face cards. Then $$P(\text{heart and face})=\frac{3}{52}.$$

There are \(22\) cards that are hearts or faces. Then $$P(\text{heart or face})=\frac{22}{52}=\frac{11}{26}.$$

Since there is no card that is both a club and a spade, we have $$P(\text{club and spade})=0.$$

Note: Since \(P(\text{heart})=\frac{13}{52}\) and \(P(\text{face})=\frac{12}{52}\), the addition rule also gives $$P(\text{heart or face})=\frac{13}{52}+\frac{12}{52}-\frac{3}{52}=\frac{22}{52}=\frac{11}{26}.$$
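
The counts in this solution can be reproduced by building the deck explicitly. The sketch below represents each card as a hypothetical `(rank, suit)` pair and checks the addition rule; the helper names `prob`, `is_heart` and `is_face` are assumptions introduced for illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["club", "diamond", "heart", "spade"]
deck = list(product(ranks, suits))  # 52 equally likely (rank, suit) cards

def prob(event):
    """Classical probability: favorable cards divided by all cards."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_heart(card):
    return card[1] == "heart"

def is_face(card):
    return card[0] in {"J", "Q", "K"}

p_heart = prob(is_heart)
p_face = prob(is_face)
p_both = prob(lambda c: is_heart(c) and is_face(c))
p_either = prob(lambda c: is_heart(c) or is_face(c))

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
assert p_either == p_heart + p_face - p_both
print(p_both, p_either)  # 3/52 11/26
```

Subtracting `p_both` corrects for the three heart face cards, which would otherwise be counted twice.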

178 / 377

Practice: Addition rule

179 / 377

The Conditional Probability

  • The conditional probability of \(A\) given \(B\), written as \(P(A\mid B)\), is the probability that event \(A\) will occur given that the event \(B\) has already occurred.

  • In the case that the chance experiment has equally likely outcomes, the conditional probability is $$P(A\mid B)=\dfrac{\text{number of elements in }A\cap B}{\text{number of elements in }B}.$$

  • In general, we may use fundamental rules of probability and the multiplication rule to calculate the conditional probability.

180 / 377

The Multiplication Rule

  • Multiplication Rule: the probability of the intersection of two events \(A\) and \(B\) satisfies the following equality: $$P(A\cap B)=P(B)P(A\mid B)=P(A)P(B\mid A).$$

  • The multiplication rule gives a formula for conditional probability: $$P(B\mid A)=\dfrac{P(A\cap B)}{P(A)}\quad\text{and}\quad P(A\mid B)=\dfrac{P(A\cap B)}{P(B)}.$$

181 / 377

Independent Events

  • Two events \(A\) and \(B\) are independent if $$P(A\mid B)=P(A)\quad\text{or}\quad P(B)=P(B\mid A).$$ Equivalently, $$P(A\cap B)=P(A)P(B).$$

  • Fundamental Counting Principle: if there are \(m\) ways of doing something and \(n\) ways of doing another thing, then there are \(m\cdot n\) ways of performing both actions in order.

182 / 377

Example: Conditional probability

A fair six-sided die is rolled.

Find the probability that the number rolled is a two, given that it is even.

Solution: Let \(A\) be the event of all possible even outcomes. Then $$A=\{2, 4, 6\}.$$

Let \(B\) be the event consisting of the outcome 2. Then $$B=\{2\}.$$

The intersection event \(A\cap B\) consists of the number \(2\).

By the definition of conditional probability, $$P(B|A)=\dfrac{P(A\cap B)}{P(A)}=\dfrac{1}{3}.$$

183 / 377

Example: Fundamental counting principle and multiplication rule (1/2)

Consider flipping a fair coin and rolling a fair six-sided die together.

  1. What's the probability that the coin shows a head?
  2. Given that a head occurs, what's the probability that the die shows a number bigger than 4?
  3. What's the probability of getting a head and a number bigger than 4?
  4. Verify that flipping a head and rolling a number bigger than 4 are independent events.

Solution: By the fundamental counting principle, the sample space consists of \(2\times 6=12\) elements. Let \(H\) be the event of getting a head and \(D\) be the event of getting a number bigger than 4.

Then $$H=\{H1, H2, H3, H4, H5, H6\}, \qquad D\cap H=\{H5, H6\}.$$

184 / 377

Example: Fundamental counting principle and multiplication rule (2/2)

Solution: (continued)
The probability of getting a head is \(P(H)=\frac12\).

Given that a head shows, the chance of getting a number bigger than 4 is $$P(D\mid H)=\frac{2}{6}=\frac13.$$

By the multiplication rule, $$P(H\cap D)=P(H)P(D\mid H)=\frac12\cdot\frac13=\frac16.$$

Note that \(D=\{H5, T5, H6, T6\}\). Then $$P(D)=\frac{4}{12}=\frac{1}{3}=P(D\mid H).$$ Therefore, \(H\) and \(D\) are independent.
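
The independence check can also be carried out by enumerating the 12 outcomes. In this sketch each outcome is a hypothetical `(coin, die)` pair, and `prob` is a helper name assumed for illustration.

```python
from fractions import Fraction
from itertools import product

sample_space = set(product("HT", range(1, 7)))  # 2 * 6 = 12 outcomes

H = {o for o in sample_space if o[0] == "H"}  # coin shows a head
D = {o for o in sample_space if o[1] > 4}     # die shows 5 or 6

def prob(event):
    """Classical probability for equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

p_d_given_h = Fraction(len(D & H), len(H))    # P(D | H), counting within H

print(prob(H), p_d_given_h, prob(H & D))      # 1/2 1/3 1/6

# Independence: P(D | H) = P(D), equivalently P(H and D) = P(H) * P(D).
assert p_d_given_h == prob(D)
assert prob(H & D) == prob(H) * prob(D)
```

Both forms of the independence condition are checked, matching the two equivalent definitions given earlier.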

185 / 377

Example: From multiplication rule to addition rule (1/2)

The probability that a student borrows a statistics book from the library is 0.3. The probability that a student borrows a biology book is 0.4. Given that a student borrowed a biology book, the probability that he/she borrows a statistics book is 0.6.

  1. Find the probability that a student borrows a statistics book and a biology book.
  2. Find the probability that a student borrows a statistics book or a biology book.
186 / 377

Example: From multiplication rule to addition rule (2/2)

Solution: Denote by \(S\) the event that a student borrows a statistics book, and \(B\) the event that the student borrows a biology book.

From the given conditions, we know that \(P(S)=0.3\), \(P(B)=0.4\) and \(P(S\mid B)=0.6\).

By the multiplication rule, we know $$P(S\cap B)=P(B)P(S\mid B)=0.4\cdot 0.6=0.24.$$

By the addition rule, we get $$ \begin{aligned} P(S\cup B)=&P(S)+P(B)-P(S\cap B)\\ =&0.3+0.4-0.24=0.46. \end{aligned} $$

187 / 377

Sampling with Replacement or without Replacement

  • With replacement: If each member of a population is replaced after it is picked, then that member has the possibility of being chosen more than once. When sampling is done with replacement, then events are considered to be independent, meaning the result of the first pick will not change the probabilities for the second pick.

  • Without replacement: When sampling is done without replacement, each member of a population may be chosen only once. In this case, the probabilities for the second pick are affected by the result of the first pick. The events are considered to be dependent or not independent.

188 / 377

Example: Drawing cards with replacement

Two cards were randomly drawn from a standard deck of 52 cards with replacement. Find the probability of getting exactly one club card.

Solution: There are two different pairs with exactly one club card: (club, not club), (not club, club).

When drawing with replacement, the two draws are considered to be independent. Therefore, the probabilities of those two situations are $$P(\text{(club, not club)})=P(\text{club})\cdot P(\text{not club})=\frac{13}{52}\cdot\frac{39}{52},$$ $$P(\text{(not club, club)})=P(\text{not club})\cdot P(\text{club})=\frac{39}{52}\cdot\frac{13}{52}.$$ Then the probability of getting exactly one club is $$P(\text{exactly one club})=\frac{13}{52}\cdot\frac{39}{52}+\frac{39}{52}\cdot\frac{13}{52}=\frac{3}{8}.$$

189 / 377

Example: Drawing cards without replacement

Two cards were randomly drawn from a standard deck of 52 cards without replacement, which means the first card will not be put back.

  • Find the probability of getting two spades.
  • Find the probability of getting exactly one spade.

Solution: Let \(S1\) be the event of getting a spade in the first drawing and \(S2\) be the event of getting a spade in the second drawing. The probability \(P(S1)=\frac{13}{52}=\frac14\). Given that the first card is a spade, the probability of getting a second spade is \(P(S2\mid S1)=\frac{12}{51}\). Then the probability of getting two spades is $$P(S1\text{ and }S2)=P(S1)P(S2\mid S1)=\frac14\cdot\frac{12}{51}=\frac{3}{51}=\frac{1}{17}.$$

Let \(NS1\) and \(NS2\) be the events of not getting a spade in the first and second drawing respectively. The probability of getting exactly one spade is $$P(S1\text{ and }NS2)+P(NS1\text{ and }S2)=\frac{13}{52}\cdot\frac{39}{51}+\frac{39}{52}\cdot\frac{13}{51}=\frac{39}{102}.$$
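
Sampling without replacement can be checked by brute force: every ordered pair of distinct cards is equally likely. The sketch below enumerates all \(52\cdot 51\) ordered draws with `itertools.permutations`; the card encoding and the helper `count_spades` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Each card is a (rank, suit) pair; "S" stands for spades.
deck = [(rank, suit) for suit in "CDHS" for rank in range(1, 14)]

# Ordered pairs of distinct cards: drawing two cards without replacement.
draws = list(permutations(deck, 2))  # 52 * 51 = 2652 equally likely draws

def count_spades(pair):
    return sum(card[1] == "S" for card in pair)

p_two_spades = Fraction(sum(count_spades(d) == 2 for d in draws), len(draws))
p_one_spade = Fraction(sum(count_spades(d) == 1 for d in draws), len(draws))

print(p_two_spades, p_one_spade)  # 1/17 13/34
```

`Fraction` reduces automatically, so 13/34 is the same value as 39/102.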

190 / 377

Practice: Guess a password

191 / 377

Practice: Conditional probability

A special deck of 16 cards has 4 that are blue, 4 yellow, 4 green, and 4 red. The four cards of each color are numbered from one to four. A single card is drawn at random. Find the following probabilities.

  1. The probability that the card drawn is red.
  2. The probability that the card is red, given that it is not green.
  3. The probability that the card is red, given that it is neither red nor yellow.
  4. The probability that the card is red, given that it is not a four.
192 / 377

Practice: Conditional probability subject to complement

193 / 377

Practice: Pens drawn from a box without replacement

A box contains 10 pens, 6 black and 4 red. Two pens are drawn without replacement, which means that the first one is not put back.

  • What is the probability that both pens are red?
  • What is the probability that at most one pen is red?
  • What is the probability that at least one pen is red?
194 / 377

Practice: Classical question on basic rules of probability

195 / 377

Discrete Random Variables

196 / 377

Learning Goals for Discrete Random Variables

  • Demonstrate understanding of random variables

  • Demonstrate understanding of characteristics of binomial distributions.

  • Calculate accurate probabilities of discrete random variables and interpret them in a variety of settings.

197 / 377

Random Variables

  • A random variable, usually written \(X\), is a variable whose values are numerical quantities determined by the outcomes of a random experiment.

  • A discrete random variable takes on only a finite or countable number of distinct values. For example,

    • When rolling a fair die, the number of dots on the top face is a discrete random variable that takes on the possible values 1, 2, 3, 4, 5, 6.
    • When flipping a fair coin 10 times, the number of heads is a discrete random variable that takes on the possible values 0, 1, 2, ..., 10.
  • A continuous random variable takes on values which form an interval of numbers. For example,

    • The height of a randomly selected 10-year-old boy in the US is normally between 129 cm and 157 cm. So the height is a continuous random variable.
    • The voltage measured at a randomly selected electrical outlet is normally between 118 and 122 volts. So the measured voltage is a continuous random variable.
198 / 377

Practice: Discrete or continuous

Classify each random variable as either discrete or continuous.

  1. The number of boys in a randomly selected three-child family.

  2. The temperature of a cup of coffee served at a restaurant.

  3. The number of math majors in randomly selected group of 10 students.

  4. The amount of rain recorded in a small town one day.

199 / 377

Practice: Possible values of the variable

Identify the set of possible values for each random variable. (Make a reasonable estimate based on experience, where necessary.)

  1. The sum of numbers on the top of two fair dice.

  2. The waiting time of a randomly selected customer at a restaurant.

200 / 377

Probability Distributions

  • The probability distribution of a discrete random variable \(X\) is defined by the probability associated with each possible value of \(X\).

  • A probability distribution of a discrete random variable is usually characterized by a table of all possible values \(x\) together with probabilities \(P(x)\), or a probability histogram, or a formula.

  • A random variable \(X\) (discrete and continuous) always has a cumulative distribution function: \(F_X(x)=P(X\leq x)\) (= \(\sum_{x_i\leq x} P(x_i)\) if \(X\) is discrete).

201 / 377

Basic Properties of Probability Distributions

  • Recall basic rules of probability:

    • \(0\leq P(X=x)\leq 1\).

    • the sum of all the probabilities is 1, that is \(P(X\leq x_{max})=1\).

    • In particular, \(0\leq F_X(x)\leq 1\).

  • The probability distribution can be recovered from its cumulative distribution function. Indeed, for a discrete random variable \(X\), we have $$P(X=x_i)=P(X\le x_i)-P(X\le x_{i-1}),$$ where \(P(X\le x_i)=\sum_{k=1}^iP(X=x_k)\).

202 / 377

Example: Probability distribution of flipping two fair coins

Let \(X\) be the number of heads that are observed when tossing two fair coins.

  1. Construct the probability distribution for \(X\).
  2. Find \(P(X\le 1)\) and \(P(X\le 2)\).

Solution: The possible values of the number of heads are \(0\), \(1\) and \(2\). The probability distribution can be characterized by the following table:

\(X\) 0 1 2
\(P(X)\) 0.25 0.5 0.25

From the table, we may find the following cumulative distributions: $$P(X\leq 1)=P(X=0)+P(X=1)=0.25+0.5=0.75.$$ $$P(X\leq 2)= P(X=0)+P(X=1)+P(X=2)=0.25+0.5+0.25=1.$$
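
The table and the cumulative probabilities can be derived from the sample space rather than written down by hand. This sketch enumerates the four equally likely outcomes of two coin flips; the names `pmf` and `cdf` are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))  # HH, HT, TH, TT
head_counts = Counter(o.count("H") for o in outcomes)

# Probability distribution of X = number of heads.
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(head_counts.items())}

def cdf(x):
    """Cumulative distribution function F(x) = P(X <= x)."""
    return sum(p for value, p in pmf.items() if value <= x)

for x, p in pmf.items():
    print(x, p)        # 0 1/4, then 1 1/2, then 2 1/4
print(cdf(1), cdf(2))  # 3/4 1
```

The last line reproduces \(P(X\le 1)=0.75\) and \(P(X\le 2)=1\) as exact fractions.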

203 / 377

Example: Probability Histogram of an Unfair Coin

The probability distribution of an unfair coin is characterized by the following histogram.

Solution: Let \(X\) be the number of heads. From the probability histogram, we know that \(P(X=0)=0.36\), and \(P(X=1)=0.47\).

Then the probability of getting at most 1 head is $$P(X\leq 1)=P(X=0)+P(X=1)=0.36+0.47=0.83.$$

204 / 377

Practice: Probability distribution of rolling a pair of fair dice

A pair of fair dice is rolled. Let \(X\) denote the sum of the number of dots on the top faces.

  1. Construct the probability distribution of \(X\).
  2. Find the probability that \(X\) takes an odd value.
205 / 377

Practice: Probability distribution from a histogram

206 / 377

Mean and Standard Deviation of a Discrete Random Variable

  • The expected value \(E(X)\) (also called the mean and denoted by \(\mu\)) of a discrete random variable \(X\) is the number $$\mu=E(X)=\sum xP(x).$$

  • The variance \(Var(X)\) (also denoted by \(\sigma^2\)) of a discrete random variable \(X\) is the number $$\sigma^2=Var(X)=\sum (x-\mu)^2P(x).$$

  • The standard deviation \(\sigma\) of a discrete random variable \(X\) is the square root of its variance: $$\sigma=\sqrt{\sum (x-\mu)^2P(x)}.$$

207 / 377

Some Properties of Expected Value (Optional)

  • The expected value of a linear combination of two random variables \(X\) and \(Y\) is the linear combination of their expected values, that is $$E(aX+bY)=aE(X)+bE(Y).$$

  • The expected value in general is not multiplicative, that is, \(E(XY)\) need not equal \(E(X)E(Y)\).

  • If the two random variables \(X\) and \(Y\) are independent, then $$E(XY)=E(X)E(Y).$$

  • The variance can be computed using expected values: $$\mathrm{Var}(X)=E(X^{2})-(E[X])^{2}.$$

  • Let \(X\) and \(Y\) be two independent variables. Then the variance of a linear combination \(aX+bY\) equals $$Var(aX+bY)=a^2Var(X)+b^2Var(Y)$$
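These properties can be checked exactly on a small case. The sketch below assumes two independent fair dice \(X\) and \(Y\) and arbitrary constants \(a=2\), \(b=-3\) (both choices are ours, not from the slides), and uses exact fractions so that the comparisons involve no rounding.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face of a fair die

def expect(g):
    """E[g(X, Y)] for two independent fair dice X and Y."""
    return sum(g(x, y) * p * p for x, y in product(faces, faces))

def variance(g):
    """Var(g(X, Y)) via the shortcut E[g^2] - (E[g])^2."""
    return expect(lambda x, y: g(x, y) ** 2) - expect(g) ** 2

a, b = 2, -3  # arbitrary constants for the illustration
e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
var_x = variance(lambda x, y: x)
var_y = variance(lambda x, y: y)

linear_ok = expect(lambda x, y: a * x + b * y) == a * e_x + b * e_y
product_ok = expect(lambda x, y: x * y) == e_x * e_y
variance_ok = variance(lambda x, y: a * x + b * y) == a**2 * var_x + b**2 * var_y
print(e_x, var_x, linear_ok, product_ok, variance_ok)
```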

208 / 377

Example: Expected Gain

One thousand raffle tickets are sold for $2 each. Each has an equal chance of winning. First prize is $500, second prize is $300, and third prize is $100. Find the expected value of gain, and interpret its meaning.

Solution: Let \(X\) denote the net gain from purchasing one ticket. The probability distribution for \(X\) is

\(X\) 498 298 98 -2
\(P(X)\) \(\frac{1}{1000}\) \(\frac{1}{1000}\) \(\frac{1}{1000}\) \(\frac{997}{1000}\)

The expected gain is $$E(X)= 498\cdot \frac{1}{1000}+ 298\cdot\frac{1}{1000}+98\cdot \frac{1}{1000}+(-2)\cdot \frac{997}{1000}=-1.1,$$ which means that, on average, a buyer loses $1.10 per ticket purchased.
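The arithmetic can be reproduced with a short Python sketch. The variable names are ours. Exact fractions are used so the result is exactly \(-\frac{11}{10}\).

```python
from fractions import Fraction

ticket_price = 2
prizes = [500, 300, 100]
n_tickets = 1000

# Net gain for each winning ticket, then for a losing ticket.
gains = [prize - ticket_price for prize in prizes] + [-ticket_price]
# One ticket wins each prize; the remaining tickets win nothing.
probs = [Fraction(1, n_tickets)] * len(prizes) + [
    Fraction(n_tickets - len(prizes), n_tickets)
]

expected_gain = sum(g * p for g, p in zip(gains, probs))
print(float(expected_gain))
```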

209 / 377

Example: Waiting Time

The wait times (rounded to multiples of 5) in the cafeteria at a community college have the following probability distribution. Find the expected waiting time and the standard deviation.

\(X\) (minutes) 5 10 15 20 25
\(P(X)\) 0.13 0.25 0.31 0.21 0.1

Solution: The expected waiting time is $$\mu= 5\cdot 0.13 +10\cdot 0.25+15\cdot 0.31+20\cdot 0.21 + 25\cdot 0.1 = 14.5.$$ The standard deviation is then $$\scriptstyle \sigma=\sqrt{(5-14.5)^2\cdot 0.13 +(10-14.5)^2\cdot 0.25+(15-14.5)^2\cdot 0.31+(20-14.5)^2\cdot 0.21 + (25-14.5)^2\cdot 0.1}\approx 5.9.$$

210 / 377

Example: Unfair Die (1/2)

The probability distribution of an unfair die is given in the following table.

\(X\) 1 2 3 4 5 6
\(P(X)\) 0.18 0.12 \(\,?\,\) 0.14 0.23 0.17
  1. Find \(P(X=3)\).
  2. Find the mean, variance and standard deviation of this probability distribution.

Solution: Since the sum of probabilities must be 1, we know that $$P(X=3)=1-(0.18+0.12+0.14+0.23+0.17)=0.16.$$

211 / 377

Example: Unfair Die (2/2)

Solution: (Continued) The mean is the weighted sum $$\mu=0.18\cdot 1+0.12\cdot 2+0.16\cdot 3+0.14\cdot 4+0.23\cdot 5+0.17\cdot 6=3.63.$$

The variance is $$\scriptstyle \sigma^2=0.18(1-3.63)^2+0.12(2-3.63)^2+0.16(3-3.63)^2+0.14(4-3.63)^2+0.23(5-3.63)^2+0.17(6-3.63)^2=3.0331.$$

Therefore, the standard deviation is $$\sigma=\sqrt{\sigma^2}=\sqrt{3.0331}\approx 1.7416.$$

212 / 377

Practice: Lottery tickets

Seven thousand lottery tickets are sold for $5 each. One ticket will win $2,000, two tickets will win $750 each, and five tickets will win $100 each. Let \(X\) denote the net gain from the purchase of a randomly selected ticket.

  1. Construct the probability distribution of \(X\).
  2. Compute the expected value \(E(X)\) of \(X\). Interpret its meaning.
  3. Compute the standard deviation \(\sigma\) of \(X\).
213 / 377

Binomial Distribution

  • A binomial experiment is a probability experiment satisfying:

    1. The experiment has a fixed number \(n\) of independent trials.
    2. Each trial has only two possible outcomes: a success (S) or a failure (F).
    3. The probability \(p\) of a success is the same for each trial.
  • The discrete random variable \(X\) counting the number of successes in the \(n\) trials is the binomial random variable. We say \(X\) has a binomial distribution with parameters \(n\) and \(p\) and write it as \(X\sim B(n, p)\).

  • For \(X\sim B(n, p)\), the probability of getting exactly \(x\) successes in \(n\) trials is $$P(X=x)=B(x,n,p)={_n C_x} p^x(1-p)^{n-x}=\frac{n!}{(n-x)!x!}p^x(1-p)^{n-x}.$$

  • The notation \(n!=n(n-1)\cdots 1\) is read as \(n\) factorial. We set \(0!=1.\)

  • The notation \({_n C_x}=\frac{n!}{(n-x)!x!}\) is read as \(n\) choose \(x\), which is the number of ways to choose \(x\) objects from a set of \(n\) objects.
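The binomial probability formula translates almost directly into Python, since `math.comb(n, x)` computes \({_n C_x}\). This is a minimal sketch: the function name `binom_pmf` is ours, and \(n=5\), \(p=0.25\) are illustrative values.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): nCx * p^x * (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 5, 0.25
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]
print([round(v, 4) for v in pmf])
```

Summing the entries of `pmf` for \(x=0,\dots,k\) gives the cumulative probability \(P(X\le k)\).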

214 / 377

Example: Probability from drawing card multiple times (1/2)

A card is selected from a standard deck and replaced. This experiment is repeated a total of \(5\) times.

  • Find the probability of selecting exactly \(3\) clubs.
  • Find the probability of getting at least \(3\) clubs.

Solution: This is a binomial experiment. The total number of trials is \(n=5\). The desired number of successes is \(x=3\). The chance of a success is \(p=\frac{13}{52}=\frac14\). Applying the binomial probability formula, we have $$P(X=3)=\frac{5!}{3!2!} \left(\frac{1}{4}\right)^3\left(\frac34\right)^2=10\cdot\frac{9}{4^5}\approx 0.088.$$ The probability \(P(X=3)\) can also be found from the binomial distribution table or by using the Excel function BINOM.DIST(3,5,1/4,FALSE).

The probability of getting at least \(3\) clubs is $$P(X\geq 3) =1-P(X\leq 2)=1-(P(0)+P(1)+P(2))\approx 1-0.8965=0.1035.$$

215 / 377

Example: Probability from drawing card multiple times (2/2)

Solution:(continued) To calculate \(P(X\leq 2)\), we may also use the binomial distribution table or the Excel function BINOM.DIST().

Method 1: As \(n=5\) and \(p=0.25\), we use the following portion of the binomial probability table and add the entries for \(x=0, 1, 2\) in the column \(p=0.25\).

Binomial Probability Table n=5

n x p=0.1 p=0.15 p=0.2 p=0.25 p=0.3 p=0.35 p=0.4
5 0 0.5905 0.4437 0.3277 0.2373 0.1681 0.116 0.0778
5 1 0.3281 0.3915 0.4096 0.3955 0.3602 0.3124 0.2592
5 2 0.0729 0.1382 0.2048 0.2637 0.3087 0.3364 0.3456
5 3 0.0081 0.0244 0.0512 0.0879 0.1323 0.1811 0.2304

\(P(X\le 2) \approx 0.2373+0.3955+0.2637= 0.8965.\)

Method 2: In Excel, \(P(X\le 2)\) =BINOM.DIST(2,5,0.25,TRUE) \(\approx 0.8965\).

216 / 377

On a calculator, use binompdf(n, p, x) for \(P(X=x)\) and binomcdf(n, p, x) for \(P(X\leq x)\).

Practice: Find probabilities from a binomial distribution

Let \(X\) be a binomial random variable with parameters \(n = 5\), \(p=0.2\). Find the probabilities \(\quad \text{1.}\,\, P(X=3) \qquad \text{2.}\,\, P(X<3)\qquad \text{3.}\,\, P(X>3).\)

217 / 377

Practice: Machine defect rate

218 / 377

Mean and Standard Deviation of Binomial Distribution (4/5)

  • The mean of a binomial distribution of \(n\) trials is $$\mu =\sum xP(x)=\sum x\cdot \dfrac{n!}{(n-x)!x!}p^x(1-p)^{n-x} = np.$$

  • The variance of a binomial distribution of \(n\) trials is $$\sigma^2 =\sum (x-np)^2P(x)=\sum x^2P(x)-(np)^2=np(1-p).$$

  • The standard deviation of a binomial distribution of \(n\) trials is $$\sigma=\sqrt{np(1-p)}.$$

  • We consider an event \(E\) unusual if the probability \(P(E)\leq 5\%\).
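The closed forms \(\mu=np\) and \(\sigma^2=np(1-p)\) can be compared against the defining sums numerically. The sketch below does this for one illustrative choice, \(n=12\) and \(p=0.05\) (our choice of values). It is a sanity check rather than a proof.

```python
from math import comb, sqrt

def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 12, 0.05  # illustrative values
xs = range(n + 1)

# Defining sums for the mean and the variance.
mean = sum(x * binom_pmf(x, n, p) for x in xs)
variance = sum((x - mean) ** 2 * binom_pmf(x, n, p) for x in xs)

print(mean, n * p)
print(variance, n * p * (1 - p))
print(sqrt(variance))
```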

219 / 377

Example: Expected value and SD of cracked eggs

The probability that an egg in a retail package is cracked or broken is 0.02.

  1. Find the average number of cracked or broken eggs in a one dozen carton.
  2. Find the standard deviation.
  3. Is getting at least two broken eggs unusual?

Solution: Since there are 12 eggs and the chance of getting a cracked egg is 0.02, the average number of cracked eggs is $$\mu =np=12\cdot 0.02=0.24.$$

The standard deviation is $$\sigma=\sqrt{12\cdot 0.02\cdot(1-0.02)}\approx 0.4850.$$

Recall the Empirical Rule: about 95% of the data lie within 2 standard deviations of the mean. Since \(2>0.24+2\cdot 0.4850=1.21\), the chance of getting at least two cracked eggs is less than 5%, which is considered unusual.
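The Empirical Rule argument is only a rough guide for a skewed distribution like this one, so it may be worth checking the conclusion against the exact binomial probability \(P(X\ge 2)=1-P(X=0)-P(X=1)\). A minimal sketch, with variable names of our own choosing:

```python
from math import comb, sqrt

n, p = 12, 0.02  # 12 eggs, each cracked with probability 0.02

def binom_pmf(x):
    """P(X = x) for X ~ B(12, 0.02)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

mean = n * p
sd = sqrt(n * p * (1 - p))
p_at_least_two = 1 - binom_pmf(0) - binom_pmf(1)
unusual = p_at_least_two <= 0.05  # the 5% convention for "unusual"
print(round(mean, 2), round(sd, 4), round(p_at_least_two, 4), unusual)
```

The exact probability comes out near 0.023, which agrees with the conclusion that the event is unusual.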

220 / 377

Practice: Quality of grapefruit

Adverse growing conditions have caused 5% of grapefruit grown in a certain region to be of inferior quality. Grapefruit are sold by the dozen.

  1. Find the average number of inferior quality grapefruit per box of a dozen.

  2. A box that contains two or more grapefruit of inferior quality will cause a strong adverse customer reaction. Find the probability that a box of one dozen grapefruit will contain two or more grapefruit of inferior quality.

221 / 377

Practice: Mean and SD of a binomial distribution

222 / 377

Excel Functions for Binomial

Let \(X\) be a binomial random variable with parameters \(n\) and \(p\), that is \(X\sim B(n, p)\). In Excel, \(P(X=x)\) is given by BINOM.DIST(x, n, p, FALSE) and \(P(X\le x)\) is given by BINOM.DIST(x, n, p, TRUE). You may click the insert function button \(f_x\) and then search binom to find the function.

223 / 377

Practice: Find the mean of a discrete probability distribution

224 / 377

Practice: Find the standard deviation of a discrete probability distribution

225 / 377

Practice: The number of sales of new employees

A company tracks the number of sales new employees make each day during a 100-day probationary period. The results for one new employee are shown at the right.

  1. Find the probability of each outcome.
  2. Construct a probability distribution table.
  3. Find the mean of the probability distribution.
  4. Find the variance and standard deviation.

Sales per day \(x\) Number of days \(f\)
0 16
1 19
2 15
3 21
4 9
5 10
6 8
7 2
226 / 377

Practice: Probability within one SD of a discrete probability distribution

227 / 377

Practice: Probability from a poll

228 / 377

Practice: Chance of continuous successes of a type of surgery

A type of surgery has a 90% chance of success. The surgery is performed on three patients. Find the probability of the surgery being successful on exactly two patients.

229 / 377

Continuous Random Variables

230 / 377

Learning Goals for Probability and Probability Distribution

  • Demonstrate understanding of characteristics of normal distributions.

  • Calculate accurate probabilities of continuous random variables and interpret them in a variety of settings.

  • Calculate the standardized value (or \(z\)-score).

231 / 377

Probability Distribution of a Continuous Random Variable

  • The probability distribution of a continuous random variable \(X\) is characterized by its probability density function \(f(X)\) satisfying that the probability \(P(a\leq X\leq b)\) equals the area above the interval \([a, b]\) but under the graph of the density function \(f(X)\) which is also called a density curve.

232 / 377

Properties of Probability Distribution of a Continuous Random Variable

  • The probability density function \(f\) is nonnegative, that is \(f(X)\ge 0\).

  • The total area under a density curve is 1.

  • The cumulative probability \(P(X\le b)\) of a random variable \(X\) equals the area under the density curve to the left side of \(b\).

  • By the addition rule of probability, we have

    • $$P(a\le X\le b)=P(X\le b)-P(X\le a)$$
    • $$P(X\ge b)=1-P(X\le b)$$
  • As a line segment has no area, we have \(P(X\le a)=P(X< a)\) as well as \(P(X\ge b)=P(X>b)\)

233 / 377

Example: A Uniform Distribution

Let \(X\) be the amount of time that a commuter must wait for a train. Suppose \(X\) has a probability density function $$ f(X)= \begin{cases} 0.1, & 0\leq X\leq 10\\ 0, & \text{otherwise} \end{cases} $$

What is the probability that the commuter's waiting time is less than 4 minutes?

Solution: The probability \(P(X\leq 4)\) is the area under the horizontal line \(y=0.1\) to the left of \(X=4\). Since \(f(X)=0\) for \(X<0\), the area is the area of the rectangle with width 4 and height 0.1. So the probability is \(P(X\leq 4)=0.1\cdot 4=0.4\).
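The rectangle-area reasoning can be written as a small cumulative probability function. This is a sketch for the uniform density on \([0, 10]\) only. The name `uniform_cdf` and the second query, \(P(2\le X\le 7)\), are ours.

```python
def uniform_cdf(x, low=0.0, high=10.0):
    """P(X <= x) for X uniform on [low, high].

    The density is the constant 1 / (high - low), so the probability is
    the area of a rectangle of that height and width x - low.
    """
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

p_less_than_4 = uniform_cdf(4)
# P(a <= X <= b) = P(X <= b) - P(X <= a)
p_between_2_and_7 = uniform_cdf(7) - uniform_cdf(2)
print(p_less_than_4, p_between_2_and_7)
```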

234 / 377

Normal Distribution

  • A normal distribution has a density function \(f(x)=\frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}},\) where \(\mu\) is the mean, \(\sigma\) is the standard deviation, \(\pi\approx 3.14159\) and \(e\approx 2.71828\). The graph of \(f\) is called a normal curve.

  • We write \(X\sim \mathcal{N}(\mu, \sigma^2)\) for a normal random variable \(X\) with the mean \(\mu\) and the standard deviation \(\sigma\).

  • A normal distribution has the following properties:

    • The mean, median, and mode are equal.
    • The normal curve is bell shaped and symmetric with respect to the mean.
    • The total area under the curve and above the \(x\)-axis is \(1\).
    • The normal curve approaches, but never touches, the \(x\)-axis as \(x\) goes to \(\pm\infty\).
    • Between \(\mu-\sigma\) and \(\mu+\sigma\), the graph curves downward. On the left side of \(\mu-\sigma\) or the right side of \(\mu+\sigma\), the graph curves upward. A point at which the curve changes the direction of curving is called an inflection point.
235 / 377

Normal Curves with Different Means and Standard Deviations

236 / 377

The Empirical Rule for Normal Distributions

For any normal distribution, the proportion of data values within 1, 2, and 3 standard deviations away from the mean are approximately 68.3%, 95.4% and 99.7% respectively.
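The three percentages can be recovered numerically. For any normal distribution, the probability of falling within \(k\) standard deviations of the mean equals \(\operatorname{erf}(k/\sqrt{2})\), where erf is the error function available in Python's `math` module. The helper name `within_k_sd` is ours.

```python
from math import erf, sqrt

def within_k_sd(k):
    """P(mu - k*sigma < X < mu + k*sigma) for any normal X."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
```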

237 / 377

Example: Foot length (1/2)

Suppose that the foot length (in inches) of a randomly chosen adult male is a normal random variable with the mean \(\mu=11\) and the standard deviation \(\sigma=1.5\).

  • How likely is a male's foot length to be smaller than 9.5 inches?
  • How likely is a male's foot length to be bigger than 8 inches?

Solution: Let's first sketch the normal curve.

238 / 377

Example: Foot length (2/2)

Solution: (Continued)

Note that \(9.5=11-1.5=\mu-\sigma\). The probability \(P(X<9.5)\) is the shaded area on the left. The probability of getting a foot length within 1 standard deviation of the mean is about 0.683, so by the symmetry of the normal curve, $$\scriptstyle P(X<9.5)=\frac12(1-P(9.5<X<12.5))\approx\frac12(1-0.683)=0.1585.$$

Note that \(8=11-2\cdot 1.5=\mu-2\sigma\). The probability of getting a foot length within 2 standard deviations of the mean is about 0.954, so $$\scriptstyle P(X>8)=1-P(X<8)=1-\frac12(1- P(8<X<14))\approx 1-\frac12(1-0.954)=0.977.$$

239 / 377

Standard Normal Distribution

  • A normal distribution is called a standard normal distribution if the mean is \(\mu=0\) and the standard deviation is \(\sigma=1\).

  • A normal random variable can be standardized by the formula \(z=\frac{x-\mu}{\sigma}.\) We call the value \(z\) the \(Z\)-score of \(x\). In Excel, the \(Z\)-score of \(x\) can be calculated using the function STANDARDIZE().

  • Standardization preserves probability: $$P(a<X<b)=P\left(\frac{a-\mu}{\sigma}< Z < \frac{b-\mu}{\sigma}\right).$$

  • The probability \(P(Z< z)\) of a standard normal random variable \(Z\) can be found using the Excel function NORM.S.DIST(z, TRUE) or the standard normal distribution table.

  • The probability \(P(X< x)\) of a normal random variable \(X\) can be calculated using the Excel function NORM.DIST(x, mean, sd, TRUE).
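For readers working outside Excel, Python's standard library offers rough counterparts through `statistics.NormalDist`. The sketch below standardizes a value and shows that the probability is the same whether it is computed from \(Z\) or directly from \(X\). The helper `z_score` and the values \(\mu=62\), \(\sigma=4\), \(x=67\) are illustrative choices of ours.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Rough analogue of Excel's STANDARDIZE(x, mean, sd)."""
    return (x - mean) / sd

standard = NormalDist(0, 1)  # the standard normal variable Z
heights = NormalDist(62, 4)  # an illustrative X with mean 62 and sd 4

z = z_score(67, 62, 4)
p_via_z = standard.cdf(z)    # plays the role of NORM.S.DIST(z, TRUE)
p_direct = heights.cdf(67)   # plays the role of NORM.DIST(67, 62, 4, TRUE)
print(z, round(p_via_z, 4), round(p_direct, 4))
```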

240 / 377

Example: Find the Standard Score

Let \(X\) be a normal random variable with the mean \(\mu = 8\) and the standard deviation \(\sigma=2\).

  1. Find the \(Z\)-score for the value \(X=13\).
  2. Find the \(X\)-value for the \(Z\)-score \(z=-0.6\).

Solution: The \(z\)-score for the value \(X=13\) is $$z=\dfrac{x-\mu}{\sigma}=\dfrac{13-8}{2}=\dfrac{5}{2}=2.5.$$

The \(X\)-value for the \(Z\)-score \(z=-0.6\) is $$x=z\cdot\sigma+\mu=-0.6\cdot 2+8=-1.2+8=6.8.$$

241 / 377

Example: Probability of a Standard Normal Random Variable (1/2)

Let \(Z\) be a standard normal random variable.

  1. Find \(P(Z<1.21)\).
  2. Find \(P(Z\geq 1.21)\).
  3. Find \(P(0<Z\leq 1.21)\).
Z 0.00 0.01 0.02
1.2 0.8849 0.8869 0.8888
1.3 0.9032 0.9049 0.9066

Solution: To find the probability, we may use the standard normal distribution table, or the Excel function NORM.S.DIST(z,TRUE).

  1. From the table, we see that \(P(Z<1.21)\approx 0.8869\).
  2. Since the total area under the normal curve is 1, we get $$P(Z\geq 1.21)\approx 1-0.8869=0.1131.$$
  3. By symmetry, \(P(Z<0)=0.5\). Then the probability is $$P(0<Z\leq 1.21)\approx 0.8869-0.5=0.3869.$$
242 / 377

Example: Heights of 25-year-old women

The heights of 25-year-old women in a certain region are approximately normally distributed with mean 62 inches and standard deviation 4 inches. Find the probability that a randomly selected 25-year-old woman is more than 67 inches tall.

Solution: Let's first sketch the normal curve.

Z 0.04 0.05 0.06
1.2 0.8925 0.8944 0.8962

The probability is \(P(X>67)=1-P(X<67)\). To calculate \(P(X<67)\), one way is to use the standard normal distribution table. First, the \(Z\)-score is \(z=\frac{67-62}{4}=1.25\). Then \(P(X<67)=P(Z<1.25)\approx 0.8944\).

Another way is to use the Excel function NORM.DIST(67, 62, 4, TRUE).

Then \(P(X>67)\approx 1-0.8944=0.1056\).

243 / 377

Cutoff Value for a Given Tail Area

  • The \(k\)-th percentile for a random variable \(X\) is the value \(x_k\) that cuts off a left tail with the area \(k/100\), that is \(P(X<x_k)=\frac{k}{100}\), where \(0\leq k\leq 100\).

  • Let \(c\) be a nonnegative number less than or equal to 1. The \((100c)\)-th percentile for the standard normal distribution is usually denoted as \(-z_c\), that is \(P(Z<-z_c)=c\). By symmetry, \(z_c\) is the value such that \(P(Z> z_c)=c\), that is \(P(Z<z_c)=1-c\).

  • For a normal random variable \(X\) with the mean \(\mu\) and standard deviation \(\sigma\), the cutoff value \(x^*\) with a tail area \(c\) can be calculated using the standardization formula, that is, $$x^*=z^*\cdot \sigma+\mu,$$ where \(z^*\) is the cutoff \(z\)-score with the tail area \(c\), that is \(z^*=-z_c\) if \(c\) is the left-tail area and \(z^*=z_c\) if \(c\) is the right-tail area.
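The cutoff rule can be sketched in Python with `statistics.NormalDist.inv_cdf`, which plays a role similar to Excel's NORM.S.INV. The function name `cutoff` and its `tail` argument are our own conventions. The two calls use the parameter values of the two cutoff examples in this section.

```python
from statistics import NormalDist

def cutoff(tail_area, mean, sd, tail="left"):
    """Value x* that cuts off the given left- or right-tail area of N(mean, sd^2)."""
    if tail not in ("left", "right"):
        raise ValueError("tail must be 'left' or 'right'")
    # z_c satisfies P(Z > z_c) = tail_area, that is P(Z < z_c) = 1 - tail_area.
    z_c = NormalDist().inv_cdf(1 - tail_area)
    z_star = -z_c if tail == "left" else z_c
    return z_star * sd + mean

print(round(cutoff(0.05, 6, 3, tail="left"), 3))
print(round(cutoff(0.05, 60, 13, tail="right"), 2))
```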

244 / 377

Example: Cutoff Value for a Normal Random Variable

Let \(X\) be the normal random variable with mean \(6\) and standard deviation \(3\). Suppose the value \(x^*\) cuts off a left-tail area \(0.05\). Find the value \(x^*\).

Solution: One way to find the value \(x^*\) is to use the Excel function NORM.INV(0.05, 6,3): $$x^* \approx 1.065.$$

Another way is to use the standardization formula. Using the standard normal distribution table or the Excel function NORM.S.INV(0.05), we find that \(-z_{0.05}=-1.645\). Then $$x^*=-z_{0.05}\cdot 3+6=1.065.$$

z -0.04 -0.05
-1.6 0.0505 0.04947

Note: if the value \(c\) is between two cells in the standard normal distribution table, we take \(z^*\) to be the average of the two \(z\)-scores associated with the values in the two cells.

245 / 377

Example: Math Course Placement

Scores on a standardized college placement examination are normally distributed with mean 60 and standard deviation 13. Students whose scores are in the top 5% will be placed in a Calculus II course. Find the minimum score needed to be placed in a Calculus II course.

Solution: Let \(x^*\) be the minimum score. From the question, we know that \(P(X\geq x^*)=0.05\). Equivalently, \(P(X<x^*)=1-0.05=0.95\).

Using the function NORM.INV(0.95, 60, 13), we find the \(x^*\) score is $$x^*=81.38.$$

Another way is to find \(z_{0.05}\) first, then use the standardization formula. Using the standard normal distribution table or the Excel function NORM.S.INV(), you will find that the \(z\)-score is approximately \(z^*=z_{0.05}=1.64\). Then $$x^*=z^*\sigma+\mu=1.64\cdot 13+60=81.32.$$

So the minimum score needed is \(82\).

Note: The minimum score is the same as the 95th percentile.

246 / 377

Practice: Dash washing time

247 / 377

Practice: Find probabilities of a normal random variable

  1. Let \(Z\) be a standard normal random variable. Find the probabilities: $$\text{1.}\,\, P(Z<1.58)\quad \text{2.}\,\, P(-0.6<Z<1.67)\quad \text{3.}\,\, P(Z>0.19).$$

  2. Let \(X\) be a normal random variable with \(\mu=5\) and \(\sigma=2\). Find the probabilities: $$\text{1.}\,\, P(-2<X<8)\quad \text{2.}\,\, P(X>-1) \quad \text{3.}\,\, P(X<4).$$

248 / 377

Practice: Fruit weight

249 / 377

Practice: Shortest lifespan

250 / 377

Practice: Battery life

251 / 377

Practice: Sum of probabilities of two normal random variables

Let \(Z\) be a normal random variable with \(\mu=0\) and \(\sigma=1\). Let \(X\) be a normal random variable with \(\mu=4.3\) and \(\sigma=1.7\).

Determine the value of \(P(Z>1) + P(X<6)\) and explain how you find it.

252 / 377

Practice: Area under a normal curve

253 / 377

Practice: Find cutoff \(Z\)-scores

254 / 377

Practice: Numbers of Chocolate Chips in Acceptable Cookies

255 / 377

Practice: Blood Pressure

256 / 377

Lab: Excel Functions for Normal Distributions

  • Let \(Z\) be a standard normal random variable. In Excel, \(P(Z<z)\) is given by NORM.S.DIST(z, TRUE).

  • Let \(X\) be a normal random variable with mean \(\mu\) and standard deviation \(\sigma\), that is \(X\sim \mathcal{N}(\mu, \sigma^2)\). In Excel, \(P(X<x)\) is given by NORM.DIST(x, mean, sd, TRUE).

  • When a cumulative probability \(p=P(X<x)\) of a normal random variable \(X\) is given, we can find \(x\) using NORM.INV(p, mean, sd).

  • When a cumulative probability \(p=P(Z<z)\) of a standard normal random variable \(Z\) is given, we can find \(z\) using NORM.S.INV(p).

257 / 377

Sampling Distributions

258 / 377

Learning Goals for Sampling Distribution

  • Demonstrate understanding of the sampling distribution of a statistic.

  • Explain how the central limit theorem applies in inference.

  • Determine whether a sampling distribution is approximately a normal distribution.

  • Calculate key characteristics (mean, standard error) of the sampling distribution of a statistic.

  • Estimate the probability of an event using the sampling distribution.

259 / 377

Sampling Distribution

  • When using a sample statistic to estimate a population parameter, there will be a chance error: $$\text{Population Parameter}=\text{Sample Statistic}+\text{Chance Error}.$$

  • To understand the chance error, we need to know how sample statistics are distributed. Consider samples of the same size \(n\) randomly chosen from the population with replacement.

  • The probability distribution of a sample statistic is called a sampling distribution.

260 / 377

Visualization: Sampling Distribution from a Discrete Random Variable

261 / 377

Visualization: Sampling Distribution from a Continuous Random Variable

262 / 377

Sample Size Affects Standard Error

  • The sampling distribution varies as the sample size changes. In general, a larger sample size results in a smaller standard deviation of the sampling distribution.

  • The standard deviation of a sampling distribution is also called the standard error.

263 / 377

Central Limit Theorem for Mean

  • The Central Limit Theorem:

    As the sample size \(n\) increases, the sampling distribution of the sample mean, from a population with the mean \(\mu\) and the standard deviation \(\sigma\), approaches a normal distribution with the mean \(\mu_{\bar{X}}=\mu\) and the standard deviation \(\sigma_{\bar{X}}=\dfrac{\sigma}{\sqrt{n}}\).

  • Remark: In terms of standardization, the central limit theorem says that the random variable \(\bar{Z}=\dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}\) has an approximately standard normal distribution when \(n\) is large.
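A small simulation can make the theorem concrete. The sketch below assumes a skewed population (exponential with rate 1, so \(\mu=\sigma=1\)), draws many samples of size \(n=36\), and compares the mean and standard deviation of the simulated sample means with \(\mu\) and \(\sigma/\sqrt{n}\). The population, the sample size, the number of samples and the seed are all arbitrary choices of ours.

```python
import random
from statistics import fmean, pstdev

random.seed(336)  # fixed seed so the run is repeatable

n = 36              # size of each sample
num_samples = 5000  # number of simulated samples
mu = sigma = 1.0    # exponential population with rate 1: mean 1, sd 1

# One sample mean per simulated sample.
sample_means = [
    fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

print(round(fmean(sample_means), 3), mu)
print(round(pstdev(sample_means), 3), round(sigma / n ** 0.5, 3))
```

The two numbers on each printed line should be close but not identical, since the simulation itself is subject to chance error.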

264 / 377

Visualization: Central Limit Theorem for Mean

265 / 377

Required Sample Size to Apply the Central Limit Theorem for Mean

  • For most distributions (not highly skewed), when sample size \(n>30\), the sampling distribution of the sample mean \(\bar{X}\) can be approximated reasonably well by a normal distribution. The larger the sample size, the better the approximation will be.

  • When the population is normally distributed, the sampling distribution of the sample means will be normally distributed for any sample size.

  • If the population distribution is highly skewed, relying on CLT can be risky.

266 / 377

Example: Sampling Distribution of a Small Data Set (1/2)

Randomly draw samples of size 2 with replacement from the numbers 1, 3, 4.

  • List all possible samples and calculate the mean of each sample.
  • Find the mean, and standard deviation of the sample means.
  • Find the mean, and standard deviation of the population.

Solution: Using the Excel function AVERAGE(), we may find means of samples and means of sample means.

Using the Excel function STDEV.P(), we may find the standard deviation of the population and the standard deviation of sample means.

\(\color{red}{\mu}\) \(\color{red}{\sigma}\) \(\color{blue}{\mu_{\bar{X}}}\) \(\color{blue}{\sigma_{\bar{X}}}\)
2.7 1.25 2.7 0.88
sample (1,1) (1,3) (1,4) (3,1) (3,3) (3,4) (4,1) (4,3) (4,4)
\(\bar{X}\) 1 2 2.5 2 3 3.5 2.5 3.5 4

It can be verified that \(\mu_{\bar{X}}=\mu\) and \(\sigma/\sqrt{n}=1.25/\sqrt{2}\approx 0.88=\sigma_{\bar{X}}\).
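The same check can be carried out without Excel. The sketch below lists all ordered samples of size 2 drawn with replacement from 1, 3, 4 and uses `statistics.pstdev`, which like STDEV.P divides by the number of values. Variable names are ours.

```python
from itertools import product
from math import sqrt
from statistics import fmean, pstdev

population = [1, 3, 4]
n = 2

samples = list(product(population, repeat=n))  # all 9 ordered samples
sample_means = [fmean(s) for s in samples]

mu, sigma = fmean(population), pstdev(population)
mu_xbar, sigma_xbar = fmean(sample_means), pstdev(sample_means)

print(round(mu, 2), round(sigma, 2), round(mu_xbar, 2), round(sigma_xbar, 2))
print(round(sigma / sqrt(n), 2))
```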

267 / 377

Example: Sampling Distribution of a Small Data Set (2/2)

Solution: (Continued) The following are the distribution of the population and the distribution of sample means.

268 / 377

Example: Mean Length of Time on Hold

Example: Suppose the mean length of time that a caller is placed on hold when telephoning a customer service center is 23.8 seconds, with standard deviation 4.6 seconds. Find the probability that the mean length of time on hold in a random sample of 1,000 calls will be within 0.5 second of the population mean.

Solution: Since the sample size \(n=1000>30\) is large enough, by the Central Limit Theorem, we know the mean length of time is approximately normally distributed.

The mean of the sampling distribution is \(\mu_{\bar{X}}=\mu=23.8\).

The standard deviation of the sampling distribution is \(\sigma_{\bar{X}}=\dfrac{\sigma}{\sqrt{n}}=\dfrac{4.6}{\sqrt{1000}}\approx 0.15.\)

By the Excel function NORM.DIST(xbar, mean, sd, true), the probability is calculated as $$ \begin{aligned} &P(23.8-0.5<\bar{X}<23.8+0.5)\\ =&P(\bar{X}<24.3)-P(\bar{X}<23.3)\approx 0.9997-0.0003=0.9994. \end{aligned} $$

269 / 377

Example: Normal Distribution vs Sampling of Normal Distribution

Suppose speeds of vehicles on a particular stretch of roadway are normally distributed with mean 36.6 mph and standard deviation 1.7 mph.

  • Find the probability that the speed \(X\) of a randomly selected vehicle is between 35 and 40 mph.
  • Find the probability that the mean speed \(\bar{X}\) of 10 randomly selected vehicles is between 35 and 40 mph.

Solution: In this example, the population is normally distributed. So the sampling distribution of the sample mean is always normally distributed. For calculation, the Excel function NORM.DIST(X, mean, sd, true) will be used.

As \(\mu=36.6\) and \(\sigma=1.7\), the probability that the speed of a vehicle is between 35 and 40 is \(P(35< X< 40)=P(X< 40)-P(X<35)\approx 0.9772-0.1733=0.8039\)

The mean of the sampling distribution is \(\mu_{\bar{X}}=\mu=36.6\). The standard deviation of the sampling distribution is \(\sigma_{\bar{X}}=\sigma/\sqrt{n}=1.7/\sqrt{10}\approx 0.54.\) Then the probability is $$P(35<\bar{X}< 40)=P(\bar{X}< 40)-P(\bar{X}<35)\approx 1.0000-0.0015=0.9985.$$
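Both probabilities can be reproduced with `statistics.NormalDist`, which makes the contrast between a single observation and a sample mean easy to see: the only change is the standard deviation, \(\sigma\) versus \(\sigma/\sqrt{n}\). A minimal sketch with variable names of our own:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 36.6, 1.7, 10

single = NormalDist(mu, sigma)               # speed X of one vehicle
mean_of_n = NormalDist(mu, sigma / sqrt(n))  # mean speed of n vehicles

p_single = single.cdf(40) - single.cdf(35)
p_mean = mean_of_n.cdf(40) - mean_of_n.cdf(35)
print(round(p_single, 4), round(p_mean, 4))
```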

270 / 377

Sampling Distribution of a Sample Proportion

  • When working with categorical variables, we often study the proportion of a data set.

    The proportion of a specific characteristic in a data set can be viewed as the mean of the data set by identifying the specific characteristic with 1 and others with \(0\).

    Example: Consider the following data set: 1, 0, 1, 1, 0, 0, 1, 0, 1, 1. Find the proportion of 1s (the red numbers) and the mean of the data set.

    Solution: The proportion of 1s is \(\frac{6}{10}=0.6\). So is the mean: \(\frac{6\cdot 1 + 4\cdot 0}{10}=0.6\).

  • Consider a population consisting of 1s and 0s. Let \(p\) be the proportion of 1s. Then the mean is \(\mu=p\) and the standard deviation is $$\sigma=\sqrt{(1-p)^2p+(0-p)^2(1-p)}=\sqrt{p(1-p)}.$$
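Treating the ten values in the example as a whole population, a short sketch can confirm both facts: the proportion of 1s equals the mean, and the population standard deviation equals \(\sqrt{p(1-p)}\). Variable names are ours.

```python
from math import sqrt
from statistics import fmean, pstdev

data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

p = fmean(data)    # the proportion of 1s equals the mean of the 0/1 data
sd = pstdev(data)  # population standard deviation of the 0/1 data
print(p, round(sd, 4), round(sqrt(p * (1 - p)), 4))
```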

271 / 377

Central Limit Theorem for Proportion

  • For a sampling distribution of sample proportion, we write \(\hat{P}\) for the random variable of sample proportions.

  • Central Limit Theorem for Proportion:

    For large samples, the distribution of sample proportions \(\hat{P}\) is approximately normal, with the mean \(\mu_{\hat{P}}=p\) and standard deviation \(\sigma_{\hat{P}}=\sqrt{\frac{p(1-p)}{n}}\), where \(p\) is the population proportion.

272 / 377

Required Sample Size to Apply the Central Limit Theorem for Proportion

  • As a sample proportion is always between 0 and 1, and 99.7% of sample proportions lie within 3 standard deviations of the population proportion, when using the central limit theorem for proportion, we require the sample size \(n\) to satisfy the following condition: the interval \(\left[p-3\sqrt{\frac{p(1-p)}{n}}, p+3\sqrt{\frac{p(1-p)}{n}}\right]\) lies wholly in the interval \([0, 1]\).

  • In practice, if \(n\) satisfies the two inequalities \(np\ge 10\) and \(n(1-p)\ge 10\), then we consider \(n\) large enough to assume that the sampling distribution of the sample proportion is approximately normal.

  • When the population proportion \(p\) is unknown, to apply the central limit theorem for proportion, we require the sample size \(n\) to satisfy the same conditions with \(p\) replaced by the sample proportion \(\hat{p}\). That is, the sample size \(n\) should satisfy \(n\hat{p}\ge 10\) and \(n(1-\hat{p})\ge 10\).
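Both sample-size checks are easy to automate. In the sketch below, the function names are ours, the first two \((n, p)\) pairs come from the examples in this section, and the third pair \((50, 0.1)\) is a made-up case in which the conditions fail.

```python
from math import sqrt

def large_enough(n, p, threshold=10):
    """Rule of thumb: n*p >= 10 and n*(1-p) >= 10."""
    return n * p >= threshold and n * (1 - p) >= threshold

def three_sd_interval(n, p):
    """The interval p +/- 3*sqrt(p(1-p)/n), which should lie inside [0, 1]."""
    half_width = 3 * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

for n, p in [(900, 0.53), (250, 0.36), (50, 0.1)]:
    low, high = three_sd_interval(n, p)
    print(n, p, large_enough(n, p), round(low, 3), round(high, 3))
```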

273 / 377

Example: Sampling Voters

Suppose that in a population of voters in a certain region 53% are in favor of a particular law. Nine hundred randomly selected voters are asked if they favor the law.

Find the probability that the sample proportion computed from a random sample of size 900 will be at least 2% above true population proportion.

Solution: We first verify that the sampling distribution is approximately normal.

Since \(p=0.53\) and \(n=900\), \(np=900\cdot 0.53>10\) and \(n(1-p)=900(1-0.53)>10\). By the central limit theorem, the sampling distribution is approximately normal.

The standard deviation of the sampling distribution is \(\sigma_{\hat{P}}=\sqrt{\frac{0.53(1-0.53)}{900}}\approx 0.017\).

Then the probability that the random sample has a proportion at least 2% above 53% is $$P(\hat{P}>0.55)=1-P(\hat{P}\le 0.55)\approx 1-0.8803=0.1197.$$
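
For readers working outside Excel, the same calculation can be sketched in Python with `statistics.NormalDist` (Python 3.8 or later). Because the sketch keeps the unrounded standard deviation (about 0.0166) instead of 0.017, it gives a probability of roughly 0.115, slightly below the value 0.1197 above.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.53, 900
se = sqrt(p * (1 - p) / n)                  # standard deviation of the sample proportion
sampling_dist = NormalDist(mu=p, sigma=se)  # normal approximation from the CLT

prob = 1 - sampling_dist.cdf(0.55)          # P(P-hat > 0.55)
print(round(se, 4), round(prob, 4))
```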

274 / 377

Example: Traffic Accidents Caused by Distraction

Suppose that 36% of all car accidents involve injury. Find the probability that the injury rate in a random sample of 250 car accidents is between 30% and 45%.

Solution: First we verify that the sample size is large enough to assume the sample proportion is approximately normally distributed by the Central Limit Theorem.

The injury rate of all car accidents is \(p=36\%=0.36\) and the sample size is \(n=250\). Because \(np=250\cdot 0.36=90>10\) and \(n(1-p)=250\cdot(1-0.36)=160>10\), the sample size is considered large enough.

Let \(\hat{P}\) be the sample proportion of a random sample. By the Central Limit Theorem, the distribution of \(\hat{P}\) is approximately normal with the mean \(p=0.36\) and standard deviation \(\sigma_{\hat{P}}=\sqrt{\frac{p(1-p)}{n}}=\sqrt{\frac{0.36(1-0.36)}{250}}\approx 0.03\).

Using the Excel function NORM.DIST(x, mean, SD, TRUE), we find that the probability that a random sample of 250 car accidents has an injury rate between 30% and 45% is $$P(0.30<\hat{P}<0.45)=P(\hat{P}<0.45)-P(\hat{P}<0.30)\approx 0.999-0.023=0.976.$$
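
As a cross-check, here is a short Python sketch of the same "between two values" calculation, assuming the normal approximation is adequate. It uses the unrounded standard deviation, so the result (about 0.974) differs slightly from the value obtained with the rounded 0.03.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.36, 250
se = sqrt(p * (1 - p) / n)          # standard deviation of the sample proportion
dist = NormalDist(p, se)

# P(0.30 < P-hat < 0.45) as a difference of two left-tail areas
prob = dist.cdf(0.45) - dist.cdf(0.30)
print(round(se, 4), round(prob, 4))
```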

275 / 377

Practice: Sample mean within an interval

An unknown distribution has a mean of 28 and a standard deviation 6. Samples of size n = 30 are drawn randomly from the population. Find the probability that the sample mean is between 27 and 30.

276 / 377

Practice: Sample mean of GPA

The numerical population of grade point averages at a college has mean 2.61 and standard deviation 0.5. If a random sample of size 100 is taken from the population, what is the probability that the sample mean will be between 2.51 and 2.71?

277 / 377

Practice: Proportion of red candy

278 / 377

Practice: Proportion of voting

In a mayoral election, based on a poll, a newspaper reported that the current mayor received 45% of the vote. If this is true, what is the probability that a random sample of 100 voters had less than 35% voting for the current mayor?

279 / 377

Practice: Sampling Distribution of Mean with Unknown Population Distribution

A population has mean 73.5 and standard deviation 2.5.

  1. Find the mean and standard deviation of \(\bar{X}\) for samples of size 30.
  2. Find the probability that the mean of a sample of size 30 will be less than 72.
280 / 377

Practice: Sampling Distribution of Mean with Normal Population Distribution

A normally distributed population has mean 57.7 and standard deviation 12.1.

  1. Find the probability that a single randomly selected element X of the population is less than 45.
  2. Find the mean and standard deviation of \(\bar{X}\) for samples of size 16.
  3. Find the probability that the mean of a sample of size 16 drawn from this population is less than 45.
281 / 377

Practice: Cholesterol Level in Large Eggs

Suppose the mean amount of cholesterol in eggs labeled “large” is 186 milligrams, with standard deviation 7 milligrams. Find the probability that the mean amount of cholesterol in a sample of 144 eggs will be within 2 milligrams of the population mean.

282 / 377

Practice: Color Blindness Rate

Suppose that 8% of all males suffer some form of color blindness. Find the probability that in a random sample of 250 men at least 10% will suffer some form of color blindness.

283 / 377

Practice: Testing an Airline's Claim

An airline claims that 72% of all its flights to a certain region arrive on time. In a random sample of 30 recent arrivals, 19 were on time. You may assume that the normal distribution applies.

  1. Compute the sample proportion.
  2. Assuming the airline’s claim is true, find the probability of a sample of size 30 producing a sample proportion as low as the one observed in this sample.
284 / 377

Practice: Minimal Mean Weight of a Particular Fruit

285 / 377

Lab: The NORM.DIST() Function

  • Let \(X\) be a normal random variable with mean \(\mu\) and standard deviation \(\sigma\), that is \(X\sim \mathcal{N}(\mu, \sigma^2)\). In Excel, \(P(X<x)\) is given by NORM.DIST(x, mean, sd, TRUE).

  • The \(Z\)-score of a value \(x\) of a random variable can be calculated using the Excel function STANDARDIZE(x, mean, sd).

  • Let \(Z\) be a standard normal random variable. The probability \(P(Z<z)\) can be calculated using the Excel function NORM.S.DIST(z, TRUE).
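
For comparison, the three Excel functions above have close analogues in Python's standard library. The wrapper names below are hypothetical, chosen only to mirror the Excel names, and the numbers \(X\sim\mathcal{N}(100, 15^2)\), \(x=130\) are illustrative values not taken from the slides.

```python
from statistics import NormalDist

def norm_dist(x, mean, sd):
    """Analogue of NORM.DIST(x, mean, sd, TRUE): P(X < x)."""
    return NormalDist(mean, sd).cdf(x)

def standardize(x, mean, sd):
    """Analogue of STANDARDIZE(x, mean, sd): the z-score of x."""
    return (x - mean) / sd

def norm_s_dist(z):
    """Analogue of NORM.S.DIST(z, TRUE): P(Z < z) for a standard normal Z."""
    return NormalDist().cdf(z)

# Illustrative values (assumed): X ~ N(100, 15^2) and x = 130.
z = standardize(130, 100, 15)
print(z, norm_dist(130, 100, 15), norm_s_dist(z))
```

Standardizing first and then using the standard normal distribution gives the same probability as working with \(X\) directly.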

286 / 377

Confidence Intervals for Mean

287 / 377

Learning Goals for Confidence Intervals

  • Determine whether the study meets the conditions under which inferences on a population parameter may be performed.

  • Demonstrate understanding of the confidence level \(1-\alpha\).

  • Explain when and why to use the normal distribution or the t-distribution for a given study.

  • Determine the appropriate degrees of freedom associated with the t-distribution.

  • Determine the critical values using tables or Excel functions.

  • Describe how the following will affect the width of the confidence interval:

    • increasing the sample size;
    • increasing the confidence level.
  • Construct and interpret a confidence interval for one population mean.

288 / 377

Point Estimation

  • When estimating a population parameter, we may consider the statistic of a random sample as an estimate of the population parameter. But we expect some chance error.

  • Estimating an unknown parameter by a single number calculated from a sample is called point estimation. The single number (statistic) from the sample is called a point estimate.

  • A point estimate gives no indication of how reliable the estimate is or how large the error is.

289 / 377

Example: Estimating Population Proportion by a Sample Proportion

From a box of 20 pencils of two colors, black and blue, 10 pencils were randomly drawn. 6 out of the 10 pencils are black. What proportion of the pencils in the box are black?

Solution: Since the sample proportion is 0.6, one may make a point estimate that 60% of the pencils in the box, that is 12 pencils, are black. However, we don't know how close the sample proportion is to the population proportion.

290 / 377

Interval Estimation

  • To increase the chance of capturing the true value, we estimate an unknown parameter using an interval obtained by adding and subtracting a chance error from a point estimate.

  • Estimating an unknown parameter using an interval of values that likely contains the true value of the parameter is called interval estimation. The interval is called an interval estimate.

  • The reliability of an interval estimate is measured by the probability \(1-\alpha\) that the interval estimate will capture the true value of the parameter. This probability \(1-\alpha\) is called the confidence level.

  • The 90%, 95% and 99% levels of confidence are frequently used in statistical studies. The 95% level of confidence is usually the standard choice for scientific polls published in the media and online.

291 / 377

Example: Average GPA Falling in an Interval

Recall that the standard error of a statistic, denoted by SE, is the standard deviation of the sampling distribution.

A random sample of 100 students at a college has an average GPA of 3.0. How likely is it that the interval \([3.0-2\cdot\text{SE}, 3.0+2\cdot\text{SE}]\) contains the average GPA \(\mu\) of that college?

Solution: The probability that the interval \([3.0-2\cdot\text{SE}, 3.0+2\cdot\text{SE}]\) contains the population mean \(\mu\) equals the probability that the sample statistic 3.0 lies in the interval \([\mu-2\cdot\text{SE}, \mu+2\cdot\text{SE}]\). Since the sampling distribution of the sample mean is approximately normal for a sample of size 100, the interval \([\mu-2\cdot\text{SE}, \mu+2\cdot\text{SE}]\) contains about 95.5% of all sample means.

That means we can be about 95.5% confident that the interval \([3.0-2\cdot\text{SE}, 3.0+2\cdot\text{SE}]\) contains the average GPA \(\mu\) of that college.
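
The 95.5% figure is the area under the standard normal curve within 2 standard errors of the mean. A minimal Python check, assuming the sampling distribution is exactly normal:

```python
from statistics import NormalDist

Z = NormalDist()                    # standard normal distribution
coverage = Z.cdf(2) - Z.cdf(-2)     # P(-2 < Z < 2)
print(round(coverage, 4))
```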

292 / 377

Confidence Interval

  • When the sampling distribution of a statistic is approximately symmetric, we take interval estimates in the following form \([\text{Statistic}- \text{E}, \text{Statistic}+ \text{E}],\) where the value \(\text{E}\) is called the marginal error or margin of error.

  • Given a confidence level \(100(1-\alpha)\%\), the marginal error \(\text{E}\) is the value such that \(100(1-\alpha)\%\) of the intervals \([\text{Statistic}- \text{E}, \text{Statistic}+ \text{E}]\) contain the true parameter \(\mu_\text{par}\). Equivalently, the marginal error \(\text{E}\) is the value such that \(100(1-\alpha)\%\) of statistics are in the interval \([\mu_\text{par}- \text{E}, \mu_\text{par}+ \text{E}]\).

  • Denote by \(X\) the random variable for the sample statistic. Then \(\text{E}\) is determined by the following probability equation $$P(\mu_\text{par}-\text{E}< X < \mu_\text{par}+\text{E})=1-\alpha.$$

    If the distribution of \(X\) is symmetric, then the marginal error \(E\) is the value such that $$P(X-\mu_\text{par}<\text{E})=1-\alpha/2.$$

293 / 377

Visualization: Confidence Interval for Mean

294 / 377

Confidence Intervals for Mean with Known Population SD

  • Suppose the population standard deviation \(\sigma\) is given. By the central limit theorem, if \(n>30\) or the population distribution is approximately normal, then the sampling distribution is approximately normal with the standard error \(\sigma/\sqrt{n}\).

    At the confidence level \(1-\alpha\), the marginal error \(E\) for a population mean \(\mu\) is $$E=z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$ and the confidence interval is $$\left[\bar{x}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \bar{x}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right],$$ where the critical value \(z_{\alpha/2}\) satisfies that \(P(Z<z_{\alpha/2})=1-\alpha/2\) for the standard normal variable \(Z\).

  • In Excel, \(z_{\alpha/2}\)=NORM.S.INV(1-α/2). Symmetrically, \(z_{\alpha/2}\)=-NORM.S.INV(α/2).

295 / 377

Example: Find Critical Values

A sample of size 15 is drawn from a normally distributed population with standard deviation 6. Find the critical value \(z_{\alpha/2}\) needed in the construction of a confidence interval:

  1. when the level of confidence is 90%;
  2. when the level of confidence is 98%.

Solution: One can find the critical value \(z_{\alpha/2}\) by using the normal distribution table. Here, we will use the Excel function NORM.S.INV(prob)

  1. By our definition, \(1-\alpha=0.9\). Then \(\alpha=1-0.9=0.1\). Using the Excel function NORM.S.INV(1-0.1/2), we get the critical value $$z_{\alpha/2}=1.6449.$$

  2. By our definition, \(1-\alpha=0.98\). Then \(\alpha=1-0.98=0.02\). Using the Excel function NORM.S.INV(1-0.02/2), we get the critical value $$z_{\alpha/2}=2.3263.$$
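
The same critical values can be obtained in Python. In this sketch, `NormalDist().inv_cdf` plays the role of NORM.S.INV, and `z_critical` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Critical value z_{alpha/2} with P(Z < z_{alpha/2}) = 1 - alpha/2."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.98):
    print(level, round(z_critical(level), 4))
```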

296 / 377

Example: Mean GPA with Known Population SD

A random sample of 50 students from a college gives a mean GPA of 2.51. Suppose the standard deviation of GPA of all students at the college is 0.43. Construct a 99% confidence interval for the mean GPA of all students at the college.

Solution: We first gather information from the question:

  • The sample size is \(n=50\),
  • The sample mean is \(\bar{x}=2.51\),
  • The population standard deviation is \(\sigma=0.43\), and
  • The confidence level is \(1-\alpha=0.99\) which implies that \(\alpha/2=0.005\).

Now let's find the critical value and the standard error.

  • The critical value \(z_{0.005}\)=-NORM.S.INV(0.005) \(\approx 2.576\)
  • The standard error is \(\sigma_{\bar{x}}=\sigma/\sqrt{n}=0.43/\sqrt{50}\approx 0.06.\)

Then the marginal error is \(\text{E}=z_{0.005}\cdot\sigma_{\bar{x}}=2.576\cdot 0.43/\sqrt{50}\approx 0.16.\) One may conclude with 99% confidence that the true mean GPA of all students is contained in the confidence interval \([2.51-0.16, 2.51+0.16]=[2.35, 2.67]\).
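
The steps of this solution can be collected into one small function. This is a sketch under the assumption that \(\sigma\) is known and the sampling distribution is approximately normal; `z_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Confidence interval for a mean when the population SD sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value z_{alpha/2}
    margin = z * sigma / sqrt(n)                         # marginal error E
    return xbar - margin, xbar + margin

low, high = z_interval(xbar=2.51, sigma=0.43, n=50, confidence=0.99)
print(round(low, 2), round(high, 2))
```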

297 / 377

Student's \(t\)-Distribution

  • When the population standard deviation is unknown, we may replace \(\sigma\) by the sample standard deviation \(s\) and use \(s/\sqrt{n}\) as an estimate to the standard error for the sampling distribution of the sample mean.

  • When we use the estimated standard error \(s/\sqrt{n}\) to build a confidence interval, the normal distribution may NOT be appropriate for calculating the critical value.

  • Indeed, if the random variable \(\bar{x}\) is approximately normal, then the random variable \(t=\dfrac{\bar{x}-\mu}{s/\sqrt{n}}\) has a Student's \(t\)-distribution with the degree of freedom \(n-1\).

t-curves

298 / 377
  • This result was discovered by William Gosset, an employee of the Guinness brewing company, who published his result using the name Student.

  • Unlike in the case of a sample proportion, the sample standard deviation \(s\) is not determined by the sample mean \(\bar{x}\).

Properties of Student's \(t\)-Distribution

  • The \(t\)-distributions form a family of curves, called \(t\)-curves, parameterized by the degrees of freedom.

  • The \(t\)-distribution has the following important properties.

    1. Similar to the standard normal curve, it is symmetric about 0 and the total area under a \(t\)-curve is 1.
    2. The \(t\)-distribution has slightly more variation (i.e. \(t\)-curves are slightly “fatter”) than the standard normal distribution.
    3. When the degree of freedom increases, the \(t\)-distribution becomes closer to the standard normal distribution.
  • In practice, when the sample size is large enough (\(n>30\)), people use the normal distribution as an approximation for the Student's \(t\)-distribution.

299 / 377

Visualization: \(t\)-distributions

300 / 377

Confidence Intervals for a Mean with Unknown Population SD

  • Suppose the sampling distribution is approximately normal. At the confidence level \(1-\alpha\), the margin of error is $$E=t_{\alpha/2}\frac{s}{\sqrt{n}},$$ and the confidence interval for a population mean \(\mu\) is $$\left[\bar{x}-t_{\alpha/2}\frac{s}{\sqrt{n}}, \bar{x}+t_{\alpha/2}\frac{s}{\sqrt{n}}\right],$$ where \(t_{\alpha/2}\) is the critical value such that \(P(T<t_{\alpha/2})=1-\alpha/2\) for a Student \(t\)-distribution with degree of freedom \(n-1\).

  • In Excel, the critical value \(t_{\alpha/2}\) can be calculated by T.INV(1-α/2, n-1) or T.INV.2T(α, n-1), where \(n\) is the sample size.

301 / 377

Example: Critical Values for \(t\)-Distributions

A sample of size 15 is drawn from a normally distributed population. Find the critical value \(t_{\alpha/2}\) needed in the construction of a confidence interval:

  1. when the level of confidence is 99%;
  2. when the level of confidence is 95%.

Solution: To find the critical value \(t_{\alpha/2}\), we may use the Excel function T.INV(left tail area, df) or T.INV.2T(tail areas, df).

  1. Since \(1-\alpha=0.99\), we have \(\alpha/2=(1-0.99)/2=0.005\) and \(t_{0.005}\) =T.INV.2T(0.01, 14)=-T.INV(0.005, 14)=T.INV(1-0.005, 14)=2.9768.

  2. Since \(1-\alpha=0.95\), we have \(\alpha/2=(1-0.95)/2=0.025\) and \(t_{0.025}\) =T.INV.2T(0.05, 14)=-T.INV(0.025, 14)=T.INV(1-0.025, 14)=2.1448.
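
Python's standard library has no built-in \(t\)-distribution, so the sketch below rebuilds a rough analogue of T.INV from the \(t\) density, using Simpson's rule for the area and bisection for the inverse. The helper names are hypothetical and the numerical method is an illustrative choice, not how Excel computes these values; with 14 degrees of freedom it should reproduce the two critical values above to about four decimal places.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    const = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return const * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=1000):
    """P(T < t): Simpson's rule on [0, |t|] plus symmetry about 0 (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                      # area under the curve between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def t_inv(p, df):
    """Rough analogue of T.INV(p, df) for 0 < p < 1, found by bisection."""
    if p < 0.5:
        return -t_inv(1 - p, df)              # symmetry of the t-curve
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:                  # widen until the answer is bracketed
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_critical(confidence, df):
    """Critical value t_{alpha/2}, i.e. T.INV(1 - alpha/2, df)."""
    alpha = 1 - confidence
    return t_inv(1 - alpha / 2, df)

for level in (0.99, 0.95):
    print(level, round(t_critical(level, 14), 4))
```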

302 / 377

Example: Confidence Interval with Unknown Population SD

A sample of size 16 is randomly drawn from a normally distributed population. The sample has a mean 79 and standard deviation 7. Construct a confidence interval for that population mean at the 90% level of confidence.

Solution: Since the population is normally distributed, and the population standard deviation is unknown, we apply the formula \(\text{E}=t_{\alpha/2}\cdot\dfrac{s}{\sqrt{n}}\) for marginal error.

Since the sample size is 16, the degree of freedom is df=15.

At 90% confidence level, \(\alpha=1-0.9=0.1\).

From the table or using the function T.INV.2T(0.1, 15), we find that \(t_{0.05}\approx 1.753\).

Then the marginal error is \(\text{E}=1.753\cdot 7/\sqrt{16}\approx 3\). Thus \(\bar{x}-\text{E}=79-3=76\) and \(\bar{x}+\text{E}=79+3=82\).

With 90% confidence, we may conclude that the true population mean is contained in the interval \([76, 82]\).

303 / 377

Example: Average Working Hours in Grocery Stores

The data below show the numbers of hours worked by 40 randomly selected employees from several grocery stores in the county.

30 26 33 26 26 33 31 31 21 37 27 20 34 35 30 24 38 34 39 31
22 30 23 23 31 44 31 33 33 26 27 28 25 35 23 32 29 31 25 27

Construct a 99% confidence interval for the mean number of hours worked.

Solution: Since the sample size is 40 (>30), by the central limit theorem, the sample mean is approximately normally distributed. Applying the Excel functions AVERAGE() and STDEV.S() to the data, we find \(\bar{x}\approx 29.6\) and \(s\approx 5.3\).

Since \(df=40-1=39\) and \(\alpha=1-0.99=0.01\), the critical value is \(t_{0.005}\approx 2.708\) and the marginal error is \(\text{E}=2.708\cdot 5.3/\sqrt{40}\approx 2.3\). Thus \(\bar{x}-\text{E}=29.6-2.3=27.3\) and \(\bar{x}+\text{E}=29.6+2.3=31.9\).

With 99% confidence, one may conclude that the true mean number of hours worked is contained in the interval \([27.3, 31.9]\).
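
The whole calculation can also be carried out directly from the raw data. The Python sketch below uses `statistics.mean` and `statistics.stdev` in place of AVERAGE() and STDEV.S(), and approximates the \(t\) critical value numerically (Simpson's rule plus bisection, an illustrative choice; all helper names are hypothetical). The result should land close to the interval \([27.3, 31.9]\) found above.

```python
from math import exp, lgamma, pi, sqrt
from statistics import mean, stdev

hours = [30, 26, 33, 26, 26, 33, 31, 31, 21, 37, 27, 20, 34, 35, 30, 24, 38, 34, 39, 31,
         22, 30, 23, 23, 31, 44, 31, 33, 33, 26, 27, 28, 25, 35, 23, 32, 29, 31, 25, 27]

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    const = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return const * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=1000):
    """P(T < t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(confidence, df):
    """Critical value t_{alpha/2}, found by bisection on the left-tail area."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:             # widen until the answer is bracketed
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n = len(hours)
xbar, s = mean(hours), stdev(hours)           # sample mean and sample SD
t_star = t_critical(0.99, n - 1)
margin = t_star * s / sqrt(n)                 # marginal error E
low, high = xbar - margin, xbar + margin
print(round(xbar, 1), round(s, 1), round(t_star, 3))
print(round(low, 1), round(high, 1))
```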

304 / 377

Choose Between Normal Distribution and \(t\)-Distribution

  • Population is approximately normally distributed.

    • the population standard deviation \(\sigma\) is known: use the normal distribution.
    • the population standard deviation \(\sigma\) is unknown: use the \(t\)-distribution.
  • Population distribution unknown, but sample size is large enough, i.e. \(n>30\).

    • the population standard deviation \(\sigma\) is known: use normal distribution.
    • the population standard deviation \(\sigma\) is unknown: either one can be used but the \(t\)-distribution is more accurate.
  • Warning: When the population distribution is unknown and the sample size is small, neither the \(t\)-distribution nor the normal distribution is reliable.

  • For small samples, there is a method called the Shapiro–Wilk test, which can be used to assess whether the data are consistent with a normally distributed population.

  • Even when \(n>30\), we should visually inspect the data (using a histogram, for example) for normality.

305 / 377

Practice: Conceptual Questions on Confidence Intervals

Decide whether the following statements are true or false. Explain your reasoning.

  • The statement, "the 95% confidence interval for the population mean is (350, 400)" means that 95% of the population values are between 350 and 400.
  • For a given standard error, lower confidence levels produce wider confidence intervals.
  • If you increase sample size, the width of confidence intervals will increase.
  • If you take large random samples over and over again from the same population, and make 95% confidence intervals for the population average, about 95% of the intervals should contain the population average.
306 / 377

Practice: Confidence Interval sigma known SAT Scores

307 / 377

Practice: Find the Critical Value

308 / 377

Practice: Find the Marginal Error

309 / 377

Practice: How Much Alcohol Do College Students Drink

A statistics student is curious about drinking habits of students at his college. He wants to estimate the mean number of alcoholic drinks consumed each week by students at his college. He plans to use a 90% confidence interval. He surveys a random sample of 71 students. The sample mean is 3.93 alcoholic drinks per week. The sample standard deviation is 3.78 drinks.

310 / 377

Practice: Estimating Average Distance from Home to Workplace

Four hundred randomly selected working adults in a certain state, including those who worked at home, were asked the distance from their home to their workplace. The average distance was 8.84 miles with standard deviation 2.70 miles.

Construct a 98% confidence interval for the mean distance from home to work for all residents of this state.

311 / 377

Practice: Estimate Mean Lifetime

City planners wish to estimate the mean lifetime of the most commonly planted trees in urban settings. A sample of 16 recently felled trees yielded mean age 32.7 years with standard deviation 3.1 years. Assuming the lifetimes of all such trees are normally distributed, construct a 99.8% confidence interval for the mean lifetime of all such trees.

312 / 377

Practice: Confidence Interval from a small data set

313 / 377

Lab: Excel Functions for \(t\)-Distributions

Suppose a Student's \(t\)-distribution has the degree of freedom \(\text{df}=n-1\).

  • Find a probability for a given \(t\)-value.

    • The area of the left tail of the \(t\)-value may be calculated by the function T.DIST(t,df,true).

    • The area of the right tail of the \(t\)-value may be calculated by the function T.DIST.RT(t,df).

    • The area of two tails of the \(t\)-value (here \(t\)>0) may be calculated by the function T.DIST.2T(t,df).

  • Find the critical value for a given probability \(p\).

    • When the area of the left tail is given, the function T.INV(p,df) may be used.

    • When the area of both tails is given, the function T.INV.2T(p,df) may be used. This function is convenient for constructing confidence intervals.
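
To see how the three tail areas relate to one another, here is a Python sketch with rough stand-ins for T.DIST, T.DIST.RT and T.DIST.2T. The function names are hypothetical, and the areas are approximated with Simpson's rule, so small numerical differences from Excel are possible. The value \(t=2.1448\) with 14 degrees of freedom is the 95% critical value from the earlier example.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    const = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return const * (1 + t * t / df) ** (-(df + 1) / 2)

def t_dist(t, df, steps=1000):
    """Left-tail area P(T < t), like T.DIST(t, df, TRUE); steps must be even."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                      # area between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def t_dist_rt(t, df):
    """Right-tail area P(T > t), like T.DIST.RT(t, df)."""
    return 1 - t_dist(t, df)

def t_dist_2t(t, df):
    """Two-tail area P(|T| > t) for t > 0, like T.DIST.2T(t, df)."""
    return 2 * t_dist_rt(abs(t), df)

t, df = 2.1448, 14
print(round(t_dist(t, df), 4), round(t_dist_rt(t, df), 4), round(t_dist_2t(t, df), 4))
```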

314 / 377

Confidence Intervals for Proportions

315 / 377

Learning Goals for Confidence Intervals for Proportions

  • Construct and interpret a confidence interval for one population proportion.

  • Describe how the following will affect the width of the confidence interval:

    • increasing the sample size;

    • increasing the confidence level.

316 / 377

Confidence Intervals for a Proportion

  • Recall that the standard error of sample proportions is \(\sigma_{\hat{P}}=\sqrt{\frac{p(1-p)}{n}}\), where \(n\) is the sample size and \(p\) is the population proportion. As a consequence, when the population proportion \(p\) is unknown and has to be estimated, we only have a point estimate $$\hat{\sigma}_{\hat{p}}=\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$$ for the standard error of sample proportions, that is, $$\sigma_{\hat{p}}\approx\hat{\sigma}_{\hat{p}}=\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}.$$

  • Based on the central limit theorem, when \(n\) is large enough, at the \(100(1-\alpha)\%\) level, the margin of error for \(p\) is defined as $$E=z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$ and the confidence interval for \(p\) is defined by $$\left[\hat{p}-z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right],$$ where the critical value \(z_{\alpha/2}\) satisfies that \(P(Z<-z_{\alpha/2})=\alpha/2\) for the standard normal variable \(Z\). In Excel, \(z_{\alpha/2}\)=-NORM.S.INV(α/2)=NORM.S.INV(1-α/2).

  • The sample size \(n\) is considered large enough if \(n\hat{p}\ge 10\) and \(n(1-\hat{p})\ge 10\).

  • The above defined confidence interval is known as the normal approximation (or Wald's) confidence interval. It is popular in introductory statistics books. However, it is unreliable when the sample size is small or the sample proportion is close to 0 or 1. Indeed, if the sample proportion is 0 or 1, the confidence interval defined here will have zero length.

317 / 377

By the central limit theorem, the random variable \(\hat{p}\) is approximately normally distributed. The chance that \(p\in \left[\hat{p}-z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]\) is approximately the same as the chance that \(\hat{p}\in \left[p-z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}, p+z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\right]\). That shows that \(z_{\alpha/2}\) satisfies $$P\left(-z_{\alpha/2}<\dfrac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}}<z_{\alpha/2}\right)=1-\alpha.$$

Example: Estimating the Proportion of Students Taking Busses (1/2)

In a random sample of 100 students in college, 65 said that they come to college by bus.

  1. Give a point estimate of the proportion of all students who come to college by bus.

  2. Construct a 99% confidence interval for that proportion.

Solution: A good point estimate would be a sample proportion. Here the sample proportion is \(\hat{p}=65/100=0.65\).

As \(n\hat{p}=100\cdot 0.65=65>10\) and \(n(1-\hat{p})=100\cdot 0.35=35>10\), which implies the sample is large enough, approximately the standard error is $$\hat{\sigma}_{\hat{P}}=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=\sqrt{\frac{0.65(1-0.65)}{100}}\approx0.048.$$

318 / 377

Example: Estimating the Proportion of Students Taking Busses (2/2)

Solution: (Continued) At 99% level of confidence, the value \(\alpha=1-0.99=0.01\) and \(\alpha/2=0.01/2=0.005\). The critical value \(z_{\alpha/2}\) is determined by the equation \(P(Z<-z_{0.005})=0.005\) (or equivalently \(P(Z<z_{0.005})=1-0.005\)). Using the Excel function -NORM.S.INV(α/2) (or NORM.S.INV(1-α/2)), we find the critical value \(z_{0.005}\approx 2.576\).

Thus the marginal error is $$E=z_{\alpha/2}\cdot \hat{\sigma}_{\hat{P}}\approx 2.576\cdot 0.0477\approx 0.123,$$ and the confidence interval at 99% level is $$[\hat{p}-E, \hat{p}+E]\approx [0.65-0.123, 0.65+0.123]=[0.527, 0.773].$$

Conclusion: we are 99% confident that the true proportion of all students at the college who take the bus lies in the interval \([0.527, 0.773]\).
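
The normal approximation (Wald) interval used in this example can be packaged as a short function. This is a sketch only; `wald_interval` is a hypothetical helper that also enforces the sample-size rule \(n\hat{p}\ge 10\) and \(n(1-\hat{p})\ge 10\).

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value z_{alpha/2}
    margin = z * sqrt(p_hat * (1 - p_hat) / n)           # marginal error E
    return p_hat - margin, p_hat + margin

low, high = wald_interval(65, 100, 0.99)   # 65 of 100 students come by bus
print(round(low, 3), round(high, 3))
```

Raising an error for small samples is a design choice of this sketch; it reflects the warning that the interval is unreliable when the sample is small or \(\hat{p}\) is close to 0 or 1.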

319 / 377

Example: Estimating a Population Proportion (1/2)

Foothill College’s athletic department wants to calculate the proportion of students who have attended a women’s basketball game at the college. They use student email addresses, randomly choose 220 students, and email them. Of the 145 who responded, 22 had attended a women’s basketball game.

Calculate and interpret the approximate 90% confidence interval for the proportion of all Foothill College students who have attended a women’s basketball game.

Solution: Although 220 students were surveyed, only 145 responded. So the sample consists of those 145 students. The sample proportion is \(\hat{p}=\frac{22}{145}\approx 0.152\).

The estimated standard error is $$\hat{\sigma}_{\hat{P}}=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=\sqrt{\frac{0.152\cdot(1-0.152)}{145}}\approx 0.03.$$

320 / 377

Example: Estimating a Population Proportion (2/2)

Solution: (Continued) At the 90% confidence level, \(\alpha=1-0.9=0.1\) and the critical value is \(z_{\alpha/2}\)=NORM.S.INV(1-0.1/2) \(\approx 1.645\).

Therefore, the marginal error is $$E=z_{\alpha/2}\cdot\hat{\sigma}_{\hat{P}}\approx 1.645\cdot0.03\approx 0.049,$$ and the confidence interval is $$[\hat{p}-E, \hat{p}+E]\approx[0.152-0.049, 0.152+0.049]=[0.103, 0.201].$$

Conclusion: we are 90% confident that the proportion of all Foothill College students who have attended a women’s basketball game is between 0.103 and 0.201.

321 / 377

Factors Affecting the Width of Confidence Intervals

  • The width of a confidence interval, which equals twice the margin of error, gives a measure of the precision of the estimation.

  • Recall, for population proportion and mean, $$\text{Marginal Error} = \text{Critical Value}\cdot \frac{\text{(estimated) Population SD}}{\sqrt{\text{Sample Size}}}$$

  • The formula tells us the precision of a confidence interval is affected by the confidence level, the variability, and the sample size.

    • Larger confidence levels give larger critical values and hence larger errors.

    • Populations (and samples) with more variability give larger errors.

    • Larger sample sizes give smaller errors.
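
These three effects can be seen by evaluating the marginal error formula for a few inputs. The Python sketch below uses the normal critical value throughout; the standard deviation 10 and the sample sizes are assumed, illustrative numbers that do not come from the slides.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sd, n, confidence):
    """Marginal error E = z_{alpha/2} * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sd / sqrt(n)

# Illustrative (assumed) numbers: SD = 10.
for confidence in (0.90, 0.95, 0.99):       # higher confidence level, larger error
    print(confidence, round(margin_of_error(10, 100, confidence), 3))
for n in (25, 100, 400):                    # larger sample, smaller error
    print(n, round(margin_of_error(10, n, 0.95), 3))
```

Because of the \(\sqrt{n}\) in the denominator, quadrupling the sample size only halves the marginal error.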

322 / 377

Sample Size Determination

  • In practice, we may desire a marginal error of \(E\). With a fixed confidence level \(100(1-\alpha)\%\), the larger the sample size the smaller the marginal error.

  • When estimating population proportion, if we can produce a reasonable guess \(\hat{p}\) for population proportion, then an appropriate minimum sample size for the study is determined by $$n=\left(\frac{z_{\alpha/2}}{{E}}\right)^2\cdot \hat{p}(1-\hat{p}).$$

  • When estimating population mean, if we can produce a reasonable guess \(\sigma\) for the population standard deviation, then an appropriate minimum sample size is given by $$n=\left(\dfrac{z_{\alpha/2}\cdot \sigma}{{E}}\right)^2.$$

323 / 377

Example: Minimum Sample Size - Error in Proportion

Suppose you want to estimate the proportion of students at QCC who live in Queens. By surveying your classmates, you find around 70% live in Queens. Use this as a guess to determine how many students would need to be included in a random sample if you wanted the margin of error for a 95% confidence interval to be less than or equal to 2%.

Solution: We may use \(\hat{p}=0.7\) as a reasonable guess for the population proportion.

At the 95% level, the critical value is \(z_{0.025}=\) NORM.S.INV(1-0.025) \(\approx 1.96\).

The marginal error is \(E=0.02\).

Then the appropriate minimal sample size is determined by $$n=\left(\frac{z_{\alpha/2}}{{E}}\right)^2\cdot \hat{p}(1-\hat{p})=(1.96/0.02)^2\cdot 0.7\cdot(1-0.7)=2016.84.$$

Since the sample size has to be an integer, to get a error no more than 2% at the level 95%, the minimal sample size should be at least 2017.
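
The same calculation can be sketched in Python. The function name `min_sample_size_proportion` is a hypothetical helper, and the unrounded critical value is used instead of 1.96, which does not change the rounded-up answer here.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size_proportion(p_guess, margin, conf_level):
    """Smallest integer n with z_{alpha/2} * sqrt(p(1-p)/n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return ceil((z / margin) ** 2 * p_guess * (1 - p_guess))

n_needed = min_sample_size_proportion(p_guess=0.7, margin=0.02, conf_level=0.95)
print(n_needed)  # 2017

# With no prior guess, p = 0.5 maximizes p(1-p) and gives the most cautious n.
n_cautious = min_sample_size_proportion(p_guess=0.5, margin=0.02, conf_level=0.95)
print(n_cautious)
```

Using 0.5 as the guess is a common fallback when no preliminary estimate is available, since it yields the largest required sample size.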

324 / 377

Example: Minimum Sample Size - Error in Mean

Find the minimum sample size necessary to construct a 99% confidence interval for the population mean with a margin of error \(E =0.2\). Assume that the estimated population standard deviation is \(\sigma=1.3\).

Solution: At the 99% level, we have \(\alpha/2=(1-0.99)/2=0.005\).

The critical value is \(z_{0.005}=\) NORM.S.INV(1-0.005) \(\approx 2.576\).

The desired margin of error is \({E}=0.2\).

The estimated population standard deviation is \(\sigma=1.3\).

Then the minimum sample size is approximately $$n=\left(\dfrac{z_{\alpha/2}\cdot \sigma}{{E}}\right)^2\approx (2.576\cdot 1.3/0.2)^2 \approx 280.4.$$

To get an error of no more than 0.2 at the 99% level, the sample size should be at least 281.
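
As a check on the arithmetic, here is a short Python sketch of the sample-size formula for a mean. The helper name `min_sample_size_mean` is an assumption made for illustration.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size_mean(sigma, margin, conf_level):
    """Smallest integer n with z_{alpha/2} * sigma / sqrt(n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return ceil((z * sigma / margin) ** 2)

n_example = min_sample_size_mean(sigma=1.3, margin=0.2, conf_level=0.99)
print(n_example)  # 281

# Halving the margin of error roughly quadruples the required sample size.
n_half_margin = min_sample_size_mean(sigma=1.3, margin=0.1, conf_level=0.99)
print(n_half_margin)
```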

325 / 377

Practice: Conceptual Questions on Confidence Intervals

Decide whether the following statements are true or false. Explain your reasoning.

  • The statement, "the 95% confidence interval for the population mean is (350, 400)" means that 95% of the population values are between 350 and 400.
  • For a given standard error, lower confidence levels produce wider confidence intervals.
  • If you increase sample size, the width of confidence intervals will increase.
  • If you take large random samples over and over again from the same population, and make 95% confidence intervals for the population average, about 95% of the intervals should contain the population average.
326 / 377

Practice: Find Confidence Interval of Proportion of Kids

327 / 377

Practice: Confidence Intervals for a Population Proportion

To understand the reason for returned goods, the manager of a store examines the records on 40 products that were returned in the last year. Reasons were coded by 1 for “defective,” 2 for “unsatisfactory,” and 0 for all other reasons, with the results shown in the table.

0 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 2 0 2 0 0 0 0 0 0 2 0 0
  1. Give a point estimate of the proportion of all returns that are because of something wrong with the product, that is, either defective or performed unsatisfactorily.

  2. Construct an 80% confidence interval for the proportion of all returns that are because of something wrong with the product.

328 / 377

Practice: Find Confidence Interval of Proportion Given Table

329 / 377

Practice: Minimum Sample Size - Mean

A software engineer wishes to estimate, to within 5 seconds, the mean time that a new application takes to start up, with 95% confidence. Estimate the minimum size sample required if the standard deviation of start up times for similar software is 12 seconds.

330 / 377

Practice: Minimum Sample Size - Proportion

The administration at a college wishes to estimate, to within two percentage points, the proportion of all its entering freshmen who graduate within four years, with 90% confidence. Estimate the minimum size sample required.

331 / 377

Lab: Excel Functions for Normal Distributions

  • Let \(Z\) be a standard normal random variable. In Excel, \(P(Z<z)\) is given by NORM.S.DIST(z,TRUE).

  • Let \(X\) be a normal random variable with mean \(\mu\) and standard deviation \(\sigma\), that is \(X\sim \mathcal{N}(\mu, \sigma^2)\). In Excel, \(P(X<x)\) is given by NORM.DIST(x,mean,sd,TRUE).

  • When a cumulative probability \(p=P(X<x)\) of a normal random variable \(X\) is given, we can find \(x\) using NORM.INV(p,mean,sd).

  • When a cumulative probability \(p=P(Z<z)\) of a standard normal random variable \(Z\) is given, we can find \(z\) using NORM.S.INV(p).
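
For readers working outside Excel, the four functions above have close counterparts in Python's standard library class `statistics.NormalDist`. The sketch below is only an illustration; the mean 100 and standard deviation 15 are made-up values.

```python
from statistics import NormalDist

Z = NormalDist()                  # standard normal, mean 0 and SD 1
X = NormalDist(mu=100, sigma=15)  # hypothetical normal variable

p_z = Z.cdf(1.96)            # like NORM.S.DIST(1.96, TRUE)
p_x = X.cdf(115)             # like NORM.DIST(115, 100, 15, TRUE)
x_from_p = X.inv_cdf(p_x)    # like NORM.INV(p, 100, 15)
z_from_p = Z.inv_cdf(0.975)  # like NORM.S.INV(0.975)

print(f"P(Z < 1.96) = {p_z:.4f}")
print(f"P(X < 115)  = {p_x:.4f}")
print(f"x with that cumulative probability = {x_from_p:.1f}")
print(f"z with P(Z < z) = 0.975: {z_from_p:.3f}")
```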

332 / 377

Concepts of Hypothesis Testing

333 / 377

Learning Goals for Hypothesis Tests

  • Choose appropriate null and alternative hypotheses.

  • Determine whether the test should be one-sided or two-sided.

  • Calculate \(Z\)-test statistics and \(T\)-test statistics.

  • Calculate the \(P\)-value or the rejection region.

  • Determine whether to reject or fail to reject the null hypothesis.

  • Interpret the results of a test of significance in context.

334 / 377

The Basic Idea of Hypothesis Testing

  • The testing procedure starts with an initial assumption that a statement about a population parameter is true.

  • We test this initial assumption using a random sample. If the initial assumption is really the truth, then the test statistic from a random sample shouldn't be too far away from the center of the sampling distribution. Conversely, if the test statistic is too far away from the center, then we should not believe in the initial assumption.

  • To determine how far is too far, we specify in advance a threshold probability (the significance level), or equivalently a critical value.

  • If the test statistic is at least as extreme as the critical value, then the result is significant enough to allow us to reject the initial assumption. Otherwise, we cannot draw a definite conclusion.

  • The threshold probability is the chance of wrongly rejecting the initial assumption when it is in fact true.

335 / 377

Two Hypotheses

  • A statistical hypothesis is a statement about a population parameter.

  • A hypothesis test is a process that uses sample statistics to test a hypothesis.

  • To test a population parameter, we choose a pair of hypotheses, the null hypothesis and the alternative hypothesis which are contradictory to each other.

  • The null hypothesis, denoted by \(H_0\), is the statement about the population parameter that is assumed to be true.

  • The alternative hypothesis, denoted \(H_a\), is a statement about the population parameter that is contradictory to the null hypothesis.

336 / 377

Example: Identify the Null and the Alternative Hypotheses

  1. Test a statement that the population mean is 1.
  2. Test a statement that the population mean is more than 3.
  3. Test a statement that the population mean is no more than 3.

Solution: Keep in mind that the null hypothesis should always contain the equal sign. The alternative hypothesis is contrary to the null hypothesis.

  1. We may set the null hypothesis as \(H_0\): \(\mu = 1\) and the alternative hypothesis as \(H_a\): \(\mu\ne 1\).
  2. We may set the null hypothesis as \(H_0\): \(\mu = 3\) and the alternative hypothesis as \(H_a\): \(\mu>3\). Here the statement being tested is \(H_a\).
  3. We may set the null hypothesis as \(H_0\): \(\mu \leq 3\) and the alternative hypothesis as \(H_a\): \(\mu>3\). Here the statement being tested is \(H_0\).
337 / 377

The Logic of Hypothesis Testing

The logic of hypothesis testing and two types of error can be summarized in the following table.

                  | H0 is true       | H0 is false
Reject H0         | Type I Error     | Correct decision
Fail to Reject H0 | Correct decision | Type II Error

The interpretation of hypothesis testing is summarized in the following table.

                  | Testing statement is H0 | Testing statement is Ha
Reject H0         | There is enough evidence to reject the statement | There is enough evidence to support the statement
Fail to Reject H0 | There is not enough evidence to reject the statement | There is not enough evidence to support the statement
338 / 377

Types of Errors in Hypothesis Testing

  • Rejecting the null hypothesis when it is in fact true is called a type I error. The maximum allowable probability of making a type I error is called the level of significance, denoted by \(\alpha\).

  • Failing to reject the null hypothesis when it is false is called a type II error. The probability of a type II error is usually denoted by \(\beta\). The power of a hypothesis test, which equals \(1-\beta\), is the probability of rejecting the null hypothesis when it is false.

339 / 377

$$\alpha=P(\text{Type I error})= P(\text{reject a true }H_0).$$
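
The meaning of \(\alpha\) as the probability of rejecting a true \(H_0\) can be illustrated by simulation. The sketch below is a hypothetical experiment, not part of the course material: it repeatedly samples from a population where \(H_0:\mu=0\) really holds (with known \(\sigma=1\)) and counts how often a two-tailed \(Z\)-test rejects.

```python
import random
from math import sqrt
from statistics import NormalDist

def type_one_error_rate(alpha=0.05, n=30, trials=2000, seed=1):
    """Fraction of samples from a true-H0 population for which H0 is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(n)]  # H0: mu = 0 is true
        z = (sum(sample) / n) / (1 / sqrt(n))          # standardized sample mean
        if abs(z) >= z_crit:
            rejections += 1
    return rejections / trials

rate = type_one_error_rate()
print(f"observed type I error rate: {rate:.3f}")
```

The observed rate should land near the chosen \(\alpha\), with some fluctuation because only finitely many samples are drawn.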

Types of Tests

  • If \(H_a\) has the form \(\mu\neq \mu_0\) the test is called a two-tailed test.

  • If \(H_a\) has the form \(\mu<\mu_0\) the test is called a left-tailed test.

  • If \(H_a\) has the form \(\mu>\mu_0\) the test is called a right-tailed test.

  • Each of the last two forms is also called a one-tailed test.

340 / 377

Rejection Region and Critical Value

  • The sample statistic used to test the assumption is called a test statistic.

  • A rejection region is the set of values of the test statistic that are unlikely to occur if the null hypothesis is true.

Sign in \(H_a\)  | \(\ne\)    | \(<\)     | \(>\)
Rejection region | Both sides | Left side | Right side

  • A critical value is a value that separates the rejection region from its complement. The calculation depends on the sampling distribution of the test statistic.

See the interactive demonstration for rejection regions for hypothesis tests.

  • If a test statistic falls in the rejection region, then we reject the null hypothesis.
341 / 377

Visualization: Rejection Region and Critical Value

Illustration of Rejection Region and Critical Value

342 / 377

Computational Remarks

  • Based on the central limit theorem, when testing a hypothesis on mean or proportion, we will use either a (standard) normal distribution or a Student’s t-distribution.

  • The critical value(s) at the significance level \(\alpha\) can be calculated in two steps:

    1. Find the standard critical value.
    2. Apply the (inverse) standardization formula $$\text{Critical value}= \text{Mean}\pm\text{Std. critical value}\cdot\text{SE}.$$
  • Alternatively, to make a decision using a rejection region, it is usually more convenient to compare the standardized test statistic with the standard critical value: $$\text{Std. test statistic}=\frac{\text{Test statistic} - \text{Mean}}{\text{SE}}.$$

343 / 377

Standardized Rejection Region

  • When the standardized test statistic has a standard normal distribution:

Symbol in \(H_a\) | Type of Test      | Rejection Region
\(<\)             | Left-tailed test  | \((-\infty, -z_\alpha]\)
\(>\)             | Right-tailed test | \([z_\alpha, \infty)\)
\(\neq\)          | Two-tailed test   | \((-\infty, -z_{\alpha/2}]\cup [z_{\alpha/2}, \infty)\)

  • When the standardized test statistic has a Student's \(t\)-distribution:

Symbol in \(H_a\) | Type of Test      | Rejection Region
\(<\)             | Left-tailed test  | \((-\infty, -t_\alpha]\)
\(>\)             | Right-tailed test | \([t_\alpha, \infty)\)
\(\neq\)          | Two-tailed test   | \((-\infty, -t_{\alpha/2}]\cup [t_{\alpha/2}, \infty)\)

  • Recall that \(z_{c}\)=NORM.S.INV(1-c) (or \(t_c\)=T.INV(1-c, df)) is the value such that \(P(Z<z_c)=1-c\) (respectively \(P(T<t_c)=1-c\)).
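
The standard normal rejection regions above can be sketched as a small Python helper. The names `z_rejection_region` and `in_region` are assumptions for illustration, and only the \(Z\) case is covered because the standard library has no \(t\)-distribution.

```python
from math import inf
from statistics import NormalDist

def z_rejection_region(alpha, tail):
    """Standardized rejection region as a list of (low, high) intervals."""
    z = NormalDist().inv_cdf  # z(1 - c) plays the role of NORM.S.INV(1 - c)
    if tail == "left":
        return [(-inf, -z(1 - alpha))]
    if tail == "right":
        return [(z(1 - alpha), inf)]
    if tail == "two":
        c = z(1 - alpha / 2)
        return [(-inf, -c), (c, inf)]
    raise ValueError("tail must be 'left', 'right' or 'two'")

def in_region(stat, region):
    """True if the standardized test statistic lies in the rejection region."""
    return any(low <= stat <= high for low, high in region)

# Illustrative level alpha = 0.10.
for tail in ("left", "right", "two"):
    print(tail, z_rejection_region(0.10, tail))
```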
344 / 377

Example: Make a Decision Using Rejection Region

Suppose the population standard deviation is \(\sigma=4.3\). At the significance level \(\alpha=0.02\), construct a standardized rejection region for the following test for the population mean.

Test \(H_0: \mu=21.6\) vs. \(H_a: \mu<21.6\).

Make a decision if a random sample has the size \(n=70\) and mean \(\bar{x}=20.5\).

Solution: Because \(H_a\) is \(\mu< 21.6\), the rejection region is the left tail.

The standard critical value is \(-z_{0.02}\)=NORM.S.INV(0.02) \(\approx -2.054\). So the standard rejection region is \((-\infty, -2.054]\).

The standardized test statistic $$z=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}=\frac{20.5-21.6}{4.3/\sqrt{70}}\approx -2.14$$ is in the rejection region, so we reject the null hypothesis.
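
The decision in this example can be reproduced in two equivalent ways: by comparing the standardized statistic with the standard critical value, or by converting the critical value back to the scale of \(\bar{x}\) with the inverse standardization formula. The Python sketch below is only an illustration, with variable names chosen for this purpose.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar, alpha = 21.6, 4.3, 70, 20.5, 0.02

se = sigma / sqrt(n)                  # standard error of the sample mean
z_crit = NormalDist().inv_cdf(alpha)  # left-tail critical value, about -2.054
z_stat = (xbar - mu0) / se            # standardized test statistic

# Inverse standardization: the critical value on the original scale.
xbar_crit = mu0 + z_crit * se

reject_standardized = z_stat <= z_crit
reject_original_scale = xbar <= xbar_crit

print(f"z = {z_stat:.2f}, critical z = {z_crit:.3f}")
print(f"critical sample mean = {xbar_crit:.3f}")
print(reject_standardized, reject_original_scale)
```

Both comparisons necessarily agree, because standardization is an increasing linear transformation.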

345 / 377

Observed Significance

  • To make a decision, one may also compare probabilities. The observed significance (\(P\)-value) of a test statistic is the probability, computed assuming the null hypothesis is true, of obtaining a sample statistic at least as extreme as the observed test statistic.

  • \(P\)-value as tail area

    Sign in \(H_a\) | \(\ne\)                 | \(<\)          | \(>\)
    \(P\)-value     | Double of the tail area | Left tail area | Right tail area

  • Making a decision by comparing the \(P\)-value with the significance level \(\alpha\):

    • reject \(H_0\) if \(p≤\alpha\) and

    • do not reject \(H_0\) if \(p>\alpha\).
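
A minimal Python sketch of the tail-area rule and the decision rule, for a standardized statistic with a standard normal distribution. The function names `z_p_value` and `decide` are hypothetical.

```python
from statistics import NormalDist

def z_p_value(z, tail):
    """P-value of a standardized Z statistic for the given form of H_a."""
    cdf = NormalDist().cdf
    if tail == "left":    # H_a contains <
        return cdf(z)
    if tail == "right":   # H_a contains >
        return 1 - cdf(z)
    if tail == "two":     # H_a contains the not-equal sign
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right' or 'two'")

def decide(p_value, alpha):
    """Decision rule from the slide: reject H0 exactly when p <= alpha."""
    return "reject H0" if p_value <= alpha else "do not reject H0"

p = z_p_value(2.0, "two")
print(f"P-value = {p:.4f}: {decide(p, 0.05)}")
```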

346 / 377

Example: Make a Decision Using the \(P\)-value

Given the following hypotheses and sample information

\(H_{0}: p=0.50\) vs. \(H_{a}: p\ne 0.50\), \(n=360\), \(\hat{p}=0.56\),

find the \(P\)-value of the test and make a decision.

Solution: Because \(H_a\) is \(\color{purple}{p\ne p_0}\) and \(\color{grey}{\hat{p}=0.56>p_0}\), the \(P\)-value is double the right tail area, that is, the \(P\)-value equals \(2P(\hat{P}>0.56)\).

We first find the standard error of the null distribution: $$\text{SE}=\sqrt{p_0(1-p_0)/n}=\sqrt{0.5\cdot0.5/360}\approx 0.0264.$$

The \(P\)-value is approximately 0.023, which can be calculated by the Excel function 2*(1-NORM.DIST(0.56,0.5,0.0264,TRUE)).

The problem does not state a significance level. At the commonly used level \(\alpha=0.05\), the \(P\)-value is smaller than \(\alpha\), so we reject the null hypothesis \(H_0\).
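
The \(P\)-value in this example is sensitive to how the standard error is rounded. The Python sketch below, with illustrative variable names, computes it once with the unrounded standard error and once with the standard error rounded to two decimals (0.03).

```python
from math import sqrt
from statistics import NormalDist

p0, n, p_hat = 0.50, 360, 0.56

se_exact = sqrt(p0 * (1 - p0) / n)  # about 0.0264
se_rounded = round(se_exact, 2)     # 0.03

def two_tailed_p(se):
    """Double the right tail area of N(p0, se^2) beyond the observed p_hat."""
    return 2 * (1 - NormalDist(p0, se).cdf(p_hat))

p_exact = two_tailed_p(se_exact)
p_rounded = two_tailed_p(se_rounded)

print(f"SE = {se_exact:.4f} -> P-value = {p_exact:.4f}")
print(f"SE = {se_rounded:.2f}   -> P-value = {p_rounded:.4f}")
```

Both values are below 0.05, so the decision at that level is the same, but the two \(P\)-values differ by roughly a factor of two. Carrying more decimals in intermediate steps avoids this.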

347 / 377

Practice: Conceptual Understanding on Hypothesis Testing

Decide whether the following statements are true or false. Explain your reasoning.

  • In case of a left-tailed test, we reject the null hypothesis if the sample statistic is significantly smaller than the hypothesized population parameter.

  • A \(P\)-value of 0.08 is more evidence against the null hypothesis than a \(P\)-value of 0.04.

  • The statement, "the \(P\)-value is 0.03", is equivalent to the statement, "there is a 3% probability that the null hypothesis is true".

  • Even though you rejected the null hypothesis, it may still be true.

  • Failing to reject null hypothesis means the null hypothesis is true.

  • That the \(P\)-value of a sample statistic is \(p=0\) means the null hypothesis cannot be true.

Questions are partially taken from Conceptual questions on hypothesis testing

348 / 377

Practice: Identify Hypotheses and Determine the Type of Test

349 / 377

Practice: Find the Rejection Region

Suppose the standardized test statistic has a \(Z\)-distribution. Find the standard rejection region for each of the following testing scenarios.

  • \(H_{0}: \mu=27 \text{ vs. } H_{a} : \mu<27\) with \(\alpha=0.01\)
  • \(H_{0}: \mu=52 \text{ vs. } H_{a} : \mu \neq 52\) with \(\alpha=0.05\)
  • \(H_{0}: \mu=-105 \text{ vs. } H_{a} : \mu>-105\) with \(\alpha=0.02\)
350 / 377

Practice: Find the \(P\)-value

Suppose we’re conducting a hypothesis test for a population mean. Find the \(P\)-value for each of the following testing scenarios with the given sample size \(n\) and test statistic \(t\).

  • \(H_{0}: \mu=25 \text { vs. } H_{a} : \mu<25\), \(n=30\), \(t=-2.43\).
  • \(H_{0}: \mu=35 \text { vs. } H_{a} : \mu>35\), \(n=50\), \(t=2.13\).
  • \(H_{0}: \mu=-7.9 \text { vs. } H_{a} : \mu\ne-7.9\), \(n=40\), \(t=-1.99\).
351 / 377

Practice: Make a Decision Based on the \(P\)-value

352 / 377

Practice: Interpret a Decision

353 / 377

Lab: Excel Functions for \(T\)-Distributions

Suppose a Student's \(T\)-distribution has \(\text{df}=n-1\) degrees of freedom.

  • To find a probability for a given \(T\)-value

    • The area of the left tail of the \(T\)-value may be calculated by the function T.DIST(t,df,true).

    • The area of the right tail of the \(T\)-value may be calculated by the function T.DIST.R(t, df).

    • The area of two tails of the \(T\)-value (t>0) may be calculated by function T.DIST.2T(t,df).

  • To find the critical value for a given probability \(p\)

    • When the area of the left tail is given, the function T.INV(p,df) may be used.

    • When the area of both tails is given, the function T.INV.2T(p,df) may be used. This function is convenient for constructing confidence intervals.
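
Python's standard library has no \(t\)-distribution, so the sketch below approximates the three tail areas by integrating the \(t\) density with Simpson's rule. This is an illustrative stand-in for T.DIST, T.DIST.RT and T.DIST.2T, not how Excel computes them; the function names are assumptions, and in practice a library routine such as `scipy.stats.t` would normally be used.

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_left_tail(t, df, steps=2000):
    """Approximate P(T < t), like T.DIST(t, df, TRUE); steps must be even."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # Simpson estimate of the area between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def t_right_tail(t, df):
    """Approximate P(T > t)."""
    return 1 - t_left_tail(t, df)

def t_two_tails(t, df):
    """Approximate P(|T| > |t|)."""
    return 2 * t_right_tail(abs(t), df)

print(f"{t_two_tails(2.262, 9):.4f}")   # close to 0.05
print(f"{t_right_tail(1.711, 24):.4f}") # close to 0.05
```

The sketch relies on `math.gamma`, which overflows for very large degrees of freedom (a few hundred), so it is meant for small classroom examples only.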

355 / 377

Hypothesis Testing for One Mean or One Proportion

356 / 377

Learning Goals for Hypothesis Tests

  • Perform an appropriate hypothesis test for a statement about mean using data from a random sample.

  • Perform an appropriate hypothesis test for a statement about proportion using data from a random sample.

357 / 377

Hypothesis Testing Procedure

  1. Check if the sample size is large enough and determine if a \(Z\)-test or \(T\)-test can be performed. For proportion, \(Z\)-test may be used. For mean, if \(\sigma\) is known, the \(Z\)-test may be used. If \(\sigma\) is unknown, the \(T\)-test may be used.

  2. State the null and alternative hypotheses. The null hypothesis always contains the equal sign (and possibly a less than or greater than symbol, depending on \(H_a\)).

  3. Set a significance level \(\alpha\). Commonly used levels are \(\alpha=0.01\), \(\alpha=0.05\) and \(\alpha=0.1\).

  4. Calculate the standardized test statistic: the \(Z\)-test statistic or the \(T\)-test statistic.

  5. Calculate the \(P\)-value, or construct the rejection region. (Drawing a picture is recommended.)

    Sign in \(H_a\) | \(\ne\)    | \(<\)       | \(>\)
    Test            | Two-tailed | Left-tailed | Right-tailed

  6. Make a test decision about the null hypothesis \(H_0\). We reject \(H_0\) if the standardized test statistic falls in the rejection region or, equivalently, if the \(P\)-value is less than or equal to the significance level \(\alpha\).

  7. State an overall conclusion.

358 / 377

Some Remarks

  • The term "test statistic" often refers to the standardized test statistic, which measures the difference between the sample statistic and the hypothesized value in units of standard error.

  • The \(P\)-value approach is slightly more popular in hypothesis testing because it gives more detailed information about the data and makes it easier to make decisions at different significance levels.

  • A hypothesis test decision may be interpreted using a confidence interval. For a two-tailed test of a mean at significance level \(\alpha\), we reject \(H_0\) exactly when the hypothesized value lies outside the \(100(1-\alpha)\%\) confidence interval.

  • A hypothesis testing procedure is comparable to a criminal trial: a defendant is considered not guilty as long as his or her guilt is not proven. See Wiki page on Statistical Hypothesis Testing for more detail on the comparison.
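
The link between a two-tailed test and a confidence interval mentioned in the remarks above can be checked numerically. The sketch below uses a \(Z\)-test for a mean with known \(\sigma\); the function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_z_test_and_ci(xbar, mu0, sigma, n, alpha):
    """Return (reject H0?, confidence interval, is mu0 outside the interval?)."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    reject = abs((xbar - mu0) / se) >= z_crit
    ci = (xbar - z_crit * se, xbar + z_crit * se)
    mu0_outside = not (ci[0] < mu0 < ci[1])
    return reject, ci, mu0_outside

# Hypothetical sample: mean 52.1, sigma 8, n 64, tested at alpha = 0.05.
for mu0 in (50, 51):
    reject, ci, outside = two_tailed_z_test_and_ci(52.1, mu0, 8, 64, 0.05)
    print(f"mu0 = {mu0}: reject = {reject}, "
          f"CI = ({ci[0]:.2f}, {ci[1]:.2f}), mu0 outside CI = {outside}")
```

In both cases the test decision matches whether the hypothesized mean falls outside the 95% confidence interval.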

359 / 377

Example: Test a Mean with Known SD Using Rejection Region (1/2)

Residents on a certain street claim that the mean speed of automobiles driving along the street is greater than the speed limit of 25 miles per hour. A random sample of 100 automobiles has a mean speed of 26 miles per hour. Assume the population standard deviation is 4 miles per hour. Is there enough evidence to support the claim of the residents at the significance level \(\alpha = 0.05\)?

Solution: The sample size is \(n=100>30\), so the sampling distribution of sample means is approximately normal by the central limit theorem.

To test the claim of the residents, we set \(H_0:\mu=25\) and \(H_a: \mu >25\).

Because \(H_a\) contains the \(>\) sign and the population standard deviation \(\sigma=4\) is known, we use a right-tailed \(Z\)-test and find the critical value from the standard normal distribution.

With the given significance level \(\alpha=0.05\), the critical value is \(z_{0.05}\approx 1.645\), given by the Excel function NORM.S.INV(1-0.05).

360 / 377

Example: Test a Mean with Known SD Using Rejection Region (2/2)

Solution: (Continued) The rejection region is the interval \([1.645, \infty)\).

Right-Tail Test for a Mean

The \(Z\)-test statistic is \(z=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}=\frac{26-25}{4/\sqrt{100}}=2.5\). It is in the rejection region, so we reject \(H_0:\mu=25\) in favor of \(H_a:\mu>25\).

At the significance level \(\alpha=0.05\), there is enough evidence to support the residents' claim that the mean speed of automobiles is above the speed limit.
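
The numbers in this example can be verified with a few lines of Python. The sketch also computes the \(P\)-value, which the worked solution does not use, to show that both approaches lead to the same decision; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar, alpha = 25, 4, 100, 26, 0.05
std_normal = NormalDist()

se = sigma / sqrt(n)                    # 0.4
z_stat = (xbar - mu0) / se              # 2.5
z_crit = std_normal.inv_cdf(1 - alpha)  # right-tail critical value
p_value = 1 - std_normal.cdf(z_stat)    # right tail area

reject_by_region = z_stat >= z_crit
reject_by_p_value = p_value <= alpha

print(f"z = {z_stat:.2f}, critical value = {z_crit:.3f}, P-value = {p_value:.4f}")
print(reject_by_region, reject_by_p_value)
```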

361 / 377

Example: Test a Mean with Unknown SD Using Rejection Region (1/2)

A car manufacturer claims that a new fuel injection design increases the mean mileage on a certain model of car above its current 28.5 miles per gallon level. Twenty-five of the new designs were checked and the mean recorded as 30.0 miles per gallon with a standard deviation of 3.8 miles per gallon. Assume that mean mileages are approximately normally distributed. Evaluate this claim at the 5% level of significance.

Solution: Since the population is approximately normally distributed, the sampling distribution of sample means is approximately normal.

To test the manufacturer's claim, we set \(H_0:\mu=\mu_0=28.5\) and \(H_a: \mu >28.5\).

Because the alternative hypothesis claims "greater" and the population standard deviation \(\sigma\) is unknown, we use a right-tailed \(T\)-test.

The degrees of freedom are \(\text{df}=25-1=24\). With the given significance level \(\alpha=0.05\), the critical value is \(t_{0.05}\approx 1.71\), given by the Excel function T.INV(1-0.05, 24).

362 / 377

Example: Test a Mean with Unknown SD Using Rejection Region (2/2)

Solution: (Continued) The rejection region is \([1.71, \infty)\).

Right-Tail Test for a Mean

The \(T\)-test statistic is $$t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{30-28.5}{3.8/\sqrt{25}}\approx 1.97.$$

Because \(t\approx 1.97>t_{0.05}\approx 1.71\), the \(T\)-test statistic is in the rejection region. We reject \(H_0\) and support the alternative hypothesis \(H_a:\mu>28.5\).

At the significance level 5%, there is enough evidence to support the claim that new designs increase the mean mileage.

363 / 377

Example: Test a Mean with Unknown SD Using \(P\)-value

Example: A certain manufacturer claims that the average number of candies in a certain sized bag that they produce is 20. To test the claim, you collect a random sample of 10 bags and find that the mean is 18 and the standard deviation is 2.7. Assume the numbers of candies are normally distributed. At the significance level \(\alpha=0.05\), does your analysis support the manufacturer's claim?

Solution: Since the population is normally distributed, the sampling distribution of the sample mean is normal.

Set \(H_0: \mu=20\) and \(H_a: \mu\neq 20\).

Since \(H_a\) has the \(\ne\) sign and the population standard deviation is unknown, we use a two-tailed \(T\)-test with \(\text{df}=10-1=9\). We will find the \(P\)-value.

The \(T\)-test statistic is \(t=\frac{18-20}{2.7/\sqrt{10}}\approx-2.342.\) Using Excel, we find that the \(P\)-value is \(p\approx\) T.DIST.2T(2.342,9) \(\approx 0.0439\).

Since the \(P\)-value is smaller than the significance level, we reject \(H_0\). At the significance level 0.05, there is enough evidence to reject the manufacturer's claim.

364 / 377

Example: Test a Mean Using \(P\)-value from a Data Set (1/2)

An instructor would like to know if the students enrolled in a math course in the current semester performed better than students in the last semester. The mean final exam score from last semester is 75.5. The final exam scores of 40 randomly selected students from the current semester were obtained:

93 88 69 74 76 81 78 77 74 63 67 81 80 82 68 88 76 69 75 78
75 77 94 87 74 88 63 75 94 88 91 77 76 68 80 88 68 83 72 72

Do the data provide evidence that the students in this semester performed significantly better on the final than last semester?

Solution: The sample size is \(n=40\) which is large enough so that the sampling distribution for the sample mean is approximately normal. We will take the \(P\)-value approach.

Set \(H_0: \mu=75.5\) and \(H_a: \mu>75.5\).

Using Excel functions AVERAGE() and STDEV.S(), we find the sample mean is \(\bar{x}\approx 78.17\) and sample standard deviation is \(s\approx 8.39\).

365 / 377

Example: Test a Mean Using \(P\)-value from a Data Set (2/2)

Solution: (Continued): The \(T\)-test statistic is calculated by $$t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{(78.17-75.5)}{8.39/\sqrt{40}}\approx 2.013.$$

Because \(H_a\) contains the \(>\) sign and \(\sigma\) is unknown, we use the right-tailed \(T\)-test.

The degrees of freedom are \(\text{df}=40-1=39\). The \(P\)-value is the right tail area under the \(T\)-curve, that is, T.DIST.R(2.013, 39) \(\approx 0.0255\).

Since the \(P\)-value is less than 5%, at the 5% level of significance, we may reject \(H_0\). So at 5% level of significance, there is enough evidence to support the claim that the students in this semester performed significantly better on the final than last semester.

However, using the 2% level of significance, with the given data, we fail to reject \(H_0\). Then, at the 2% level of significance, there is not enough evidence to support the claim.
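
The summary statistics and the test statistic for this data set can be reproduced with Python's standard library, where `statistics.mean` and `statistics.stdev` play the roles of AVERAGE() and STDEV.S(). This is only a sketch; the \(P\)-value step is left to T.DIST.RT because the standard library has no \(t\)-distribution.

```python
from math import sqrt
from statistics import mean, stdev

scores = [
    93, 88, 69, 74, 76, 81, 78, 77, 74, 63, 67, 81, 80, 82, 68, 88, 76, 69, 75, 78,
    75, 77, 94, 87, 74, 88, 63, 75, 94, 88, 91, 77, 76, 68, 80, 88, 68, 83, 72, 72,
]
mu0 = 75.5  # mean final exam score from last semester

n = len(scores)
xbar = mean(scores)  # sample mean
s = stdev(scores)    # sample standard deviation (n - 1 in the denominator)
t_stat = (xbar - mu0) / (s / sqrt(n))

print(f"n = {n}, mean = {xbar:.3f}, s = {s:.2f}, t = {t_stat:.3f}")
```

Without rounding the mean and standard deviation first, the statistic comes out near 2.02, which agrees with the 2.013 in the solution up to rounding of the inputs.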

366 / 377

Example: Fairness of a Coin

Suppose you want to determine if a coin is fair. You toss the coin 50 times and observe 16 heads and 34 tails. If the coin is fair, the probability of getting 16 heads or fewer is about 0.008 = 0.8%. At the significance level 0.01, do you think that the coin is fair?

Solution: Since \(n\hat{p}=16\) and \(n(1-\hat{p})=34\), a \(Z\)-test is valid.

To test if the coin is fair, we set the null hypothesis as \(H_0\): \(p=0.5\). The experiment suggests that we should set the alternative hypothesis as \(H_a\): \(p<0.5\).

The test statistic is \(\hat{p}=\frac{16}{50}=0.32\) and the standardized test statistic is $$z=\dfrac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}=\dfrac{0.32-0.5}{\sqrt{0.5(1-0.5)/50}}\approx-2.55.$$

From \(H_a\), we know that the test is left-tailed. The \(P\)-value is the probability, given in the problem, of getting 16 heads or fewer when the coin is fair: \(P\approx 0.008\).

Because the significance level is \(\alpha=0.01\) and \(P=0.008<0.01=\alpha\), we reject the null hypothesis \(H_0\). We conclude that, at the significance level 0.01, there is enough evidence to claim that the coin is unfair.
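
The 0.008 quoted in the problem is an exact binomial tail probability. The Python sketch below recomputes it and compares it with the left tail area obtained from the standardized statistic under the normal approximation; variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, heads, p0 = 50, 16, 0.5

# Exact probability of at most 16 heads in 50 tosses of a fair coin.
exact_p = sum(comb(n, k) for k in range(heads + 1)) / 2 ** n

# Normal approximation based on the standardized test statistic.
z = (heads / n - p0) / sqrt(p0 * (1 - p0) / n)
normal_p = NormalDist().cdf(z)

print(f"exact binomial tail  = {exact_p:.4f}")
print(f"z = {z:.3f}, normal tail = {normal_p:.4f}")
```

The two values differ slightly because the normal curve only approximates the binomial distribution, but both are below \(\alpha=0.01\), so the decision is the same.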

367 / 377

Draw a normal curve to show the rejection region.

Example: Proportion of Newborns (1/2)

Globally the long-term proportion of newborns who are male is 51.46%. A researcher believes that the proportion of boys at birth changes under severe economic conditions. To test this belief randomly selected birth records of 5,000 babies born during a period of economic recession were examined. It was found in the sample that 52.55% of the newborns were boys. Determine whether there is sufficient evidence, at the 10% level of significance, to support the researcher’s belief.

Solution: Since \(n\hat{p}\approx 2628\) and \(n(1-\hat{p})\approx 2372\), a \(Z\)-test is valid. We will use the \(P\)-value to test the hypothesis.

To test the researcher's claim, we set the null hypothesis as \(H_0\): \(p=0.5146\). The researcher's belief suggests that we should set the alternative hypothesis as \(H_a\): \(p\neq 0.5146\).

The standardized test statistic is $$z=\dfrac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}=\dfrac{0.5255-0.5146}{\sqrt{0.5146(1-0.5146)/5000}}\approx 1.5422.$$

368 / 377

Example: Proportion of Newborns (2/2)

Solution: (Continued) From \(H_a\), we know that the test is two-tailed. The \(P\)-value is then $$P=2\cdot(1-P(Z<1.5422))\approx 0.123.$$

Since the significance level is \(\alpha=0.1\) and the \(P\)-value \(P\approx 0.123>0.1=\alpha\), we fail to reject the null hypothesis \(H_0\).

At the significance level 0.1, there is not enough evidence to support the researcher's belief that the proportion of newborns who are male changes.

369 / 377

A Remark on the SE for Sample Proportion in Hypothesis Testing

In some books, the standard error of the sampling distribution of sample proportions, assuming that \(p=p_0\), is calculated using the approximation $$\sigma_{\hat{P}}\approx\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

A possible explanation is that using this value for the SE makes the test consistent with the confidence-interval approach to hypothesis testing when a two-tailed test is performed.
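
To see how much the choice of standard error matters, the sketch below applies both versions to the newborn example (\(p_0=0.5146\), \(\hat{p}=0.5255\), \(n=5000\)). It is an illustration in Python with made-up variable names, not a recommendation of either convention.

```python
from math import sqrt
from statistics import NormalDist

p0, p_hat, n, alpha = 0.5146, 0.5255, 5000, 0.10

se_null = sqrt(p0 * (1 - p0) / n)          # SE computed from the hypothesized p0
se_sample = sqrt(p_hat * (1 - p_hat) / n)  # SE computed from the sample proportion

def two_tailed_p(z):
    """Two-tailed P-value of a standardized Z statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

z_null = (p_hat - p0) / se_null
z_sample = (p_hat - p0) / se_sample
p_null = two_tailed_p(z_null)
p_sample = two_tailed_p(z_sample)

print(f"SE from p0:    z = {z_null:.4f}, P-value = {p_null:.4f}")
print(f"SE from p-hat: z = {z_sample:.4f}, P-value = {p_sample:.4f}")
```

Here the two standard errors are nearly equal, so both \(P\)-values exceed 0.1 and lead to the same decision. The gap grows when \(\hat{p}\) is far from \(p_0\).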

370 / 377

Practice: Testing the Mean Lift Weight (with Known SD)

A college football coach thought that his players could bench press a mean weight of 275 pounds. It is known that the standard deviation is 55 pounds. Three of his players thought that the mean weight was more than that amount. They asked 30 of their teammates for their estimated maximum lift on the bench press exercise. The mean of their maximum lift is 286.2.

Conduct a hypothesis test using a 2.5% level of significance to determine if the bench press mean is more than 275 pounds.
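After working the problem by hand, the test statistic and \(P\)-value can be checked with a short script. This is a sketch under the assumption that a right-tailed \(Z\)-test with the known standard deviation is appropriate; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 275.0     # mean claimed by the coach (H0: mu = 275)
sigma = 55.0    # known population standard deviation
n = 30          # number of teammates asked
x_bar = 286.2   # sample mean
alpha = 0.025   # significance level

# z statistic using the known standard deviation.
z = (x_bar - mu0) / (sigma / sqrt(n))

# Right-tailed P-value, since Ha is mu > 275.
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.4f}, P-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```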

371 / 377

Practice: Testing the Mean Age of Students (with Unknown SD)

A college report says that the mean age of students is 23.4 years. An instructor thinks that the mean age is less than 23.4. He randomly surveyed 50 students and found that the sample mean is 21.5 and the sample standard deviation is 1.9. At the significance level \(\alpha=0.025\), is there enough evidence to support the instructor's claim?

372 / 377

Practice: Testing the Average Household Size

The average household size in a certain region several years ago was 3.14 persons. A sociologist wishes to test, at the 5% level of significance, whether it is different now. Perform the test using the information collected by the sociologist: in a random sample of 75 households, the average size was 2.98 persons, with sample standard deviation 0.82 person.

373 / 377

Practice: Testing the Mean Placement Test Score

The mean score on a 25-point placement exam in mathematics used for the past two years at a large state university is 14.3. The placement coordinator wishes to test whether the mean score on a revised version of the exam differs from 14.3. She gives the revised exam to 30 entering freshmen early in the summer; the mean score is 14.6 with standard deviation 2.4.

  1. Perform the test at the 10% level of significance using the critical value approach.
  2. Compute the observed significance of the test.
  3. Perform the test at the 10% level of significance using the p-value approach.
374 / 377

Practice: Testing the Mean Recovery Time

The average number of days to complete recovery from a particular type of knee operation is 123.7 days. From his experience a physician suspects that use of a topical pain medication might be lengthening the recovery time. He randomly selects the records of seven knee surgery patients who used the topical medication. The times to total recovery were:

128, 135, 121, 142, 126, 151, 123

Assuming a normal distribution of recovery times, perform the relevant test of hypotheses at the 10% level of significance.

Would the decision be the same at the 5% level of significance?

375 / 377

Lab: Excel Functions for Normal Distributions

  • Let \(Z\) be a standard normal random variable. In Excel, \(P(Z<z)\) is given by NORM.S.DIST(z,TRUE).

  • Let \(X\) be a normal random variable with mean \(\mu\) and standard deviation \(\sigma\), that is \(X\sim \mathcal{N}(\mu, \sigma^2)\). In Excel, \(P(X<x)\) is given by NORM.DIST(x,mean,sd,TRUE).

  • When a cumulative probability \(p=P(X<x)\) of a normal random variable \(X\) is given, we can find \(x\) using NORM.INV(p,mean,sd).

  • When a cumulative probability \(p=P(Z<z)\) of a standard normal random variable \(Z\) is given, we can find \(z\) using NORM.S.INV(p).
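For readers working outside Excel, the four functions above have close counterparts in Python's standard library. The sketch below uses `statistics.NormalDist`; the mean 100 and standard deviation 15 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

Z = NormalDist()                   # standard normal
X = NormalDist(mu=100, sigma=15)   # hypothetical mean and sd

p_z = Z.cdf(1.96)          # like NORM.S.DIST(1.96, TRUE)
p_x = X.cdf(130)           # like NORM.DIST(130, 100, 15, TRUE)
x_val = X.inv_cdf(0.975)   # like NORM.INV(0.975, 100, 15)
z_val = Z.inv_cdf(0.975)   # like NORM.S.INV(0.975)

print(p_z, p_x, x_val, z_val)
```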

376 / 377

Lab: Excel Functions for \(T\)-Distributions

Suppose a Student's \(T\)-distribution has \(\text{df}=n-1\) degrees of freedom.

  • To find a probability for a given \(T\)-value

    • The area of the left tail of the \(T\)-value may be calculated by the function T.DIST(t,df,true).

    • The area of the right tail of the \(T\)-value may be calculated by the function T.DIST.RT(t,df).

    • The area of the two tails of the \(T\)-value (t>0) may be calculated by the function T.DIST.2T(t,df).

  • To find the critical value for a given probability \(p\)

    • When the area of the left tail is given, the function T.INV(p,df) may be used.

    • When the area of both tails is given, the function T.INV.2T(p,df) may be used. This function is useful for constructing confidence intervals.
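To illustrate what these Excel functions compute, the sketch below approximates them using only Python's standard library. It is not how Excel works internally: the tail areas are obtained by integrating the density with Simpson's rule, and critical values are found by bisection. All function names are hypothetical helpers introduced here for illustration.

```python
from math import ceil, exp, lgamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df):
    """Left-tail area P(T < t), like T.DIST(t, df, TRUE)."""
    a = abs(t)
    steps = 2 * max(1, ceil(a / 0.02))   # even count, step size at most 0.01
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                 # Simpson's rule on [0, |t|]
    return 0.5 + area if t >= 0 else 0.5 - area

def t_right(t, df):
    """Right-tail area, like T.DIST.RT(t, df)."""
    return 1 - t_cdf(t, df)

def t_two_tails(t, df):
    """Two-tail area for t > 0, like T.DIST.2T(t, df)."""
    return 2 * t_right(abs(t), df)

def t_inv(p, df):
    """Value t whose left-tail area is p, like T.INV(p, df)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if p < 0.5:
        return -t_inv(1 - p, df)         # use symmetry about 0
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:             # expand until the target is bracketed
        hi *= 2
    for _ in range(60):                  # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_inv_two_tails(p, df):
    """Positive t whose two tails have total area p, like T.INV.2T(p, df)."""
    return t_inv(1 - p / 2, df)

# Example with df = 10 (an arbitrary choice): critical value for a 95% interval.
print(round(t_inv_two_tails(0.05, 10), 3))
```

If SciPy is available, `scipy.stats.t.cdf`, `scipy.stats.t.sf`, and `scipy.stats.t.ppf` provide the left-tail area, right-tail area, and inverse directly, and would normally be preferred over a hand-written approximation.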

377 / 377