Fei Ye's Website

Measures of Central Tendency and Variability

class: center, middle, inverse, title-slide

.title[
# Measures of Central Tendency and Variability
]
.subtitle[
## MA336 Statistics<br /><br />
]
.author[
### Fei Ye <br /><br /> Department of Mathematics and Computer Science<br /><br />
]
.date[
### July 2022
]

---

## Learning goals

- Create and interpret boxplots as a means of summarizing non-symmetric data.

- Calculate and explain the purpose of measures of center (mean, median) and variability (standard deviation, interquartile range).

- Explain the impact of outliers on summary statistics such as mean, median and standard deviation.

---
class: center middle

# Quartiles and Boxplot

---

## Median, Quartiles, Interquartile Range and Outliers

- The three **quartiles**, `$Q_1$`, `$Q_2$`, and `$Q_3$` are numbers in an ordered data set that divide the data set into four equal parts. The second quartile is known as the **median**.

- **Interquartile Range (IQR for short)** is the measure of variation when using the median to measure center. It is defined as the difference of the third and the first quartiles: `$\text{IQR}=Q_3-Q_1$`.

- When the center and the spread are measured by the median and the IQR, a value in the data is considered an **outlier** if the value is
  - less than the lower fence `$\text{fence}_{lower}=Q_1 - 1.5 \cdot \text{IQR}$`
   or
  - greater than the upper fence `$\text{fence}_{upper}=Q_3 + 1.5 \cdot \text{IQR}$`.

**Note:** An outlier in this definition is also called a **mild outlier**. An outlier that is less than the extreme lower fence `$\text{extreme fence}_{lower}=Q_1 - 3 \cdot \text{IQR}$` or greater than the extreme upper fence `$\text{extreme fence}_{upper}=Q_3 + 3 \cdot \text{IQR}$` is called an **extreme outlier**.

- The minimum, `$Q_1$`, `$Q_2$`, `$Q_3$` and maximum are known as the "**five-number summary**" of the data set.

- The difference of maximum and minimum is called the **range**.
  
---

## Example: Median, IQR and Outliers

Find the median, quartiles, IQR and outliers (if they exist) of the heights of a sample of 15 trees.

.center[
70, 65, 63, 72, 81, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75
]

**Solution:**

- Sort the data set from small to large.
.center[
63, 65, 66, 69, 70, 72, 75, 75, 75, 76, 76, 79, 80, 81, 83
]
- Find the median `$Q_2$`. The sample size is 15. The middle of the ordered data set is the `$\lceil 15/2 \rceil=8$`th number, which is 75.
- Find `$Q_1$` and `$Q_3$`. `$Q_1$` is the median of the numbers positioned before the median in the ordered list. `$Q_3$` is the median of the numbers positioned after the median. In this example, `$Q_1$` is the 4th number, 69. `$Q_3$` is the 4th number from the end, 79.
- `$\text{IQR}=Q_3-Q_1=79-69=10$`.
- Since `$Q_1-1.5\text{IQR}=69-1.5\cdot 10=54$` and `$Q_3+1.5\text{IQR}=79+1.5 \cdot 10=94$`, and every value lies between 54 and 94, there is no outlier in this sample.
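
The steps of this solution can be sketched in Python. This is only an illustration under stated assumptions: the helper names `quartiles_by_halves` and `mild_outliers` are made up for this sketch, and the halves are split by position in the sorted list (the middle value is left out when the sample size is odd), as in the hand calculation.

```python
from statistics import median

def quartiles_by_halves(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method (hypothetical helper)."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]        # positions before the middle of the sorted list
    upper = xs[n - half:]    # positions after the middle of the sorted list
    return median(lower), median(xs), median(upper)

def mild_outliers(data):
    """Return the values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = quartiles_by_halves(data)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

heights = [70, 65, 63, 72, 81, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75]
q1, q2, q3 = quartiles_by_halves(heights)
print(q1, q2, q3, q3 - q1)     # 69 75 79 10
print(mild_outliers(heights))  # []
```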

---

## Practice: Five-number Summary, Range and IQR

.embedwrap[
<iframe src="https://www.myopenmath.com/embedq2.php?id=899254&amp;seed=2020&amp;showansafter" width="100%" height="550px" data-external="1"></iframe>
]

---

## Box Plot

- A **box plot** shows a "five-number summary" of the data set. It contains a box, two whiskers and dots (for outliers).

- To create the boxplot for a distribution,
  
  - Draw a box from `$Q_1$` to `$Q_3$`.
  
  - Draw a vertical line in the box at the median.
  
  - Extend a whisker from `$Q_1$` to the smallest value that is not an outlier and from `$Q_3$` to the largest value that is not an outlier.
  
  - Indicate outliers with a solid dot.

---

## Example: Box plot - ages of Best Actor Oscar winners (1/2)

Create the boxplot for the ages of 32 Best Actor Oscar winners (1970–2001).

.center[
31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42, 43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76
]

**Solution:** We may use Excel to find the five-number summary.

- The quartiles are 
  `$$Q_2=42.5,\quad Q_1=37.5,\quad Q_3=49.5.$$`
  The interquartile range and the bounds for mild outliers are
  `$$\text{IQR}=12, \quad Q_1-1.5\text{IQR}= 19.5, \quad Q_3+1.5\text{IQR}=67.5.$$`

- The smallest number that is not an outlier is 31. The largest number that is not an outlier is 61. Those two numbers bound the whiskers.

- The number 76 is a mild outlier because
  `$$Q_3+1.5\text{IQR}< 76 < Q_3+3\text{IQR}.$$`
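
As a rough check of the numbers on this slide, the pieces of the boxplot can be computed directly. The function `boxplot_summary` and its dictionary keys are hypothetical names for this sketch, and the quartiles are assumed to follow the median-of-halves method used by hand in this course (Excel's quartile functions may give slightly different values).

```python
from statistics import median

ages = [31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42,
        43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76]

def boxplot_summary(data):
    """Compute the box, whisker ends and outliers of a boxplot (hypothetical helper)."""
    xs = sorted(data)
    half = len(xs) // 2
    q1, q2, q3 = median(xs[:half]), median(xs), median(xs[len(xs) - half:])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr        # fences for mild outliers
    ext_lo, ext_hi = q1 - 3 * iqr, q3 + 3 * iqr    # fences for extreme outliers
    inside = [x for x in xs if lo <= x <= hi]      # values that are not outliers
    return {
        "box": (q1, q2, q3),
        "whiskers": (inside[0], inside[-1]),
        "mild_outliers": [x for x in xs if ext_lo <= x < lo or hi < x <= ext_hi],
        "extreme_outliers": [x for x in xs if x < ext_lo or x > ext_hi],
    }

summary = boxplot_summary(ages)
print(summary["box"])               # (37.5, 42.5, 49.5)
print(summary["whiskers"])          # (31, 61)
print(summary["mild_outliers"])     # [76]
print(summary["extreme_outliers"])  # []
```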

---

## Example: Box plot - ages of Best Actor Oscar winners (2/2)

**Solution: (continued)**

- The boxplot is shown below.

.center[
<div id="htmlwidget-04b98be623b9e76ade87" style="width:864px;height:345.6px;" class="plotly html-widget"></div>
<script type="application/json" data-for="htmlwidget-04b98be623b9e76ade87">{"x":{"data":[{"x":[43,40,48,48,56,38,60,32,40,42,37,76,39,55,45,35,61,33,51,32,43,55,42,37,38,31,45,60,46,40,36,47],"y":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],"hoverinfo":"x","type":"box","fillcolor":"rgba(255,255,255,1)","marker":{"opacity":null,"outliercolor":"rgba(0,0,0,1)","line":{"width":1.88976377952756,"color":"rgba(0,0,0,1)"},"size":5.66929133858268},"line":{"color":"rgba(51,51,51,1)","width":1.88976377952756},"showlegend":false,"xaxis":"x","yaxis":"y","orientation":"h","frame":null}],"layout":{"margin":{"t":42.5747297129253,"r":10.6268161062682,"b":70.1446757486248,"l":15.9402241594022},"plot_bgcolor":"rgba(255,255,255,1)","paper_bgcolor":"rgba(255,255,255,1)","font":{"color":"rgba(0,0,0,1)","family":"","size":21.2536322125363},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[28.75,78.25],"tickmode":"array","ticktext":["31.0","37.5","42.5","49.5","61.0","76.0"],"tickvals":[31,37.5,42.5,49.5,61,76],"categoryorder":"array","categoryarray":["31.0","37.5","42.5","49.5","61.0","76.0"],"nticks":null,"ticks":"outside","tickcolor":"rgba(51,51,51,1)","ticklen":5.31340805313408,"tickwidth":0.966074191478924,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":17.0029057700291},"tickangle":-0,"showline":true,"linecolor":"rgba(0,0,0,1)","linewidth":1.59402241594022,"showgrid":false,"gridcolor":null,"gridwidth":0,"zeroline":false,"anchor":"y","title":{"text":"Age","font":{"color":"rgba(0,0,0,1)","family":"","size":21.2536322125363}},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[0.4,1.6],"tickmode":"array","ticktext":[""],"tickvals":[1],"categoryorder":"array","categoryarray":[""],"nticks":null,"ticks":"","tickcolor":null,"ticklen":5.31340805313408,"tickwidth":0,"showticklabels":false,"tickfont":{"color":null,"family":null,"size":0},"tickangle":-0,"
showline":false,"linecolor":null,"linewidth":0,"showgrid":false,"gridcolor":null,"gridwidth":0,"zeroline":false,"anchor":"x","title":{"text":"","font":{"color":"rgba(0,0,0,1)","family":"","size":21.2536322125363}},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":null,"line":{"color":null,"width":0,"linetype":[]},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":false,"legend":{"bgcolor":"rgba(255,255,255,1)","bordercolor":"transparent","borderwidth":2.74874731567645,"font":{"color":"rgba(0,0,0,1)","family":"","size":17.0029057700291}},"hovermode":"closest","barmode":"relative"},"config":{"doubleClick":"reset","modeBarButtonsToAdd":["hoverclosest","hovercompare"],"showSendToCloud":false},"source":"A","attrs":{"46a8cae28be":{"x":{},"y":{},"type":"box"}},"cur_data":"46a8cae28be","visdat":{"46a8cae28be":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot","plotly_sunburstclick"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>
]

---

## Practice: five-number summary from the boxplot

---
class: center middle

# Mean and Standard Deviation

---

## Notations and Calculations about Mean

- Sigma notation: in math, we denote the sum of values  `$x_1$`, `$x_2$`, `$\dots$`, `$x_n$` of a variable `$x$` by `$\sum\limits_{i=1}^n x_i$` or simply by `$\sum x$`.

- The **population mean** is `$\mu= \frac{\sum x}{N}$`, where `$N$` is the **population size**, i.e., the number of elements in the population.

The notation `$\mu$` reads as mu.

- The **sample mean** is `$\bar{x}=\frac{\sum{x}}{n}$`, where `$n$` is the **sample size**. The notation `$\bar{x}$` reads as "x-bar".

---

## Example: Mean city mpg

Find the mean city mpg for a sample of 10 cars.
.center[
18, 21, 20, 21, 16, 18, 18, 18, 16, 20
]

**Solution:** The mean is

`$$\bar{x}=\frac{18+21+20+21+16+18+18+18+16+20}{10}=18.6.$$`

The mean mpg of the 10 cars is 18.6 mpg.

---

## Weighted Mean

- The weighted mean of a set of numbers `$\{x_1, \dots, x_n\}$` with weights `$w_1$`, `$w_2$`, ..., `$w_n$` is defined as  `$$\frac{\sum w_ix_i}{\sum w_i}.$$`

- The mean of a frequency table is the weighted mean `$\bar{x}=\frac{\sum f x}{n}$`, where `$x$` is an element with frequency `$f$` and `$n=\sum f$` is the sample size.

---

## Example: Course overall grade

In a course, the overall grade is determined in the following way: the homework average counts for 10%, the quiz average counts for 10%, the test average counts for 50%, and the final exam counts for 30%. What is the overall grade of a student who earned 92 on homework, 95 on quizzes, 90 on tests and 93 on the final?

**Solution:** The overall grade is the weighted mean

`$$\frac{\sum w_ix_i}{\sum w_i}=\frac{0.1\cdot 92+0.1\cdot 95+0.5\cdot 90+0.3\cdot 93}{0.1+0.1+0.5+0.3}=91.6.$$`
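
The same calculation can be written as a small Python function. This is a minimal sketch: `weighted_mean` is a hypothetical helper name, not a built-in. The second call treats frequencies as weights, using the city mpg sample from the earlier example grouped into a frequency table.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) (hypothetical helper)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Course grade: homework, quizzes, tests, final with weights 10%, 10%, 50%, 30%.
grade = weighted_mean([92, 95, 90, 93], [0.1, 0.1, 0.5, 0.3])
print(round(grade, 1))  # 91.6

# Frequency table of the city mpg sample: 16 twice, 18 four times, 20 twice, 21 twice.
mpg_mean = weighted_mean([16, 18, 20, 21], [2, 4, 2, 2])
print(mpg_mean)         # 18.6
```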

???
Show how to use Excel

---

## Practice: Mean petal width

Find the average petal width for a sample of 10 iris flowers.
  
.center[
0.2, 2.1, 0.2, 1.7, 2.3, 0.3, 1.2, 0.2, 1.8, 2.3
]

---

## Practice: Calculate a mean using the weighted mean formula

Find the mean from the dot plot of sepal length for a sample of 10 iris flowers.
.center[
<img src="data:image/png;base64,#MA336-Week-3-Measure-Center-Spread_files/figure-html/unnamed-chunk-8-1.png" width="576" />
]

<!-- 
Practice: Estimate the mean from a histogram

Estimate the average highway mpg using the histogram of a sample of 20 cars.

.center[
<img src="data:image/png;base64,#MA336-Week-3-Measure-Center-Spread_files/figure-html/unnamed-chunk-9-1.png" width="576" />
] 
-->

---

## Practice: Weighted mean - calculate final grade

---

## Measure of Variation about Population Mean

- The **deviation** of an entry `$x$` in a population data set is the difference `$x-\mu$`, where `$\mu$` is the mean of the population.
  
- The **population variance** of a population of `$N$` entries is defined as
  $$
    \text{VAR.P}=\sigma^2=\dfrac{\sum(x-\mu)^2}{N}.
  $$

- The **population standard deviation** is
  $$
    \text{STDEV.P}=\sigma=\sqrt{\dfrac{\sum(x-\mu)^2}{N}}.
  $$

---

## Measure of Variation about Sample Mean

- The **deviation** of an entry `$x$` in a sample data set is the difference `$x-\bar{x}$`, where `$\bar{x}$` is the mean of the sample.

- The **sample variance** and **sample standard deviation** are defined similarly
  $$
    \text{VAR.S}=s^2=\dfrac{\sum(x-\bar{x})^2}{n-1}, \qquad
    \text{STDEV.S}=s=\sqrt{\dfrac{\sum(x-\bar{x})^2}{n-1}},
  $$
  where `$n$` is the sample size.

- **Rounding rule:** for mean, variance and standard deviation, we keep at least one more digit than the accuracy of the data set.

**Note:** To measure the spread, one may also use the **mean absolute deviation**
`$$MAD=\dfrac{\sum |x-\bar{x}|}{n}.$$`
However, the standard deviation has better properties in applications.

???
Show how to use Excel to find SD

---

## Example: Standard deviation - ages of Oscar winners

Find the mean and standard deviation of the ages of a sample of 32 Best Actor Oscar winners (1970–2001).

.center[
31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42, 43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76
]

**Solution:** We use the Excel functions `AVERAGE()` and `STDEV.S()` to find the mean and sample standard deviation respectively.
The mean is 44.7. The sample standard deviation is 10.3.
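
For readers without Excel, the same values can be reproduced in Python. The sketch below computes the sample mean and sample standard deviation from the defining formulas and then compares them with `mean` and `stdev` from Python's standard `statistics` module, which are assumed here to play the roles of `AVERAGE()` and `STDEV.S()`.

```python
from math import sqrt
from statistics import mean, stdev

ages = [31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42,
        43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76]

n = len(ages)
x_bar = sum(ages) / n                                     # sample mean
s = sqrt(sum((x - x_bar) ** 2 for x in ages) / (n - 1))   # sample SD, divisor n - 1

print(round(x_bar, 1), round(s, 1))                 # 44.7 10.3
print(round(mean(ages), 1), round(stdev(ages), 1))  # 44.7 10.3
```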

.footmark[
Source: https://www.geogebra.org/m/DS6PUaXy
]

---

## Practice: Standard deviation

A *sample* of GPAs from ten students randomly chosen from a college is recorded as follows.
.center[1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33]

Find the standard deviation of this sample.

---

## Mean and Standard Deviation under Linear Transformation

- When we add a fixed number `$c$` to every value in a data set, the mean increases by `$c$`, but the standard deviation does not change.

- When we multiply every value in a data set by a factor `$k$`, the mean is multiplied by `$k$` and the standard deviation is multiplied by `$|k|$`.
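
Both rules can be checked numerically. The sketch below reuses the city mpg sample from the earlier example; the shift `c = 10` and the factor `k = -2` are arbitrary choices for illustration, and the negative factor is chosen to show that the standard deviation scales by the absolute value of `k`.

```python
from math import isclose
from statistics import mean, stdev

data = [18, 21, 20, 21, 16, 18, 18, 18, 16, 20]  # city mpg sample
c, k = 10, -2                                    # arbitrary shift and factor

shifted = [x + c for x in data]
scaled = [k * x for x in data]

# Adding c moves the mean by c and leaves the standard deviation unchanged.
print(isclose(mean(shifted), mean(data) + c))    # True
print(isclose(stdev(shifted), stdev(data)))      # True

# Multiplying by k scales the mean by k and the standard deviation by |k|.
print(isclose(mean(scaled), k * mean(data)))           # True
print(isclose(stdev(scaled), abs(k) * stdev(data)))    # True
```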

.footmark[
Source: https://www.geogebra.org/m/r25rDxYZ
]
---

## Effect of Changes of Data on Statistical Measures

.footmark[
Source: https://www.geogebra.org/m/fenbj3qZ
]

---

## Practice: Standard deviation under a transformation

A sample of the highest temperatures of 10 days has a standard deviation of `$5^\circ\mathrm{C}$`.

1. If we want to know the standard deviation in Fahrenheit, do we need to recalculate using the sample?

2. What is the standard deviation in Fahrenheit?
  
---

## The Empirical Rule

If a data set has an **approximately bell-shaped** distribution, then

1. approximately 68% of the data lie within one standard deviation of the mean.

2. approximately 95% of the data lie within two standard deviations of the mean.

3. approximately 99.7% of the data lie within three standard deviations of the mean.

.center[
![:resize Empirical Rule, 35%](data:image/png;base64,#Figures/Empirical-Rule.jpg)
]

.footmark[
Image source: [Figure 2.16 "The Empirical Rule" in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s06-05-the-empirical-rule-and-chebysh.html#fwk-shafer-ch02_s05_s01_f02)
]

---

## Chebyshev’s Theorem

For any numerical data set, at least `$1-1/k^2$`
of the data lie within `$k$` standard deviations of the mean, where `$k$` is any positive whole number that is at least 2.

.center[
![:resize Chebyshev's Theorem, 45%](data:image/png;base64,#Figures/Chebyshev.jpg)
]

.footmark[
Image source: [Figure 2.19 "Chebyshev’s Theorem" in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s06-05-the-empirical-rule-and-chebysh.html#fwk-shafer-ch02_s05_s02_f01)
]

---

## Example: Applications of the Empirical Rule

A population data set with a bell-shaped distribution has mean `$\mu = 6$` and standard deviation `$\sigma = 2$`. Find the approximate proportion of observations in the data set that lie:

1. between 4 and 8;
2. below 4.

**Solution:** By the Empirical Rule, about 68% of the data lie between 6-2=4 and 6+2=8. Since the distribution is symmetric, about 34% of the data lie between 4 and 6, and about 34% lie between 6 and 8. Because half of the data lie below the mean 6, about 50%-34%=16% of the data lie below 4.
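
The proportions in this example can be compared with an exactly normal model. The sketch below assumes the population follows a normal distribution with mean 6 and standard deviation 2, which is stronger than "approximately bell-shaped", and uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Assumption: an exactly normal distribution with mu = 6 and sigma = 2.
dist = NormalDist(mu=6, sigma=2)

between_4_and_8 = dist.cdf(8) - dist.cdf(4)  # within one standard deviation
below_4 = dist.cdf(4)                        # more than one SD below the mean

print(round(between_4_and_8, 4))  # 0.6827
print(round(below_4, 4))          # 0.1587
```

The exact normal values, about 68.3% and 15.9%, agree with the Empirical Rule's rounded figures of 68% and 16%.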

---

## Example: Applications of Chebyshev's Theorem

A sample data set has mean `$\bar{x}=6$`
and standard deviation `$s = 2$`. Find the minimum proportion of observations in the data set that must lie
between 2 and 10.

**Solution:** The interval from 2 to 10 is `$\bar{x}\pm 2s$`, so `$k=2$`. By Chebyshev's theorem, at least `$1-1/2^2=3/4$`, that is 75%, of the data lie between `$\bar{x}-2s=2$` and `$\bar{x}+2s=10$`.
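
The bound in this example can be sketched in Python and compared with real data. The helper names `chebyshev_bound` and `proportion_within` are hypothetical; the data used for the comparison are the Oscar winner ages from the earlier example, with the sample standard deviation as on this slide.

```python
from statistics import mean, stdev

def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

def proportion_within(data, k):
    """Observed proportion of data within k sample standard deviations of the mean."""
    m, s = mean(data), stdev(data)
    return sum(1 for x in data if abs(x - m) <= k * s) / len(data)

# Example on this slide: the interval from 2 to 10 is x_bar +/- 2s, so k = 2.
print(chebyshev_bound(2))  # 0.75

ages = [31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42,
        43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76]
print(proportion_within(ages, 2))  # 0.96875
```

The observed proportion (31 of 32 ages) is well above 75%: Chebyshev's theorem gives a guaranteed minimum, not an estimate.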

---

## Practice: The empirical rule

---

## Practice: Chebyshev’s Theorem

A sample data set has mean `$\bar{x}=10$` and standard deviation `$s = 3$`. Find the minimum proportion of observations in the data set that must lie between 1 and 19.

.footmark[
    Source: [2.5 The Empirical Rule and Chebyshev’s Theorem in Introductory Statistics](https://saylordotorg.github.io/text_introductory-statistics/s06-05-the-empirical-rule-and-chebysh.html#fwk-shafer-ch02_s05_s01_f02).
]

---

class: center middle

# More Practice

---

## Practice: Change of Measures on Transformation of Data

A teacher decides to curve the final exam by adding 10 points for each student. Which of
the following statistics will NOT change:  
A. median,   B. mean,   C. interquartile range,   D. standard deviation?  
**Please explain your conclusion.**

---

## Practice: Understand Standard Deviation From Graphs

Which distribution of data has the SMALLEST standard deviation? Please explain your conclusion.

.center[
![Distributions with different standard deviation](data:image/png;base64,#Figures/SD-Pic.png)
]

---
class: center, middle

# Lab Instruction in Excel

---

## Mean, Median, Quartiles and Standard Deviation

- To find the median, you may use the function `MEDIAN()`.

- To find quartiles, you may use the function `QUARTILE.EXC()`.
  
  **Note:** this function calculates first and third quartiles with 25% and 75% weights. The results may be different from the results calculated by hand discussed in this course.

- To find the mean, you may use the function `AVERAGE()`.

- To find the **population** standard deviation, you may use the function `STDEV.P()`.

- To find the **sample** standard deviation, you may use the function `STDEV.S()`.
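
The note about `QUARTILE.EXC()` can be illustrated outside Excel. In the sketch below, Python's `statistics.quantiles` with `method="exclusive"` interpolates at position `$(n+1)p$`; it is assumed here, not verified against Excel, that `QUARTILE.EXC()` follows the same rule. The by-hand values use the median-of-halves method from this course, applied to the Oscar winner ages.

```python
from statistics import median, quantiles

ages = [31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42,
        43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76]

xs = sorted(ages)
half = len(xs) // 2

# Quartiles by hand: medians of the lower half, the whole set, and the upper half.
by_hand = (median(xs[:half]), median(xs), median(xs[len(xs) - half:]))

# Interpolated quartiles at positions (n + 1) * p for p = 0.25, 0.5, 0.75.
interpolated = tuple(quantiles(xs, n=4, method="exclusive"))

print(by_hand)       # (37.5, 42.5, 49.5)
print(interpolated)  # (37.25, 42.5, 50.25)
```

The medians agree, while the first and third quartiles differ slightly between the two methods.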

---

## How to Create a Boxplot in Excel

- Select your data—either a single data series, or multiple data series.

- Click `Insert` > `Insert Statistic Chart` > `Box and Whisker` to create a boxplot.
  
For more information, see [Create a box and whisker chart in Excel 365](https://support.microsoft.com/en-us/office/create-a-box-and-whisker-chart-62f4219f-db4b-4754-aca8-4743f6190f0d).

---

## Lab Practice

Consider the following sample that consists of speeds of 20 cars.
.center[
19, 4, 17, 22, 23, 8, 20, 19, 10, 10, 13, 13, 15, 12, 20, 14, 9, 20, 12, 11
]

1. Use Excel to find the mean, median, quartiles and standard deviation of the sample.
2. Create a boxplot for the sample.
