Summarizing Quantitative Data Graphically

MA336 Statistics

Fei Ye

Department of Mathematics and Computer Science

July 2022

Learning goals

  • Create and interpret graphs (dot plots, pie charts, histograms) as a means of summarizing and communicating data meaningfully.

  • Identify the shape of a distribution (right-skewed, left-skewed, symmetric, or uniform).

Summarizing Quantitative Data Graphically

Distribution of Quantitative Data

  • In data analysis, one goal is to describe the pattern (known as the distribution) of a variable in the data set and to create a useful summary of the set.

  • To describe patterns in data, we use descriptions of shape, center, and spread. We also describe exceptions to the pattern. We call these exceptions outliers.

Concepts used in the description of a distribution

Dot Plots

  • A dot plot includes all values from the data set, with one dot for each occurrence of an observed value from the set.

Example: The data set contains the petal lengths of 15 iris flowers. Create a dot plot to describe the distribution of petal lengths.

1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2

Solution: For each number in the data set, we draw a dot. We stack dots of the same value from the bottom up.
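The stacking procedure can be sketched in a few lines of Python. This is only an illustration that prints a text version of the dot plot; the function name text_dot_plot is made up for this sketch, and the axis lists the distinct values in order rather than drawing them to scale.

```python
from collections import Counter

# Petal lengths from the example above.
petal_lengths = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5,
                 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2]


def text_dot_plot(values):
    """Return a text dot plot: one column per distinct value, dots stacked upward."""
    counts = Counter(values)          # number of dots for each distinct value
    xs = sorted(counts)               # axis labels, smallest to largest
    tallest = max(counts.values())
    rows = []
    # Build the rows from the top of the tallest stack down to the axis.
    for level in range(tallest, 0, -1):
        row = " ".join(" * " if counts[x] >= level else "   " for x in xs)
        rows.append(row.rstrip())
    rows.append(" ".join(f"{x:3}" for x in xs))   # axis row
    return "\n".join(rows)


print(text_dot_plot(petal_lengths))
```

The tallest stack sits over 1.4, which occurs six times, and the stacks taper off on both sides.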

Practice: Heights Of Cherry Trees

The data set contains the heights of 20 black cherry trees. Create a dot plot to describe the distribution of the heights.

64, 69, 71, 72, 74, 74, 75, 76, 76, 77, 78, 80, 80, 80, 80, 81, 82, 85, 86, 87

Histograms

  • A histogram divides values of a variable into equal-sized intervals called bins (classes in some books) and uses a rectangular bar to show the frequency (count) of observations in each interval.

  • A frequency distribution is a table that lists the bins together with their frequencies and/or relative frequencies. A relative frequency is a proportion (or percentage) defined by the formula \(\text{Relative frequency}=\frac{\text{Class frequency}}{\text{Sample size}}\).

  • Each bin has a lower bin limit, which is the left endpoint of the interval, and an upper bin limit, which is the right endpoint of the interval.

  • The bin width is the distance between the lower (or upper) bin limits of two consecutive bins.

  • The difference between the maximum and the minimum data entries is called the range.

  • The midpoint of a bin is half the sum of the lower and upper limits of the bin.

Dot plots work well with small data sets because each distinct value acts as a bin that contains all entries with that value.

Example: Histogram of mpg (1 of 2)

The following data set shows the mpg (miles per gallon) of 30 cars. Construct a frequency table and frequency histogram for the data set using \(7\) bins.

21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7

Solution:

  • Find the maximum, minimum and range of the data set. In this example, the minimum is \(10.4\), the maximum is \(33.9\), and the range is \(33.9-10.4=23.5\).

  • Find the bin width, which can be taken as a number between \(\frac{\mathrm{range}}{k}\) and \(\frac{\mathrm{range}}{k-1}\), where \(k\) is the number of bins. In this case, since \(\frac{23.5}{7}\approx 3.357\) and \(\frac{23.5}{7-1}\approx 3.917\), we can take the bin width as \(3.4\).

  • Choose a convenient starting point as the first lower bin limit. The point should be less than or equal to the minimum and chosen so that no data value equals a bin limit, except possibly the first lower bin limit or the last upper bin limit. For example, in this data set, we may start with 10.35, then add the bin width repeatedly to get all lower bin limits: 10.35, 13.75, 17.15, 20.55, 23.95, 27.35, 30.75.

Example: Histogram of mpg (2 of 2)

Solution (continued):

  • The upper bin limit can be taken as the next lower bin limit. In this example, the upper bin limits can be taken as 13.75, 17.15, 20.55, 23.95, 27.35, 30.75, 34.15.

  • Record counts in bins and create the frequency distribution table.

  • Graph the histogram using the frequency distribution table.

Bin Frequency
10.35-13.75 3
13.75-17.15 7
17.15-20.55 7
20.55-23.95 6
23.95-27.35 3
27.35-30.75 2
30.75-34.15 2
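As a cross-check of the table, the counting step can be sketched in Python. The starting point 10.35 and the bin width 3.4 are the choices made in the solution; the variable names are illustrative only, and the relative-frequency column is an addition that uses the formula frequency divided by sample size.

```python
# mpg values from the example.
mpg = [21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2,
       17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9,
       21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7]

k = 7             # number of bins
bin_width = 3.4   # chosen between range/k and range/(k-1)
start = 10.35     # first lower bin limit, slightly below the minimum 10.4

# Lower limits: start, start + width, ...; rounding hides floating-point noise.
lower_limits = [round(start + i * bin_width, 2) for i in range(k)]
upper_limits = [round(lo + bin_width, 2) for lo in lower_limits]

# No data value equals a bin limit, so the choice of < or <= does not matter here.
frequencies = [sum(lo <= x < hi for x in mpg)
               for lo, hi in zip(lower_limits, upper_limits)]
relative = [f / len(mpg) for f in frequencies]

for lo, hi, f, r in zip(lower_limits, upper_limits, frequencies, relative):
    print(f"{lo:.2f}-{hi:.2f}  {f:2d}  {r:.3f}")
```

The frequencies come out as 3, 7, 7, 6, 3, 2, 2, which sum to the sample size 30.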

Some Remarks on Histogram

  • Avoid bin widths that are too large or too small. See Histogram 2 of 4 in Concepts in Statistics for an interactive demonstration.

  • When the bin width is not given, we may first determine the number of bins. If the number of bins is \(k\), then we choose as the bin width a number, with the same number of decimal places as the data or one more, that is greater than \(\frac{\text{range}}{k}\) but no more than \(\frac{\text{range}}{k-1}\). This keeps the last upper bin limit from being much larger than the maximum.

    To determine the number of bins, there are some "rules of thumb". For example, the Rice rule takes the number of bins to be \(k = \lceil 2n^{1/3}\rceil\), where \(n\) is the sample size and \(\lceil 2n^{1/3}\rceil\) is \(2n^{1/3}\) rounded up to the nearest integer.

    See the Statistics How To page for more discussion on choosing bin width.

  • The convenient starting point should not be much smaller than the minimum. The starting point together with the bin width affects the shape of the histogram, so it is worth experimenting with different choices of both.

  • The area of a bar represents the relative frequency for the bin. There should be no space between any two bars.
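The Rice rule and the bounds on the bin width can be checked with a short Python sketch. The function names rice_bins and bin_width_bounds are hypothetical helpers written for this note; the numbers are those of the mpg example (sample size 30, minimum 10.4, maximum 33.9).

```python
import math


def rice_bins(n):
    """Number of bins by the Rice rule: 2 * n^(1/3), rounded up."""
    # Caution: for perfect cubes, floating-point error in n ** (1 / 3) can
    # matter, so it is worth double-checking such cases by hand.
    return math.ceil(2 * n ** (1 / 3))


def bin_width_bounds(data_min, data_max, k):
    """Return (range / k, range / (k - 1)); the bin width should lie
    above the first value and at or below the second."""
    data_range = data_max - data_min
    return data_range / k, data_range / (k - 1)


k = rice_bins(30)
low, high = bin_width_bounds(10.4, 33.9, k)
print(k, round(low, 3), round(high, 3))   # 7 3.357 3.917
```

For the mpg data the rule gives 7 bins, and the bin width 3.4 used in the example falls between the two bounds.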

Practice: Petal lengths of irises

The following data set shows the petal lengths of 20 irises. Construct a frequency table and frequency histogram for the data set using 6 bins.

1.4, 5.4, 1.2, 4.5, 6.1, 1.5, 4.7, 1.4, 5.6, 5.2, 1.3, 6.3, 5.1, 5.6, 5, 6.7, 1.4, 1.6, 1.5, 1.5

Common Descriptions of the Shape of a Distribution

  • Right skewed (or reverse \(J\)-shaped): A right-skewed distribution has a lot of data at lower variable values with smaller amounts of data at higher variable values. (Example: the mpg histogram example.)

  • Left skewed (or \(J\)-shaped): A left skewed distribution has a lot of data at higher variable values with smaller amounts of data at lower variable values.

  • Symmetric with a central peak (or bell-shaped): A central peak with a tail in both directions. A bell-shaped distribution has a lot of data in the center with smaller amounts of data tapering off in each direction. (Example: the petal length dot plot example.)

  • Uniform: A rectangular shape, the same amount of data for each variable value.

  • For examples of left skewed and uniform distributions, please see the example in Dotplot 2 of 2 in Concepts in Statistics.

Practice: Shapes of Distributions

Statistics are used to compare and sometimes identify authors. The following lists show a simple random sample that compares the letter counts for three authors.

Terry: 7, 9, 3, 3, 3, 4, 1, 3, 2, 2

Davis: 3, 4, 4, 4, 1, 4, 5, 2, 3, 1

Maris: 2, 3, 4, 4, 4, 6, 6, 6, 8, 3

Create a dot plot for each sample and describe the shape of the distribution of each sample.

Measures of Center

  • Mean: The mean is the average, that is, the sum of all data values divided by the number of data values.

  • Median: The median is a value that separates the data into a lower half and an upper half. To calculate the median, sort the data first. If the number of data values is odd, the median is the middle value. Otherwise, the median is the mean of the middle two values.

  • Mode: The mode is the value that occurs most often in the data set.

  • Use the mean as a measure of center only for distributions that are reasonably symmetric with a central peak. When outliers are present, the mean is not a good choice.

  • Use the median as a measure of center for all other cases.

  • We need to use a graph to determine the shape of the distribution. So graph the data first.
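The three measures can be computed with Python's standard statistics module. The sketch below reuses the 20 black cherry tree heights from the dot plot practice purely as sample input; it does not replace graphing the data to decide which measure is appropriate.

```python
import statistics

# Heights of 20 black cherry trees (from the dot plot practice).
heights = [64, 69, 71, 72, 74, 74, 75, 76, 76, 77,
           78, 80, 80, 80, 80, 81, 82, 85, 86, 87]

mean = statistics.mean(heights)       # sum of the values divided by their count
median = statistics.median(heights)   # even count: mean of the two middle values
mode = statistics.mode(heights)       # value that occurs most often

print(mean, median, mode)   # 77.35 77.5 80
```

Here the count is even, so the median is the mean of the 10th and 11th sorted values, 77 and 78.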

Mean and Median for Distributions in Different Shapes

Practice: Choose an appropriate measure of center

A student survey was conducted at a major university. The following histogram shows the distribution of the number of alcoholic beverages consumed in a typical week.

  1. What is the typical number of drinks a student has during a week?
  2. Do the data suggest that drinking is a problem at this university?

    The red line marks the median and the blue line marks the mean.

Summarizing Categorical Data Graphically

Pie Charts

  • A pie chart is a circle divided into sectors, where each sector represents a category and the area of each sector is proportional to the frequency of that category.
    • The frequency of a category is the number of occurrences of elements in the category.
    • The proportion of a frequency to the size of the population or the sample is also called the relative frequency.

Example: The counts of majors of 100 students in a sample are shown in the table. Use a pie chart to organize the data.

Major Frequency (Counts)
Art 30
Engineering 50
Science 20

Pie Charts (2/2)

Solution:

  • Find the relative frequency (percent) of each major, as shown in the following table.
Major Frequency Relative Frequency
Art 30 30%
Engineering 50 50%
Science 20 20%
Total 100 100%
  • The following shows the pie chart.

    A pie chart
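The relative frequencies in the table, and the central angle of each sector, can be reproduced with a short Python sketch. The angle calculation is an addition not stated on the slide: it assumes each sector's angle is its relative frequency times 360 degrees, which follows from the sector area being proportional to the frequency. The dictionary name sectors is illustrative.

```python
# Counts of majors from the example table.
counts = {"Art": 30, "Engineering": 50, "Science": 20}
total = sum(counts.values())

sectors = {}
for major, count in counts.items():
    relative_frequency = count / total    # proportion of the sample
    angle = count * 360 / total           # central angle of the sector, in degrees
    sectors[major] = (relative_frequency, angle)
    print(f"{major}: {relative_frequency:.0%}, {angle:.0f} degrees")
```

The angles (108, 180, and 72 degrees) add up to a full circle, just as the relative frequencies add up to 100%.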

Practice: Passengers on Titanic

The following data table summarizes the passengers on the Titanic. Use a pie chart to describe the data table.

Class Passengers
1st 325
2nd 285
3rd 706
Crew 885

Lab Instruction in Excel

Create Frequency Tables (1 of 2)

In Excel, to create a frequency table for a data array, we need a bin array, which is used to split the data set into smaller intervals. The values in a bin array in Excel are the (upper) boundaries of the intervals. With a data array and a bin array, we can use the Excel function FREQUENCY(data_array, bins_array) to create a frequency table.

Suppose the data set is in column A and the bin array is in column B. Here is how to create a frequency table using the function FREQUENCY(data_array, bins_array):

  1. In column C, in the cell to the right of the smallest value of the bin array, enter =FREQUENCY(
  2. select the data values
  3. in the formula bar, enter the symbol comma ,
  4. select the bin array
  5. in the formula bar, enter ).

Press Enter to get the frequency table.

Create Frequency Tables (2 of 2)

Remark 1: In this formula, the values in the bin array should be the first \(k-1\) upper class limits (or the last \(k-1\) lower class limits), where \(k\) is the number of bins. For example, if the bin array consists of 30, 40, and 50, then Excel uses the bins \((-\infty,30]\), \((30,40]\), \((40, 50]\), \((50, \infty)\).
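The interval convention in Remark 1 can be imitated in Python to see where a boundary value lands. This is a sketch of the convention only, not Excel's implementation; the function name frequency and the sample data are made up for illustration.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values in (-inf, b1], (b1, b2], ..., (bk, inf) for sorted bins b1..bk."""
    bins = sorted(bins)
    counts = [0] * (len(bins) + 1)
    for x in data:
        # bisect_left returns the index of the first bin value >= x, so a
        # value equal to a bin boundary is counted in the interval that
        # ends at that boundary.
        counts[bisect_left(bins, x)] += 1
    return counts


# Hypothetical data chosen so that some values sit exactly on the boundaries.
print(frequency([25, 30, 31, 40, 45, 50, 51, 60], [30, 40, 50]))   # [2, 2, 2, 2]
```

Note that 30, 40, and 50 are each counted in the bin they close, and the result has one more entry than the bin array.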

Remark 2: In older versions of Excel, you may have to highlight the cells for the frequencies first, then enter the FREQUENCY function, and then press Ctrl+Shift+Enter (or Cmd+Shift+Enter on Mac).

Creating Charts in Excel

Excel has many built-in chart functions. To create a chart:

  1. Select the data array/table
  2. Under the Insert tab, click on an appropriate chart in the Charts command set.

The appearance of a chart can be changed after it is created.

Create Histogram Charts in Excel (1 of 3)

  1. Select the data

  2. On the Insert tab, in the Charts group, from the Insert Statistic Chart dropdown list, select Histogram:

    Note: The histogram contains a special first bin which always contains the smallest number. This is different from many textbooks.

Create Histogram Charts in Excel (2 of 3)

Formatting a histogram chart is similar to formatting a pie chart. For example, you can change the bin width from Format Axis.

  1. Right-click on the horizontal axis and choose Format Axis in the popup menu:

  2. In the Format Axis pane, on the Axis Options tab, you may try different options for bins.

Create Histogram Charts in Excel (3 of 3)

Remark:

  • Excel uses a different convention to create histograms. The first bin is a closed interval and the other bins are left-open, right-closed intervals.

  • Select the Overflow bin checkbox and type a number. All values above this number will be added to the last bin.

  • Select the Underflow bin checkbox and type a number. All values less than or equal to this number will be added to the first bin.

  • Histograms show the shape and the spread of quantitative data. For categorical data, discrete by its definition, bar charts are usually used to represent category frequencies.

Create Histogram Charts in Excel using the Analysis ToolPak

Suppose your data set is in Column A in Excel.

  • In cell B1, put the first lower bin limit, which is a number slightly less than the minimum with more decimal places than the data values.

  • Create upper bin limits in column C.

  • In the Data menu, look for the Data Analysis ToolPak (if it is not there, go to File > Options > Add-ins > Manage Excel Add-ins and check Analysis ToolPak). In the popup window, find Histogram.

  • In the input range, select your data set. In the bin range, select the upper bin limits.

  • Check Chart Output and hit OK. You will see the frequency table and histogram in Sheet 2.

  • Change the gap between bars. Right click a bar and choose Format Data Series... and change the Gap Width to 2% or 1%.

How to Create a Dotplot in Excel

  • If you have a raw data set, follow the same procedure as for creating a histogram, but with a bin width equal to the precision of the data. For example, if your data set consists of integers, then choose 1 as the bin width.

  • Change the format of bars in the histogram.

    • Right click a bar and select Format Data Series....

    • Find Fill & Line and select both Picture or texture fill and Stack and Scale with.

    • Click the Online... button, type dot in the Bing search box, and press Enter.

    • Select a picture you like and you will get a dot-plot.

Lab Practice

Use Excel to complete the following tasks:

  1. Create a random sample of 30 two-digit integers.

  2. Create a histogram with 6 bins for the sample.

  3. Describe the shape of the distribution of the sample of 30 two-digit integers.
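To check the Excel work, the first two tasks can be mirrored in Python. This is a rough sketch under several assumptions: the seed 336 is arbitrary, the integers are drawn with replacement, and the bin width is taken as the smallest integer with six bins covering the range, which may fall outside the bound \(\frac{\text{range}}{k-1}\) discussed earlier. The starting point half a unit below the minimum keeps every integer off the bin limits.

```python
import random

rng = random.Random(336)   # fixed, arbitrary seed so the sketch is reproducible
sample = [rng.randint(10, 99) for _ in range(30)]   # 30 two-digit integers

k = 6
data_range = max(sample) - min(sample)
bin_width = data_range // k + 1   # smallest integer width with k * width > range
start = min(sample) - 0.5         # half-unit offset: no integer equals a bin limit

lower_limits = [start + i * bin_width for i in range(k)]
frequencies = [sum(lo < x < lo + bin_width for x in sample)
               for lo in lower_limits]

for lo, f in zip(lower_limits, frequencies):
    print(f"{lo}-{lo + bin_width}: {'*' * f}")   # one star per observation
```

Each printed row is one bin, so reading down the rows gives a sideways view of the histogram's shape.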
