Create and interpret graphs (dot plots, pie charts, histograms) as a means of summarizing and communicating data meaningfully.
Identify the shape of a distribution (right-skewed, left-skewed, symmetric, or uniform).
In data analysis, one goal is to describe patterns (known as the distribution) of the variable in the data set and create a useful summary about the set.
To describe patterns in data, we use descriptions of shape, center, and spread. We also describe exceptions to the pattern. We call these exceptions outliers.
Example: The data set contains the petal lengths of 15 iris flowers. Create a dot plot to describe the distribution of petal lengths.
1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2
Solution: For each number in the data set, we draw a dot. We stack dots of the same value from the bottom up.
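The stacking idea in the solution can be sketched in Python. This is only an illustration, not part of the original example: the helper names `dot_counts` and `text_dot_plot` are made up here, and the plot is drawn sideways (one row per value) so that it fits in plain text.

```python
from collections import Counter

# Petal lengths copied from the example above.
petal_lengths = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,
                 1.5, 1.6, 1.4, 1.1, 1.2]


def dot_counts(data):
    """Map each distinct value to the number of dots stacked on it."""
    return dict(sorted(Counter(data).items()))


def text_dot_plot(data):
    """Sideways dot plot: one row per distinct value, one '*' per entry."""
    rows = [f"{value:>4} | " + "*" * count
            for value, count in dot_counts(data).items()]
    return "\n".join(rows)


print(text_dot_plot(petal_lengths))
```

The longest row is the one for 1.4, which is the peak of the distribution.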
The data set contains the heights of 20 Black Cherry Trees. Create a dot plot to describe the distribution of the heights.
64, 69, 71, 72, 74, 74, 75, 76, 76, 77, 78, 80, 80, 80, 80, 81, 82, 85, 86, 87
A histogram divides values of a variable into equal-sized intervals called bins (classes in some books) and uses a rectangular bar to show the frequency (count) of observations in each interval.
A frequency distribution is a table which contains bins, frequencies and/or relative frequencies, which are proportions (percentages) defined by the formula \(\text{relative frequency} = \frac{\text{frequency}}{\text{sample size}}\).
Each bin has a lower bin limit, which is the left endpoint of the interval, and an upper bin limit, which is the right endpoint of the interval.
The bin width is the distance between the lower (or upper) bin limits of two consecutive bins.
The difference between the maximum and the minimum data entries is called the range.
The midpoint of a bin is half the sum of the lower and upper limits of the bin.
Dot plots work well with small data sets because each distinct value acts as a bin that contains all entries with the same value.
The following data set shows the mpg (miles per gallon) of 30 cars. Construct a frequency table and a frequency histogram for the data set using \(7\) bins.
21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7
Solution:
Find the maximum, minimum and range of the data set. In this example, the minimum is \(10.4\), the maximum is \(33.9\), and the range is \(33.9-10.4=23.5\).
Find the bin width, which can be taken as a number between \(\frac{\mathrm{range}}{k}\) and \(\frac{\mathrm{range}}{k-1}\), where \(k\) is the number of bins. In this case, since \(\frac{23.5}{7}\approx 3.357\) and \(\frac{23.5}{7-1}\approx 3.917\), we can take the bin width as \(3.4\).
Choose a convenient starting point as the first lower bin limit. The point should be a value less than or equal to the minimum, chosen so that no data value equals a bin limit other than the first lower bin limit and the last upper bin limit. For example, in this data set, we may start with 10.35 and then repeatedly add the bin width to get all lower bin limits: 10.35, 13.75, 17.15, 20.55, 23.95, 27.35, 30.75.
Solution (continued):
The upper bin limit can be taken as the next lower bin limit. In this example, the upper bin limits can be taken as 13.75, 17.15, 20.55, 23.95, 27.35, 30.75, 34.15.
Record counts in bins and create the frequency distribution table.
Graph the histogram using the frequency distribution table.
Bin | Frequency |
---|---|
10.35-13.75 | 3 |
13.75-17.15 | 7 |
17.15-20.55 | 7 |
20.55-23.95 | 6 |
23.95-27.35 | 3 |
27.35-30.75 | 2 |
30.75-34.15 | 2 |
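The steps of the solution can be checked with a short Python sketch. The variable names are chosen here for illustration, and the bin width 3.4 and starting point 10.35 are simply the choices made in the solution above, not the only valid ones.

```python
# mpg values copied from the example above.
mpg = [21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2,
       17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9,
       21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7]

k = 7          # number of bins
width = 3.4    # chosen between range/k and range/(k-1)
start = 10.35  # first lower bin limit, slightly below the minimum

# Step 1: the range (rounded to hide floating-point noise).
data_range = round(max(mpg) - min(mpg), 1)

# Step 2: the bin width should lie between these two bounds.
low_bound, high_bound = data_range / k, data_range / (k - 1)

# Steps 3-4: lower limits; each upper limit is the next lower limit.
lower_limits = [round(start + i * width, 2) for i in range(k)]
upper_limits = [round(lo + width, 2) for lo in lower_limits]

# Step 5: count the values in each bin. No value equals a bin limit
# here, so it does not matter which endpoint is treated as closed.
counts = [sum(lo <= x < hi for x in mpg)
          for lo, hi in zip(lower_limits, upper_limits)]
rel_freq = [c / len(mpg) for c in counts]

for lo, hi, c, r in zip(lower_limits, upper_limits, counts, rel_freq):
    print(f"{lo:.2f}-{hi:.2f} | {c} | {r:.1%}")
```

The printed frequencies agree with the table, and the relative frequency column is each count divided by the sample size of 30.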
Avoid bin widths that are too large or too small. See Histogram 2 of 4 in Concepts in Statistics for an interactive demonstration.
When the bin width is not given, we may first determine the number of bins. If the number of bins is \(k\), then we choose as the bin width a number with the same number of decimal places as the data, or one more, that is greater than \(\frac{\text{range}}{k}\) but no more than \(\frac{\text{range}}{k-1}\). This keeps the last upper bin limit from being much bigger than the maximum.
To determine the number of bins, there are some "rules of thumb". For example, the Rice rule takes the bin number \(k = \lceil 2n^{1/3}\rceil\), where \(n\) is the number of data values and \(\lceil 2n^{1/3}\rceil\) is \(2n^{1/3}\) rounded up to the next integer.
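As a quick check of the Rice rule, here is a minimal Python sketch. The function name `rice_bins` is made up for this illustration, and the rounding step before taking the ceiling is an added safeguard against floating-point error for perfect cubes, not part of the rule itself.

```python
import math


def rice_bins(n):
    """Number of bins suggested by the Rice rule: ceil(2 * n^(1/3))."""
    # Round first so that tiny floating-point errors (for example when
    # n is a perfect cube) do not push the ceiling up or down by one.
    return math.ceil(round(2 * n ** (1 / 3), 9))


print(rice_bins(30))  # sample size of the mpg example
print(rice_bins(20))  # sample size of the 20-iris exercise
```

For the 30 mpg values the rule suggests 7 bins, and for a sample of 20 it suggests 6 bins, which matches the numbers of bins used in these slides.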
See the Statistics How To page for more discussion on choosing bin width.
The convenient starting point should not be much smaller than the minimum. The starting point together with the bin width affects the shape of the histogram, so it is better to experiment with different choices of the starting point and the bin width.
The area of a bar is proportional to the relative frequency of the bin. There should be no space between any two bars.
The following data set shows the petal lengths of 20 irises. Construct a frequency table and frequency histogram for the data set using 6 bins.
1.4, 5.4, 1.2, 4.5, 6.1, 1.5, 4.7, 1.4, 5.6, 5.2, 1.3, 6.3, 5.1, 5.6, 5, 6.7, 1.4, 1.6, 1.5, 1.5
Right skewed (or reverse \(J\)-shaped): A right-skewed distribution has a lot of data at lower variable values with smaller amounts of data at higher variable values. (Example: the histogram example.)
Left skewed (or \(J\)-shaped): A left-skewed distribution has a lot of data at higher variable values with smaller amounts of data at lower variable values.
Symmetric with a central peak (or bell-shaped): A central peak with a tail in both directions. A bell-shaped distribution has a lot of data in the center with smaller amounts of data tapering off in each direction. (Example: the petal length example.)
Uniform: A rectangular shape, the same amount of data for each variable value.
For examples of left-skewed and uniform distributions, please see the example in Dotplot 2 of 2 in Concepts in Statistics.
Statistics are used to compare and sometimes identify authors. The following lists show a simple random sample that compares the letter counts for three authors.
Terry: 7, 9, 3, 3, 3, 4, 1, 3, 2, 2
Davis: 3, 4, 4, 4, 1, 4, 5, 2, 3, 1
Maris: 2, 3, 4, 4, 4, 6, 6, 6, 8, 3
Create a dot plot for each sample and describe the shape of the distribution of each sample.
Mean: The mean is the average, that is, the sum of all data values divided by the number of data values.
Median: The median is a value that separates the data into the lower half and the upper half. To calculate the median, sort the data first. If the number of data values is odd, the median is the middle value. Otherwise, the median is the mean of the middle two values.
Mode: The mode is the value that occurs most often in the data set.
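The three measures can be computed with Python's standard `statistics` module. The sketch below uses the Terry sample from the authors exercise purely as sample input; the variable names are illustrative.

```python
import statistics

# Letter counts for Terry, copied from the exercise above.
terry = [7, 9, 3, 3, 3, 4, 1, 3, 2, 2]

mean = statistics.mean(terry)      # sum of the values divided by 10
median = statistics.median(terry)  # mean of the two middle sorted values
mode = statistics.mode(terry)      # value that occurs most often

print(sorted(terry))
print(mean, median, mode)
```

Here the mean (3.7) is larger than the median (3) because the two large values 7 and 9 pull the mean up, while the median is unaffected by them.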
Use the mean as a measure of center only for distributions that are reasonably symmetric with a central peak. When outliers are present, the mean is not a good choice.
Use the median as a measure of center for all other cases.
We need to use a graph to determine the shape of the distribution. So graph the data first.
A student survey was conducted at a major university. The following histogram shows the distribution of alcoholic beverages consumed in a typical week.
Example: The counts of majors of 100 students in a sample are shown in the table. Use a pie chart to organize the data.
Major | Frequency (Counts) |
---|---|
Art | 30 |
Engineering | 50 |
Science | 20 |
Solution:
Major | Frequency | Relative Frequency |
---|---|---|
Art | 30 | 30% |
Engineering | 50 | 50% |
Science | 20 | 20% |
Total | 100 | 100% |
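The relative frequencies in the table determine the size of each slice of the pie chart: a slice's central angle is its relative frequency times 360 degrees. The following Python sketch works this out for the table above; the variable names are chosen here for illustration.

```python
# Counts copied from the table above.
majors = {"Art": 30, "Engineering": 50, "Science": 20}

total = sum(majors.values())

# Relative frequency = frequency / sample size.
relative = {major: count / total for major, count in majors.items()}

# Central angle of each pie slice, in degrees.
angles = {major: 360 * count / total for major, count in majors.items()}

for major in majors:
    print(f"{major}: {relative[major]:.0%}, {angles[major]:.0f} degrees")
```

The angles add up to 360 degrees, so the slices fill the whole circle. The same counts could be passed to a plotting tool (Excel, or a library such as matplotlib) to draw the chart itself.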
The following data table summarizes the passengers on the Titanic. Use a pie chart to describe the data table.
Class | Passengers |
---|---|
1st | 325 |
2nd | 285 |
3rd | 706 |
Crew | 885 |
In Excel, to create a frequency table for a data array, we need a bin array, which is used to split the data set into smaller intervals. The values in a bin array in Excel are the (upper) boundaries of the intervals. With a data array and a bin array, we can use the Excel function FREQUENCY(data_array, bins_array) to create a frequency table.
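Suppose the data set is in column A and the bin array is in column B. Here is how to create a frequency table using the function FREQUENCY(data_array, bins_array):
In an empty cell, type =FREQUENCY(, select the data array in column A, type a comma, select the bin array in column B, and then type ).
Hit Enter, and you will get a frequency table.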
Remark 1: In this formula, the values in a bin array should be the first \(k-1\) upper class limits (or the last \(k-1\) lower class limits), where \(k\) is the number of bins. In Excel, if the bin array consists of 30, 40, and 50, then the bins will be \((-\infty,30]\), \((30,40]\), \((40, 50]\), \((50, \infty)\).
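The binning convention in Remark 1 can be imitated in Python to see how boundary values are counted. This is a sketch of the convention as described in the remark, not Excel's actual implementation; the function name `frequency` and the sample data are hypothetical.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count data into (-inf, b1], (b1, b2], ..., (bk, inf)."""
    bins = sorted(bins)
    counts = [0] * (len(bins) + 1)
    for x in data:
        # bisect_left returns the index of the first bin value >= x,
        # so a value equal to a bin value lands in the bin it closes.
        counts[bisect_left(bins, x)] += 1
    return counts


# Hypothetical data chosen to include values on the bin boundaries.
sample = [25, 30, 31, 40, 45, 50, 51, 60]
print(frequency(sample, [30, 40, 50]))
```

With three bin values the result has four counts, one for each of the intervals listed in the remark. The values 30, 40, and 50 are counted in the bins whose upper boundary they equal.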
Remark 2: In older versions of Excel, you may have to highlight the cells for the frequencies first, then enter the FREQUENCY function, and then hit Ctrl+Shift+Enter (or Cmd+Shift+Enter on Mac).
Excel has many built-in chart functions. To create a chart, go to the Insert tab and click on an appropriate chart in the Charts command set. The appearance of a chart can be changed after it is created.
Select the data.
On the Insert tab, in the Charts group, from the Insert Statistic Chart dropdown list, select Histogram.
Note: The histogram contains a special first bin which always contains the smallest number. This is different from many textbooks.
Formatting a histogram chart is similar to formatting a pie chart. For example, you can change the bin width from Format Axis.
Right-click on the horizontal axis and choose Format Axis in the popup menu.
In the Format Axis pane, on the Axis Options tab, you may try different options for bins.
Remark:
Excel uses a different convention to create a histogram. The first bin is a closed interval, and the other bins are left-open, right-closed intervals.
Select the Overflow bin checkbox and type a number; all values above this number will be added to the last bin.
Select the Underflow bin checkbox and type a number; all values below or equal to this number will be added to the first bin.
Analysis ToolPak
Suppose your data set is in column A in Excel.
In the cell B1, put the first lower bin limit, which is a number slightly less than the minimum with more decimal places than the data values.
Create upper bin limits in column C.
In the Data menu, look for the Data Analysis ToolPak (if it is not there, go to File > Options > Add-ins > Manage Excel Add-ins and check Analysis ToolPak). In the popup window, find Histogram.
In the input range, select your data set. In the bin range, select the upper bin limits.
Check Chart Output and hit OK. You will see the frequency table and histogram in Sheet 2.
Change the gap between bars. Right-click a bar, choose Format Data Series..., and change the Gap Width to 2% or 1%.
If you have a raw data set, follow the same procedure as for creating a histogram, but with a bin width equal to the precision of the data. For example, if your data set consists of integers, then choose 1 as the bin width.
Change the format of the bars in the histogram.
Right-click a bar and select Format Data Series....
Find Fill & Line and select both Picture or texture fill and Stack and Scale with.
Click the Online... button, type dot in the Bing search box, and hit Enter.
Select a picture you like and you will get a dot-plot.
Use Excel to complete the following tasks:
Create a random sample of 30 two-digit integers.
Create a histogram with 6 bins for the sample.
Describe the shape of the distribution of the sample of 30 two-digit integers.