Frequency distribution of continuous data

If the data set is continuous, we cannot simply count how often each possible value occurs. Well, we could, but because every value within a range is possible, most of these values will not be in the data set, and even if one is, it will probably not occur a second or third time.

For example, let's take the continuous data set that results from measuring the weight of $100$ M&Ms (in $\unit{g}$):

$$50.34, 60.2, 54.271, 50.331, \ldots$$

Think of the creation of the M&M's as a random process caused by the machine producing these M&M's. Each M&M will look a bit different, weigh a bit differently, and so on, due to small imperfections of the machine. So we have the random experiment "produce an M&M", and we use the continuous random variable $X=$"weight of the M&M". We repeat the experiment $m=100$ times to produce the data set above. Clearly, it is highly unlikely that there is a second M&M with the exact weight $50.331$. So counting all M&M's with the weight $50.331$ will not help much in gaining insight into the weight distribution of the data set. The count will be $1$, as it is for all other M&M's in the data set.

A better approach is to ask how many M&M's are in a certain range, e.g. from $48-50 \unit{g}$, from $50-52 \unit{g}$, from $52-54 \unit{g}$, and so on. These ranges are called bins, and the width of the range is called the bin size, written $\Delta x$ (thus $\Delta x=\qty{2}{g}$).

So by counting how many M&M's fall into each bin, we might get the following frequency distribution of the data:

$$
\begin{array}{c|c|c|c}
\text{bin } i & \text{freq} & \text{rel freq } y_i & \text{density } d_i \\\hline
48-50 & 2 & 0.02 & 0.01\\
50-52 & 11 & 0.11 & 0.055\\
52-54 & 15 & 0.15 & 0.075\\
54-56 & 39 & 0.39 & 0.195\\
56-58 & 27 & 0.27 & 0.135\\
58-60 & 6 & 0.06 & 0.03\\
\end{array}
$$

Definition 1

For continuous data we are typically interested in the density, which is the relative frequency divided by the bin size, and tells us how densely the data points are arranged in a bin:

$$d_i = \frac{y_i}{\Delta x}$$
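The definition can be checked numerically against the M&M table. The following is a minimal sketch using only the frequencies listed in the table; the function name `densities` is chosen here for illustration and is not part of the course material.

```python
def densities(freqs, bin_width):
    """Return (relative frequencies y_i, densities d_i) for equal-width bins."""
    m = sum(freqs)                       # total number of data points
    rel = [f / m for f in freqs]         # y_i = freq_i / m
    dens = [y / bin_width for y in rel]  # d_i = y_i / Δx
    return rel, dens

# Frequencies of the bins 48-50, 50-52, ..., 58-60 from the table, Δx = 2 g
freqs = [2, 11, 15, 39, 27, 6]
rel, dens = densities(freqs, bin_width=2.0)

for f, y, d in zip(freqs, rel, dens):
    print(f"freq={f:3d}  rel freq={y:.2f}  density={d:.3f}")
```

The printed columns should agree with the relative frequency and density columns of the table, up to rounding.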

The graphical representation of the density is the histogram. In a histogram, the bins are plotted on the horizontal axis, and a bar is drawn above each bin with the same width as the bin and the height $d_i$.

Thus, we have

Theorem 1

The area of the bar, $d_i\cdot \Delta x$, is the relative frequency $y_i$ of the data points in bin $i$:

$$d_i\cdot \Delta x = \frac{y_i}{\Delta x}\cdot \Delta x = y_i$$

Warning

In books or websites you can find that the relative frequency in histograms is sometimes represented by the bar height, and the bar area has no meaning at all. In this course, the relative frequency in histograms is always represented by the bar area.

The choice of the bin size is quite important. If we choose the bin size too small, we will have a lot of bins containing no data, and some that contain perhaps one or two data points (see below, left). If we choose the bin size too big, we lose details about the distribution of the data points (see below, right). So how do we know which bin size to use? Typically, we find the bin size by trial and error: we try out different bin sizes until the resulting histogram is somewhere between these two extremes (see histogram above).
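The trial-and-error search can be supported by counting, for each candidate bin size, how many bins stay empty and how many points the fullest bin holds. The sketch below is only an illustration: the sample is synthetic (100 values drawn from a normal distribution with mean 55 and standard deviation 2.5, both assumed here, not taken from the text), and the helper `bin_counts` is a hypothetical name.

```python
import math
import random

# Assumption: a synthetic sample of 100 weights in g; NOT data from the text
rng = random.Random(1)
sample = [rng.gauss(55.0, 2.5) for _ in range(100)]

def bin_counts(data, bin_width):
    """Count data points per bin; the first bin starts at floor(min(data))."""
    start = math.floor(min(data))
    n_bins = int((max(data) - start) // bin_width) + 1
    counts = [0] * n_bins
    for x in data:
        counts[int((x - start) // bin_width)] += 1
    return counts

# Try a very small, a moderate, and a very large bin size
for width in (0.1, 2.0, 20.0):
    counts = bin_counts(sample, width)
    print(f"bin size {width:>4}: {len(counts)} bins, "
          f"{counts.count(0)} empty, largest count {max(counts)}")
```

A bin size that leaves many bins empty is probably too small, and one that puts nearly all points into one or two bins is probably too big.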

Exercise 1

Consider the continuous data set given below (heights of students in $\unit{cm}$):

$$162.12, 174.3, 166.62, 180.432, 177.37, 169.22, 156.66, 164.32, 150.23, 183.19, 167.41, 189.77$$

Draw three histograms using the bin sizes given below. Start at $\qty{150}{cm}$ and end at $\qty{190}{cm}$.

  1. $\Delta x=\qty{20}{cm}$
  2. $\Delta x=\qty{10}{cm}$
  3. $\Delta x=\qty{5}{cm}$

Solution

$\Delta x=\qty{20}{cm}$:

$$
\begin{array}{c|c|c|l}
\text{bin } i & \text{freq} & \text{rel freq } y_i & \text{density } d_i \\\hline
150-170 & 7 & 0.583 & 0.029\\
170-190 & 5 & 0.417 & 0.021\\
\end{array}
$$

$\Delta x=\qty{10}{cm}$:

$$
\begin{array}{c|c|c|l}
\text{bin } i & \text{freq} & \text{rel freq } y_i & \text{density } d_i \\\hline
150-160 & 2 & 0.167 & 0.0167\\
160-170 & 5 & 0.417 & 0.0417\\
170-180 & 2 & 0.167 & 0.0167\\
180-190 & 3 & 0.25 & 0.025\\
\end{array}
$$

$\Delta x=\qty{5}{cm}$:

$$
\begin{array}{c|c|c|l}
\text{bin } i & \text{freq} & \text{rel freq } y_i & \text{density } d_i \\\hline
150-155 & 1 & 0.083 & 0.0167\\
155-160 & 1 & 0.083 & 0.0167\\
160-165 & 2 & 0.167 & 0.0333\\
165-170 & 3 & 0.250 & 0.05\\
170-175 & 1 & 0.083 & 0.0167\\
175-180 & 1 & 0.083 & 0.0167\\
180-185 & 2 & 0.167 & 0.0333\\
185-190 & 1 & 0.083 & 0.0167\\
\end{array}
$$
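The three tables can also be produced programmatically from the raw heights. The following is a rough sketch rather than a reference implementation: the function name `frequency_table` is made up for this example, and it assumes bins of the form $[\text{left}, \text{right})$, with a value exactly equal to the end point counted in the last bin.

```python
heights = [162.12, 174.3, 166.62, 180.432, 177.37, 169.22,
           156.66, 164.32, 150.23, 183.19, 167.41, 189.77]

def frequency_table(data, start, end, bin_width):
    """Return rows (left, right, freq, rel freq, density) for equal-width bins."""
    n_bins = round((end - start) / bin_width)
    counts = [0] * n_bins
    for x in data:
        if not start <= x <= end:
            raise ValueError(f"{x} lies outside [{start}, {end}]")
        # Assumption: half-open bins [left, right); x == end goes into the last bin
        i = min(int((x - start) // bin_width), n_bins - 1)
        counts[i] += 1
    m = len(data)
    rows = []
    for i, c in enumerate(counts):
        left = start + i * bin_width
        y = c / m                      # relative frequency y_i
        rows.append((left, left + bin_width, c, y, y / bin_width))
    return rows

for width in (20, 10, 5):
    print(f"bin size {width} cm")
    for left, right, c, y, d in frequency_table(heights, 150, 190, width):
        print(f"  {left}-{right}  freq={c}  rel freq={y:.3f}  density={d:.4f}")
```

The density column is what would be drawn as bar heights in the three histograms.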

Returning to our M&M example, recall that we have introduced the continuous random variable $X=$"weight of M&M". As $y_i$ is the relative frequency of weights in bin $i$, we have the following:

Theorem 2

The area of the bar $i$ approximates the probability that $X$ will take on a value in bin $i$:

$$p(X\in \text{bin } i) \approx d_i\cdot \Delta x$$

The larger the number of data points $m$ in the data set, the better the approximation.
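A small simulation can illustrate how the bar area approaches the probability as $m$ grows. This is only a sketch under a strong assumption: the weights are taken to follow a normal distribution with mean 55 g and standard deviation 2.5 g. This model and its parameters are invented for the illustration and do not come from the text.

```python
import random
from statistics import NormalDist

# Assumption (not from the text): weights ~ Normal(mean 55 g, std dev 2.5 g)
MU, SIGMA = 55.0, 2.5
LEFT, RIGHT = 54.0, 56.0          # the bin 54-56 with Δx = 2 g

model = NormalDist(MU, SIGMA)
true_p = model.cdf(RIGHT) - model.cdf(LEFT)   # p(X in bin) under the assumed model

def bar_area(m, seed=0):
    """Simulate m weights and return the bar area d_i * Δx of the bin 54-56."""
    rng = random.Random(seed)
    hits = sum(LEFT <= rng.gauss(MU, SIGMA) < RIGHT for _ in range(m))
    return hits / m               # relative frequency y_i = d_i * Δx

for m in (100, 10_000, 100_000):
    print(f"m={m:>7}: bar area={bar_area(m):.4f}, model probability={true_p:.4f}")
```

For any single run the estimate for small $m$ may happen to be close by chance; the statement is that the typical deviation shrinks as $m$ increases.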

Exercise 2

You bought a huge box of nails and want to know more about the length $X$ of these nails (in $\unit{cm}$). Randomly selecting several hundred nails and measuring their lengths results in the following histogram:

Give an estimate of the following probabilities:

  1. $p(X \in [2, 4])$
  2. $p(X \in [4, 8])$

How can you improve the estimate?

Solution

As the relative frequencies in a histogram are given by the bar areas, and the relative frequency approximates the probability, we get

  1. $p(X \in [2, 4]) \approx 2\cdot 0.05=0.1$ (bar area above $[2,4]$)
  2. $p(X \in [4, 8]) \approx 2\cdot 0.1+2\cdot 0.2=0.6$ (sum of bar areas above $[4,6]$ and $[6,8]$)

The approximations get better with increasing sample size.
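The same calculation can be written as a short function that sums the areas of all bars inside an interval. The sketch below lists only the three bars whose densities appear in the solution, and it assumes that the interval end points coincide with bin edges; `estimate_probability` is a name chosen for this example.

```python
# (left edge, right edge, density) of the three bars used in the solution;
# any further bars of the histogram are not needed here and are not listed
bars = [(2, 4, 0.05), (4, 6, 0.1), (6, 8, 0.2)]

def estimate_probability(bars, a, b):
    """Sum the areas d_i * Δx of all bars lying entirely inside [a, b].

    Assumption: a and b coincide with bin edges.
    """
    return sum((right - left) * d
               for left, right, d in bars
               if a <= left and right <= b)

p1 = estimate_probability(bars, 2, 4)   # bar above [2, 4]
p2 = estimate_probability(bars, 4, 8)   # bars above [4, 6] and [6, 8]
print(f"p(X in [2, 4]) is roughly {p1:.2f}")
print(f"p(X in [4, 8]) is roughly {p2:.2f}")
```

If an interval cut through the middle of a bin, this simple version would ignore that bin entirely, so it should only be used with intervals made up of whole bins.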