Frequency distribution of continuous data

If the data set is continuous, we cannot simply count how often each possible value occurs. Well, we could, but because all values within a range are possible, most of them will not appear in the data set, and even if one does, it will probably not appear a second or third time.

For example, let's take the continuous data set that results from measuring the weight of $100$ M&M's (in $\unit{g}$):

50.34, 60.2, 54.271, 50.331, ...

Think of the creation of the M&M's as a random process driven by the machine that produces them. Due to small imperfections of the machine, each M&M will look a bit different, have a slightly different weight, and so on. So we have the random experiment "produce an M&M", and we use the continuous random variable $X$ = "weight of the M&M". We repeat the experiment $m=100$ times to produce the data set above. Clearly, it is highly unlikely that there is a second M&M with the exact weight $50.331$. So counting all M&M's with the weight $50.331$ will not help much in gaining insight into the weight distribution of the data set: the count will be $1$, as it will be for every other weight in the data set.

A better approach is to ask how many M&M's fall into a certain range, e.g. from $48-50 \unit{g}$, from $50-52 \unit{g}$, from $52-54 \unit{g}$, and so on. These ranges are called bins, and the width of a bin is called the bin size, written $\Delta x$ (here $\Delta x=\qty{2}{g}$).

So by counting how many M&M's fall into each bin, we might get the following frequency distribution of the data:

\begin{array}{c|c|c|c} \text{bin } i & \text{freq} & \text{rel freq } y_i & \text{density } d_i \\\hline 48-50 & 2 & 0.02 & 0.01\\ 50-52 & 11 & 0.11 & 0.055\\ 52-54 & 15 & 0.15 & 0.075\\ 54-56 & 39 & 0.39 & 0.195\\ 56-58 & 27 & 0.27 & 0.135\\ 58-60 & 6 & 0.06 & 0.03\\ \end{array}

Definition 1

For continuous data we are typically interested in the density, which is the relative frequency divided by the bin size. It tells us how densely the data points are arranged in a bin:

d_i = \frac{y_i}{\Delta x}
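As a small illustration of this definition, the following sketch recomputes the relative frequencies and densities of the M&M table from the raw bin counts. The variable names (`bins`, `freq`, `rel_freq`, `density`) are chosen here for illustration only.

```python
# Frequencies per bin, copied from the M&M table above.
bins = [(48, 50), (50, 52), (52, 54), (54, 56), (56, 58), (58, 60)]
freq = [2, 11, 15, 39, 27, 6]

m = sum(freq)   # number of data points
delta_x = 2     # bin size in g

# Relative frequency y_i = freq_i / m
rel_freq = [f / m for f in freq]
# Density d_i = y_i / delta_x
density = [y / delta_x for y in rel_freq]

for (lo, hi), f, y, d in zip(bins, freq, rel_freq, density):
    print(f"{lo}-{hi}: freq={f:2d}  y_i={y:.2f}  d_i={d:.3f}")
```

The printed rows should agree with the last two columns of the table.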

The graphical representation of the density is the histogram. In a histogram, the bins are plotted on the horizontal axis, and above each bin a bar is drawn with the same width as the bin and the height $d_i$.

Thus, we have

Theorem 1

The area of the bar, $d_i\cdot \Delta x$, is the relative frequency $y_i$ of the data points in bin $i$:

d_i\cdot \Delta x = y_i

Warning

Some books and websites draw histograms in which the bar height represents the relative frequency, and the bar area has no meaning at all. In this course, the relative frequency in a histogram is always represented by the bar area.
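To see why the convention matters, the sketch below compares the total bar area under both conventions for the M&M table. With density as the bar height the areas add up to $1$. With relative frequency as the bar height they add up to the bin size instead, so the area no longer reads as a relative frequency. The variable names are illustrative.

```python
# Relative frequencies y_i and bin size from the M&M table above.
rel_freq = [0.02, 0.11, 0.15, 0.39, 0.27, 0.06]
delta_x = 2  # bin size in g

# Convention of this course: bar height = density d_i = y_i / delta_x.
density = [y / delta_x for y in rel_freq]
area_density_bars = sum(d * delta_x for d in density)

# Alternative convention: bar height = relative frequency y_i.
area_rel_freq_bars = sum(y * delta_x for y in rel_freq)

print(round(area_density_bars, 10))   # total area with density heights
print(round(area_rel_freq_bars, 10))  # total area with relative-frequency heights
```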

The choice of the bin size is quite important. If we choose the bin size too small, many bins will contain no data, and the rest will contain perhaps one or two data points (see below, left). If we choose the bin size too big, we lose details about the distribution of the data points (see below, right). So how do we know which bin size to use? Typically, we find the bin size by trial and error: we try out different bin sizes until the resulting histogram is somewhere between these two extremes (see histogram above).

Exercise 1

Consider the continuous data set given below (heights of students in $\unit{cm}$ ):

162.12, 174.3, 166.62, 180.432, 177.37, 169.22, 156.66, 164.32, 150.23, 183.19, 167.41, 189.77

Draw three histograms using the bin sizes given below. Start at $\qty{150}{cm}$ and end at $\qty{190}{cm}$.

$\Delta x=\qty{20}{cm}$
$\Delta x=\qty{10}{cm}$
$\Delta x=\qty{5}{cm}$

Solution

$\Delta x=\qty{20}{cm}$ :

\begin{array}{c|c|c|l} \text{bin } i & \text{freq} & \text{rel freq } y_i & \text{density } d_i \\\hline 150-170 & 7 & 0.583 & 0.029\\ 170-190 & 5 & 0.417 & 0.021\\ \end{array}

$\Delta x=\qty{10}{cm}$ :

\begin{array}{c|c|c|l} \text{bin } i & \text{freq} & \text{rel freq } y_i & \text{density } d_i \\\hline 150-160 & 2 & 0.167 & 0.0167\\ 160-170 & 5 & 0.417 & 0.0417\\ 170-180 & 2 & 0.167 & 0.0167\\ 180-190 & 3 & 0.25 & 0.025\\ \end{array}

$\Delta x=\qty{5}{cm}$ :

\begin{array}{c|c|c|l} \text{bin } i & \text{freq} & \text{rel freq } y_i & \text{density } d_i \\\hline 150-155 & 1 & 0.083 & 0.0167\\ 155-160 & 1 & 0.083 & 0.0167\\ 160-165 & 2 & 0.167 & 0.0333\\ 165-170 & 3 & 0.250 & 0.05\\ 170-175 & 1 & 0.083 & 0.0167\\ 175-180 & 1 & 0.083 & 0.0167\\ 180-185 & 2 & 0.167 & 0.0333\\ 185-190 & 1 & 0.083 & 0.0167\\ \end{array}
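The three tables can be checked with a short sketch like the one below. The helper `frequency_table` is a hypothetical name introduced here. It assumes half-open bins (a value equal to an upper bin limit would go into the next bin) and that all values lie inside the given range, which is the case for this data set.

```python
heights = [162.12, 174.3, 166.62, 180.432, 177.37, 169.22,
           156.66, 164.32, 150.23, 183.19, 167.41, 189.77]

def frequency_table(data, start, end, delta_x):
    """Return one row (lower, upper, freq, rel_freq, density) per bin.

    Assumes half-open bins [lower, upper) and start <= x < end for all x.
    """
    n_bins = round((end - start) / delta_x)
    freq = [0] * n_bins
    for x in data:
        i = int((x - start) // delta_x)  # index of the bin containing x
        freq[i] += 1
    m = len(data)
    rows = []
    for i, f in enumerate(freq):
        lower = start + i * delta_x
        y = f / m                        # relative frequency y_i
        rows.append((lower, lower + delta_x, f, y, y / delta_x))
    return rows

for delta_x in (20, 10, 5):
    print(f"bin size {delta_x} cm")
    for lower, upper, f, y, d in frequency_table(heights, 150, 190, delta_x):
        print(f"  {lower}-{upper}: freq={f}  y_i={y:.3f}  d_i={d:.4f}")
```

Comparing the three outputs shows the trade-off described earlier: the largest bin size hides the shape of the data, while the smallest leaves most bins with only one or two data points.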

Returning to our M&M example, recall that we have introduced the continuous random variable $X=$ "weight of M&M". As $y_i$ is the relative frequency of weights in bin $i$ , we have the following:

Theorem 2

The area of the bar $i$ approximates the probability that $X$ will take on a value in bin $i$ :

p(X\in \text{ bin $i$}) \approx d_i\cdot \Delta x

The larger the number of data points $m$ in the data set, the better the approximation.
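The effect of $m$ can be explored with a small simulation. The sketch below does not use the M&M weights. As an assumption made purely for illustration, $X$ is uniformly distributed on $[0, 10]$, so the exact probability of the bin $[2, 4)$ is known to be $0.2$ and the relative frequency can be compared against it. The function name `estimate_bin_probability` and the seed are arbitrary choices.

```python
import random

def estimate_bin_probability(m, lower, upper, rng):
    """Relative frequency of m simulated values that fall into [lower, upper)."""
    hits = sum(1 for _ in range(m) if lower <= rng.uniform(0, 10) < upper)
    return hits / m

rng = random.Random(1)  # fixed seed so the run is repeatable
true_p = 0.2            # exact probability of [2, 4) for the assumed uniform X

for m in (100, 10_000, 100_000):
    estimate = estimate_bin_probability(m, 2, 4, rng)
    print(f"m={m:>7}: estimate={estimate:.4f}  error={abs(estimate - true_p):.4f}")
```

The error typically shrinks as $m$ grows, although a single run with a small $m$ can happen to land close to the true value by chance.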

Exercise 2

You bought a huge box of nails and want to know more about the length $X$ of these nails (in $\unit{cm}$). Randomly selecting several hundred nails and measuring their lengths results in the following histogram:

Give an estimate of the following probabilities:

$p(X \in [2, 4])$
$p(X \in [4, 8])$

How can you improve the estimate?

Solution

As the relative frequencies in a histogram are given by the bar areas, and the relative frequency approximates the probability, we get

$p(X \in [2, 4]) \approx 2\cdot 0.05=0.1$ (bar area above $[2,4]$ )
$p(X \in [4, 8]) \approx 2\cdot 0.1+2\cdot 0.2=0.6$ (sum of bar areas above $[4,6]$ and $[6,8]$ )

The approximations get better with increasing sample size.
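The two estimates above can be reproduced by summing bar areas. In the sketch below, the bar heights are the densities used in the solution. The dictionary layout and the helper name `estimate_probability` are illustrative, and the helper only handles intervals whose limits coincide with bin limits.

```python
# Bar heights (densities) used in the solution above, keyed by bin limits in cm.
densities = {(2, 4): 0.05, (4, 6): 0.1, (6, 8): 0.2}

def estimate_probability(densities, a, b):
    """Sum the bar areas of all bins that lie completely inside [a, b]."""
    return sum(d * (hi - lo)
               for (lo, hi), d in densities.items()
               if a <= lo and hi <= b)

print(round(estimate_probability(densities, 2, 4), 3))  # 0.1
print(round(estimate_probability(densities, 4, 8), 3))  # 0.6
```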