Frequency distribution of discrete data

To find the frequency distribution of a discrete data set, we simply count how often each possible value occurs. For example, consider the experiment of rolling a fair die once, set $X$ = "observed number", and repeat the experiment $m=10$ times. The data set might look as follows:

$$2, 5, 4, 6, 6, 3, 1, 1, 2, 4$$

The frequency distribution of the data set is shown in a table:

$$
\begin{array}{l|c|c}
x_i & \text{freq} & \text{rel freq } y_i \\\hline
1 & 2 & 2/10\\
2 & 2 & 2/10\\
3 & 1 & 1/10\\
4 & 2 & 2/10\\
5 & 1 & 1/10\\
6 & 2 & 2/10\\
\end{array}
$$
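The counting step can be sketched in a few lines of Python. This is a minimal illustration using the ten rolls from the example; the variable names (`data`, `freq`, `rel_freq`) are chosen here for illustration and are not part of the original text.

```python
from collections import Counter

# The ten observed rolls from the example.
data = [2, 5, 4, 6, 6, 3, 1, 1, 2, 4]
m = len(data)

# Absolute frequency of each observed value.
freq = Counter(data)

# Relative frequency y_i = (absolute frequency) / m, listed for every face 1..6.
rel_freq = {x: freq[x] / m for x in range(1, 7)}

for x in range(1, 7):
    print(x, freq[x], rel_freq[x])
```

`Counter` returns 0 for a face that never occurs, so the table would still list all six faces if one of them were missing from the data.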

Note that we typically express the counts as relative frequencies or percentages of the total number of data points in the set. We can also produce a graph of the frequency table (using the relative frequencies), the so-called bar chart. The possible values in the data set are indicated along the $x$-axis, and the relative frequencies along the $y$-axis.

What does the bar chart look like if there are many more data points in the set, say $100\,000$? In other words, what happens if we repeat the experiment $m=100\,000$ times? The relative frequencies would approach the probabilities of observing $1, 2, \ldots, 6$, so we then have

$$
\begin{array}{l|c}
x_i & \text{rel freq } y_i\\\hline
1 & \approx 1/6\\
2 & \approx 1/6\\
3 & \approx 1/6\\
4 & \approx 1/6\\
5 & \approx 1/6\\
6 & \approx 1/6\\
\end{array}
$$

which is the probability function of the random variable $X$:

$$p(X=1),\ p(X=2),\ \ldots,\ p(X=6)$$
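This behaviour can be checked with a small simulation. The sketch below assumes that Python's pseudo-random generator is an acceptable stand-in for a fair die; the seed value is an arbitrary choice made only so that the run is reproducible.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary fixed seed (an assumption) for a reproducible run

m = 100_000  # number of repetitions of the experiment
rolls = [random.randint(1, 6) for _ in range(m)]  # simulated fair-die rolls

counts = Counter(rolls)
rel_freq = {x: counts[x] / m for x in range(1, 7)}

# Compare each relative frequency with the probability 1/6.
for x in range(1, 7):
    print(x, round(rel_freq[x], 4), "vs", round(1 / 6, 4))
```

Each printed relative frequency should lie close to $1/6 \approx 0.1667$, though not exactly on it.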

Let us summarise:

Summary 1

Consider a discrete data set produced by a random experiment that is repeated $m$ times, and a discrete random variable $X$ with the possible outputs $x_1, \ldots, x_u$:

$$\underbrace{x_3, x_1, x_1, x_5, x_7, x_1, \ldots}_{m \text{ numbers}}$$

Let $y_1, \ldots, y_u$ be the relative frequencies of the values $x_1, \ldots, x_u$ in the data set (the bar chart of the data set). Then the relative frequencies of the data points in the data set approximate the probability function of $X$, that is

$$
\begin{array}{l}
y_1 \approx p(X=x_1)\\
y_2 \approx p(X=x_2)\\
\quad\vdots\\
y_u \approx p(X=x_u)
\end{array}
$$

The larger the data set (that is, the larger $m$), the better this approximation.
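To illustrate how the approximation tends to improve with $m$, the sketch below measures the largest gap between a relative frequency and the probability $1/6$ for a simulated fair die at three sample sizes. The helper `max_deviation`, the chosen sample sizes, and the seed are assumptions made for this illustration.

```python
import random
from collections import Counter

def max_deviation(m, rng):
    """Largest gap between a relative frequency and 1/6 after m fair-die rolls."""
    counts = Counter(rng.randint(1, 6) for _ in range(m))
    return max(abs(counts[x] / m - 1 / 6) for x in range(1, 7))

rng = random.Random(1)  # arbitrary fixed seed (an assumption) for reproducibility

# Deviation for a small, a medium, and a large data set.
deviations = {m: max_deviation(m, rng) for m in (10, 1_000, 100_000)}

for m, dev in deviations.items():
    print(m, round(dev, 4))
```

With only $10$ rolls every relative frequency is a multiple of $0.1$, so the gap to $1/6$ cannot be smaller than about $0.033$; for $100\,000$ rolls the gap is typically far smaller. The improvement holds on average, so a single run need not shrink at every step.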

Exercise 1

A coin with $p(H)=0.25$ is tossed $4$ times. Let $N$ = "number of heads". The experiment is performed $1000$ times. Determine the approximate frequency distribution of the data set and sketch the bar chart.

Solution

This is a binomial experiment with $n=4$ trials, where the probability of success (heads) is $p(H)=0.25$. So we have

$$
\begin{array}{ll}
p(N=0) = \binom{4}{0}\, 0.25^0 \cdot 0.75^4 & \approx 0.316 \\
p(N=1) = \binom{4}{1}\, 0.25^1 \cdot 0.75^3 & \approx 0.422 \\
p(N=2) = \binom{4}{2}\, 0.25^2 \cdot 0.75^2 & \approx 0.211 \\
p(N=3) = \binom{4}{3}\, 0.25^3 \cdot 0.75^1 & \approx 0.047 \\
p(N=4) = \binom{4}{4}\, 0.25^4 \cdot 0.75^0 & \approx 0.004 \\
\end{array}
$$
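These values can be reproduced numerically. The following sketch evaluates the binomial probability function with `math.comb` from the Python standard library; the variable names are chosen for illustration.

```python
from math import comb

n = 4      # tosses per experiment
p = 0.25   # probability of heads on a single toss

# Binomial probability function p(N = k) for k = 0, ..., n.
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(k, round(prob, 3))
```

The five probabilities add up to $1$, which is a quick sanity check on the calculation.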

If we run the experiment $1000$ times, $N$ randomly takes the values $0$, $1$, $2$, $3$, and $4$. The value $0$ occurs in about $31.6\%$ of these runs (relative frequency $0.316$, so an absolute frequency of about $316$), the value $1$ occurs in about $42.2\%$ of the runs (relative frequency $0.422$, absolute frequency about $422$), and so on. So we have the following frequency table:

$$
\begin{array}{l|c|c}
x_i & \text{frequency} & \text{rel frequency } y_i \\\hline
0 & 316 & 0.316\\
1 & 422 & 0.422\\
2 & 211 & 0.211\\
3 & 47 & 0.047\\
4 & 4 & 0.004\\
\end{array}
$$

These are only approximate values, since the relative frequencies match the probabilities closely only if the number of repetitions of the experiment is extremely high, much higher than $1000$.
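A simulation makes the size of the deviations visible. The sketch below repeats the four-toss experiment $1000$ times with a pseudo-random biased coin; the helper `count_heads` and the seed are assumptions made for this illustration, and the counts it prints will differ from the table by some amount.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary fixed seed (an assumption) for a reproducible run

n, p, runs = 4, 0.25, 1000  # tosses per experiment, p(H), number of repetitions

def count_heads():
    """Toss the biased coin n times and return the number of heads."""
    return sum(random.random() < p for _ in range(n))

# Absolute and relative frequencies of N over all repetitions.
counts = Counter(count_heads() for _ in range(runs))
rel_freq = {k: counts[k] / runs for k in range(n + 1)}

for k in range(n + 1):
    print(k, counts[k], rel_freq[k])
```

Changing the seed or increasing `runs` shows how the observed frequencies scatter around the theoretical values.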