Frequency distribution of discrete data

To find the frequency distribution of a discrete data set, we simply count how often each possible value occurs. For example, let us roll a fair die, set $X$ = "observed number", and repeat the experiment $m=10$ times. The data set might look as follows:

2,5,4,6,6,3,1,1,2,4

The frequency distribution of the data set is shown in a table:

\begin{array}{l|c|c} x_i & \text{freq} & \text{rel freq } y_i \\\hline 1 & 2 & 2/10\\ 2 & 2 & 2/10\\ 3 & 1 & 1/10\\ 4 & 2 & 2/10\\ 5 & 1 & 1/10\\ 6 & 2 & 2/10\\ \end{array}
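The counting step behind this table can be sketched in a few lines of Python. This is only an illustration: the names `data`, `counts` and `rel_freq` are chosen here and are not part of the text.

```python
from collections import Counter

# The ten observed die rolls from the example above
data = [2, 5, 4, 6, 6, 3, 1, 1, 2, 4]
m = len(data)

# Absolute frequency of each observed value
counts = Counter(data)

# Relative frequency y_i = (absolute frequency) / m for x_i = 1, ..., 6
rel_freq = {x: counts[x] / m for x in range(1, 7)}

for x in range(1, 7):
    print(x, counts[x], rel_freq[x])
```

The relative frequencies always add up to $1$, which is a quick check that no data point was missed.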

Note that we typically express the counts as relative frequencies or percentages of the total number of data points in the set. We can also produce a graph of the frequency table (using the relative frequencies), the so-called bar chart. In it, the possible values in the data set are indicated along the $x$-axis, and the relative frequencies along the $y$-axis.

What does the bar chart look like if there are many more data points in the set, say $100\,000$? In other words, what happens if we repeat the experiment $m=100\,000$ times? Well, the relative frequencies would approach the probabilities of observing $1, 2, \dots, 6$, so we then have

\begin{array}{l|c} x_i & \text{rel freq } y_i\\\hline 1 & \approx 1/6\\ 2 & \approx 1/6\\ 3 & \approx 1/6\\ 4 & \approx 1/6\\ 5 & \approx 1/6\\ 6 & \approx 1/6\\ \end{array}

which is the probability function of the random variable $X$ :

p(X=1), p(X=2), ..., p(X=6)
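One way to see this effect is to simulate the die on a computer. The following is a minimal sketch, assuming Python's pseudo-random generator is an acceptable stand-in for a fair die; the function name `relative_frequencies` and the fixed `seed` are illustrative choices, not part of the text.

```python
import random
from collections import Counter

def relative_frequencies(m, seed=0):
    """Roll a simulated fair die m times and return the relative frequencies."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    rolls = [rng.randint(1, 6) for _ in range(m)]
    counts = Counter(rolls)
    return {x: counts[x] / m for x in range(1, 7)}

# Compare a small and a large data set
for m in (10, 100_000):
    print(m, relative_frequencies(m))
```

For $m=10$ the relative frequencies can be far from $1/6 \approx 0.167$, whereas for $m=100\,000$ they should all lie close to it.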

Let us summarise:

Summary 1

Consider a discrete data set produced by a random experiment that is repeated $m$ times, and a discrete random variable $X$ with the possible values $x_1,...,x_u$ :

\underbrace{x_3, x_1, x_1, x_5, x_7, x_1, ...}_{m \text{ numbers}}

Let $y_1,..., y_u$ be the relative frequencies of the values $x_1,...,x_u$ in the data set (the bar chart of the data set). Then the relative frequencies of the data points in the data set approximate the probability function of $X$ , that is

y_1 \approx p(X=x_1)

y_2 \approx p(X=x_2)

...

y_u \approx p(X=x_u)

The larger the data set (that is, the larger $m$), the better this approximation.

Exercise 1

A coin with $p(H)=0.25$ is tossed $4$ times. Let $N$ ="number of heads". The experiment is performed $1000$ times. Determine the approximate frequency distribution of the data set and sketch the bar chart.

Solution

This is a binomial experiment with $n=4$ trials and success probability $p(H)=0.25$. So we have

\begin{array}{ll} p(N=0)= \binom{4}{0}\, 0.25^0 \cdot 0.75^4 & = 0.316 \\ p(N=1)= \binom{4}{1}\, 0.25^1 \cdot 0.75^3 & = 0.422 \\ p(N=2)= \binom{4}{2}\, 0.25^2 \cdot 0.75^2 & = 0.211 \\ p(N=3)= \binom{4}{3}\, 0.25^3 \cdot 0.75^1 & = 0.047 \\ p(N=4)= \binom{4}{4}\, 0.25^4 \cdot 0.75^0 & = 0.004 \\ \end{array}
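These values can be checked numerically. The sketch below evaluates the binomial formula with `math.comb` from the Python standard library; the helper name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 4, 0.25
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"p(N={k}) = {prob:.3f}")
```

Rounded to three decimals, the five probabilities agree with the values above and sum to $1$.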

If we run the experiment $1000$ times, $N$ randomly takes the values $0$, $1$, $2$, $3$, and $4$. The value $0$ occurs in about $31.6\%$ of these experiments (relative frequency $0.316$, so the absolute frequency is about $316$), the value $1$ occurs in about $42.2\%$ of the cases (relative frequency $0.422$, absolute frequency about $422$), and so on. So we have the following frequency table:

\begin{array}{l|c|c} x_i & \text{frequency} & \text{rel frequency } y_i \\\hline 0 & 316 & 0.316\\ 1 & 422 & 0.422\\ 2 & 211 & 0.211\\ 3 & 47 & 0.047\\ 4 & 4 & 0.004\\ \end{array}

These are only approximate values, since the relative frequencies match the probabilities closely only if the number of repetitions $m$ is extremely high, much higher than $1000$.
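To get a feeling for how far an actual run with $1000$ repetitions may deviate from the table, the experiment can be simulated. This is a hedged sketch: the function name `simulate_heads`, its parameters and the fixed `seed` are assumptions made for illustration, and a different seed gives different counts.

```python
import random
from collections import Counter

def simulate_heads(m=1000, n=4, p=0.25, seed=1):
    """Repeat 'toss a biased coin n times and count heads' m times."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    # Each toss shows heads with probability p; sum() counts the heads
    results = [sum(rng.random() < p for _ in range(n)) for _ in range(m)]
    counts = Counter(results)
    return {k: counts[k] for k in range(n + 1)}

counts = simulate_heads()
for k, c in counts.items():
    print(k, c, c / 1000)
```

The simulated frequencies typically differ from $316$, $422$, $211$, $47$, $4$ by a few units to a few dozen, which illustrates why the table holds only approximately.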