Discrete and continuous data

We now leave pure probability theory and move into the field of statistics. Statistics typically deals with data sets obtained from experiments and surveys. Based on these data sets, we often want to draw far-reaching conclusions, for example that men have a lower IQ than women (or vice versa). Statistics can tell us how valid such a conclusion is, given the data set at hand.

Yet we still need probability theory, simply because data sets are the result of random factors. For example, the person whose IQ we measure is often chosen at random: we simply select a person at random and measure that person's IQ. So it pays off to define a data set in terms of probability theory:

Definition 1

Consider a random experiment and a random variable $X$ . Let us repeat the experiment $m$ times. Recall that a random variable produces a value for each outcome of the experiment. So if we repeat the experiment $m$ times, we get $m$ values. We call these $m$ values a data set, or more specifically a sample.

Example 1

Rolling a fair (or biased) die twice is a random experiment. If $X$ ="the sum of the two numbers", and if we repeat the experiment $m=10$ times, we might get the data set
$2, 7, 4, 12, 12, 3, 6, 8, 6, 4$
Measuring the height of $100$ randomly chosen people is another example. Here the random experiment is "select a person at random", $X$ ="height of person", and $m=100$ . A possible data set could look like this (height in cm):
$171.32, 156.3201, 160.553, 200.0021, ...$
Flipping a coin three times is a random experiment. If we define $X$ ="number of heads", and repeat the experiment $m=5$ times, we might get the data set
$1,0,3,0,2$
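The three data sets above can also be produced by simulation. The following is a minimal sketch, not part of the original text: the helper `draw_sample` and the three experiment functions are hypothetical names, the die and the coin are assumed to be fair, and the height model (normal distribution with mean 170 cm and standard deviation 10 cm) is purely an assumption chosen for illustration.

```python
import random


def draw_sample(experiment, m, seed=None):
    """Repeat a random experiment m times and collect the m values of X."""
    rng = random.Random(seed)
    return [experiment(rng) for _ in range(m)]


def sum_of_two_dice(rng):
    # X = "the sum of the two numbers" when a fair die is rolled twice
    return rng.randint(1, 6) + rng.randint(1, 6)


def number_of_heads(rng):
    # X = "number of heads" when a fair coin is flipped three times (1 = heads)
    return sum(rng.choice((0, 1)) for _ in range(3))


def height_in_cm(rng):
    # X = "height of person"; the normal model with mean 170 and
    # standard deviation 10 is an assumption, not taken from the text
    return rng.gauss(170.0, 10.0)


dice_sample = draw_sample(sum_of_two_dice, m=10, seed=1)
heads_sample = draw_sample(number_of_heads, m=5, seed=1)
height_sample = draw_sample(height_in_cm, m=100, seed=1)

print(dice_sample)
print(heads_sample)
print(height_sample[:4])
```

Each call to `draw_sample` corresponds to Definition 1: the experiment is repeated $m$ times, and the $m$ values of $X$ form the sample. Fixing `seed` only makes the run reproducible; a different seed gives a different data set.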

Let us distinguish between two types of data.

Definition 2

If the random variable $X$ can take only certain separate values (for example whole numbers), we call $X$ and the corresponding data sets discrete. If, on the other hand, $X$ can take any value within a certain range, we call $X$ and the corresponding data sets continuous.

As a rule of thumb, data obtained by counting is discrete, while data obtained by measuring is continuous.

Example 2

In Example 1, the first and third data sets (sum of two dice, number of heads) are discrete; the second (heights) is continuous.

Exercise 1

For each random variable $X$ below, give a possible random experiment and decide whether $X$ is discrete or continuous:

$X$ ="the length of a leaf"
$X$ ="the number of students in a class"
$X$ ="a dog's weight"
$X$ ="the exact amount of water in a $1\,\mathrm{l}$ bottle"
$X$ ="the date of birth of a person"

Solution

continuous, possible random experiment is "select a leaf at random in a forest"
discrete, possible random experiment is "select a class at random"
continuous, possible random experiment is "select a dog in town at random"
continuous, possible random experiment is "select a bottle of water in Migros at random"
This is tricky: if you are only interested in the date (e.g. 23-4-1988), it is discrete, but if you are interested in the exact time of birth, it is continuous. A possible random experiment is "select a person from the street at random."

We have already used the mean $m$ and the standard deviation $s$ to describe a data set. A third way is to create a frequency distribution, which tells us how often each value occurs in the data set. In the next two sections we will discuss the frequency distribution separately for discrete and continuous data.
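As a preview for the discrete case, the frequency distribution of the dice data set from Example 1 can be computed by simply counting. This is a minimal sketch under that assumption; the function name `frequency_distribution` is a hypothetical helper, and the relative frequency (count divided by the number of values) is added only for illustration.

```python
from collections import Counter

# Dice data set from Example 1 (sum of two dice, m = 10 repetitions)
data = [2, 7, 4, 12, 12, 3, 6, 8, 6, 4]


def frequency_distribution(values):
    """Map each distinct value to the number of times it occurs, sorted by value."""
    counts = Counter(values)
    return {value: counts[value] for value in sorted(counts)}


freq = frequency_distribution(data)

for value, count in freq.items():
    relative = count / len(data)
    print(f"{value:>2}: {count}  (relative frequency {relative:.1f})")
```

Plain counting like this is only meaningful for discrete data. In a continuous data set such as the heights, almost every value occurs exactly once, which is why the continuous case needs separate treatment.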