Discrete and continuous data

We now leave pure probability theory and move into the field of statistics. Statistics typically deals with data sets obtained from experiments and surveys. Based on these data sets, we often want to draw far-reaching conclusions, for example that men have a lower average IQ than women (or vice versa). Statistics can tell us how valid such a conclusion is, given the data set at hand.

Yet we still need probability theory, simply because data sets are the result of random factors. For example, the people whose IQ we measure are usually chosen at random: we select a person at random and measure that person's IQ. So it pays off to define a data set in terms of probability theory:

Definition 1

Consider a random experiment and a random variable $X$. Let us repeat the experiment $m$ times. Recall that a random variable produces a value for each outcome of the experiment. So if we repeat the experiment $m$ times, we get $m$ values. We call these $m$ values a data set, or more specifically a sample.

Example 1
  1. Rolling a fair (or biased) die twice is a random experiment. If $X$ = "the sum of the two numbers", and if we repeat the experiment $m = 10$ times, we might get the data set

    2, 7, 4, 12, 12, 3, 6, 8, 6, 4
  2. Measuring the height of 100 randomly chosen people. Here the random experiment is "select a person at random", $X$ = "height of person", and $m = 100$. A possible data set could look like this (height in cm):

    171.32, 156.3201, 160.553, 200.0021, ...
  3. Flipping a coin three times is a random experiment. If we define $X$ = "number of heads", and repeat the experiment $m = 5$ times, we might get the data set

    1, 0, 3, 0, 2
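
Definition 1 can be tried out on a computer. The following is a minimal Python sketch, assuming a fair die and a fair coin; the function names (draw_sample, sum_of_two_dice, heads_in_three_flips) and the seed are made up for this illustration. It repeats experiments 1 and 3 and collects the values of $X$ into a sample.

```python
import random


def draw_sample(experiment, m):
    """Repeat a random experiment m times and collect the m values of X."""
    return [experiment() for _ in range(m)]


def sum_of_two_dice():
    # X = "the sum of the two numbers" when a fair die is rolled twice
    return random.randint(1, 6) + random.randint(1, 6)


def heads_in_three_flips():
    # X = "number of heads" when a fair coin is flipped three times
    return sum(random.choice((0, 1)) for _ in range(3))


random.seed(0)  # arbitrary seed, only so that a run can be repeated

dice_sample = draw_sample(sum_of_two_dice, 10)      # m = 10
coin_sample = draw_sample(heads_in_three_flips, 5)  # m = 5

print(dice_sample)
print(coin_sample)
```

Each run with a different seed gives a different data set, which is exactly why a sample has to be treated with probability theory.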

Let us distinguish between two types of data.

Definition 2

If the random variable $X$ can take only certain separate values, we call $X$ and the corresponding data sets discrete. If, on the other hand, $X$ can take any value within a certain range, we call $X$ and the corresponding data sets continuous.

As a rule of thumb, data obtained by counting is discrete, while data obtained by measuring is continuous.

Example 2

In Example 1, the data sets in items 1 and 3 are discrete, while the data set in item 2 is continuous.
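
The difference can also be seen in a rough simulation. The sketch below assumes a fair die and, purely for illustration, models heights as uniformly distributed between 150 cm and 200 cm (a hypothetical model, not taken from real measurements). It counts how many distinct values appear in each sample. Computer numbers have finite precision, so the "continuous" sample is only an approximation.

```python
import random

random.seed(1)  # arbitrary seed, only so that a run can be repeated
m = 1000        # number of repetitions of each experiment

# Discrete: the sum of two dice can only be one of 2, 3, ..., 12.
dice_sums = [random.randint(1, 6) + random.randint(1, 6) for _ in range(m)]

# Continuous (hypothetical model): any height between 150 and 200 cm is possible.
heights = [random.uniform(150.0, 200.0) for _ in range(m)]

distinct_dice = set(dice_sums)
distinct_heights = set(heights)

print(len(distinct_dice), "distinct dice sums")
print(len(distinct_heights), "distinct heights")
```

The dice sample keeps hitting the same few values, while the height sample almost never repeats a value.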

Exercise 1

Give a possible random experiment, and decide whether the random variable $X$ is discrete or continuous:

  1. $X$ = "the length of a leaf"

  2. $X$ = "the number of students in a class"

  3. $X$ = "a dog's weight"

  4. $X$ = "the exact amount of water in a 1 l bottle"

  5. $X$ = "the date of birth of a person"

Solution
  1. continuous, possible random experiment is "select a leaf at random in a forest"
  2. discrete, possible random experiment is "select a class at random"
  3. continuous, possible random experiment is "select a dog in town at random"
  4. continuous, possible random experiment is "select a bottle of water in Migros at random"
  5. This is tricky: if you are only interested in the calendar date (e.g. 23-4-1988), it is discrete, but if you are interested in the exact time of birth, it is continuous. A possible random experiment is "select a person from the street at random."

We have already used the mean $m$ and the standard deviation $s$ to describe a data set. A third way is to create a frequency distribution, which tells us how often each value occurs in the data set. In the next two sections we will discuss the frequency distribution separately for discrete and continuous data.
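
As a preview, here is a minimal Python sketch that computes all three descriptions for the dice data set from Example 1. It uses the standard library's collections.Counter and statistics module. The variable names are chosen for this illustration, and the choice of the sample standard deviation (statistics.stdev, dividing by the number of values minus one) is an assumption; if the earlier definition divides by the number of values instead, statistics.pstdev is the matching function.

```python
from collections import Counter
from statistics import mean, stdev

# Data set from Example 1, item 1 (sum of two dice, m = 10)
data = [2, 7, 4, 12, 12, 3, 6, 8, 6, 4]

# Frequency distribution: how often each value occurs in the data set
frequencies = Counter(data)
for value in sorted(frequencies):
    print(value, "occurs", frequencies[value], "time(s)")

sample_mean = mean(data)
sample_sd = stdev(data)  # assumption: sample standard deviation (n - 1)

print("mean:", sample_mean)
print("standard deviation:", sample_sd)
```

Values that never occur, such as 5, simply have frequency 0.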