The probability density function

Recall that for a discrete random variable $X$ with the values $x_1,...,x_u$, the list of probabilities

$$p(X=x_1), ..., p(X=x_u)$$

is called the probability function of $X$. It tells you how the probabilities are distributed over the different values of $X$.

We want to define an analogous function for a continuous random variable. Remember that a continuous random variable $X$ can take on any value in a given interval $I$ (or even on the whole real line, $I=]-\infty,\infty[$). What exactly this interval is depends on the problem.

Example 1

Going back to the M&M example, the weight could be any value between $40g$ and $70g$, such as $X=40.331$ or even $X=69.8213503104531390786574$, but (because of the production process, perhaps) will never go lower than $40g$ and never exceed $70g$. In this case we could choose the interval to be $I=[40,70]$.

However, this poses two problems. First, how do we list the probabilities of a continuous random variable? We cannot form a list as in the discrete case above, because we cannot list all real numbers in an interval (the real numbers in any interval of positive length are not countable).

Example 2

Back to the M&Ms, where the interval is $[40,70]$. What follows after $40$: is it $40.1$, or $40.01$, or $40.001$, or $40.000001$? I hope you see the problem.

A second problem is that $p(X=x)\approx 0$ for every value that $X$ can take on (that is, for every value $x \in I$). So for the M&Ms, we have $p(X=40)\approx 0$, $p(X=40.00001)\approx 0$, and so on. Why? It is simply very unlikely that several M&Ms have exactly the same weight $40$ or $40.00001$, meaning that the relative frequency is always very close to zero. So even if we could somehow list all the probabilities, there would be very little useful information in such a list.
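The second problem can be made tangible with a small simulation. The sketch below is only an illustration: it assumes, purely for the sake of the example, that the weights are uniformly distributed on $[40,70]$ (the text does not say how they are distributed), and the names `weights`, `target`, and the interval $[40,45]$ are made up for this sketch. It compares the relative frequency of one exact value with the relative frequency of a whole interval.

```python
import random

random.seed(0)  # fixed seed so that repeated runs give the same numbers

# ASSUMPTION (not from the text): weights are uniform on [40, 70].
m = 100_000
weights = [random.uniform(40.0, 70.0) for _ in range(m)]

# Relative frequency of hitting one exact value.
target = 40.00001
rel_freq_exact = sum(1 for w in weights if w == target) / m

# Relative frequency of landing in a whole interval, here [40, 45].
rel_freq_interval = sum(1 for w in weights if 40.0 <= w <= 45.0) / m

print("exact value:", rel_freq_exact)
print("interval [40, 45]:", rel_freq_interval)
```

Under this assumption the exact value is practically never hit, while the interval collects roughly one sixth of the data points. Probabilities of intervals therefore carry information, whereas probabilities of single values do not.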

To circumvent these problems, we have to take another approach to capture the probability distribution of a continuous random variable. The basic idea is to replace the probability function of a discrete random variable:

$$p(X=x_1), p(X=x_2), ..., p(X=x_u)$$

with the probabilities of small intervals for continuous variables:

$$p(X\in I_1), p(X\in I_2), ..., p(X\in I_v)$$

where the intervals $I_1, I_2, ..., I_v$ divide the interval $I$. This is just the basic idea. It is not really workable in this version because we do not have a natural way to find the intervals $I_1, I_2, ..., I_v$. How big or small do they have to be? How many of them should we choose? In fact, we need to approach this problem from a different angle.

We start by defining a new type of function, the so-called probability density function of $X$.

Definition 1

Consider a continuous random variable $X$ with values in the interval $I=[a,b]$. The probability density function of $X$, written $f_X$, is a function with the following properties:

  1. $f_X(x)\geq 0$ for all $x \in I$
  2. $p(X\in [c,d])=\int_c^d f_X(x)\, dx\quad$ for every interval $[c,d]\subset I$.

Note 1

$a=-\infty$ or $b=\infty$ is also possible, which gives intervals like $I=]-\infty,b]$, $I=[a,\infty[$, or $I=]-\infty,\infty[$.

In other words, the graph of $f_X$ is never below the $x$-axis, and the probability that $X$ takes on a value in the interval $[c,d]$ is the area beneath the graph of $f_X$ (from $c$ to $d$). See the figure below.

Since $p(X\in I)=1$ ($X$ cannot take on any other values, but will always produce a value), the following holds:

Theorem 1

For a probability density function $f_X$ we have

$$\int_a^b f_X(x)\, dx=1$$

That is, the total area beneath the graph of $f_X$ equals $1$.
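Both properties of Definition 1 and the statement of Theorem 1 can be checked numerically for a concrete density. The following sketch uses a hypothetical triangular density on $[40,70]$ with its peak at $55$. This density is an assumption made only for illustration and is not the actual M&M density. The helper `integrate` is a plain midpoint rule written for this example.

```python
def f_X(x):
    """Hypothetical triangular density on [40, 70] with its peak at 55."""
    if 40.0 <= x <= 70.0:
        return (1.0 - abs(x - 55.0) / 15.0) / 15.0
    return 0.0


def integrate(f, c, d, n=1000):
    """Midpoint-rule approximation of the integral of f from c to d."""
    h = (d - c) / n
    return h * sum(f(c + (k + 0.5) * h) for k in range(n))


# Property 1: the density is never negative (checked on a grid of points).
nonnegative = all(f_X(40.0 + 0.01 * k) >= 0.0 for k in range(3001))

# Theorem 1: the total area under the graph should be 1.
total_area = integrate(f_X, 40.0, 70.0)

# Property 2: a probability is the area over the corresponding interval.
p_50_60 = integrate(f_X, 50.0, 60.0)

print(nonnegative, total_area, p_50_60)
```

For this particular density the probability $p(X\in[50,60])$ can also be found by hand from the triangle geometry as $5/9$, which the numerical value should reproduce closely.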

Every continuous random variable $X$ has such a probability density function (apart from some very strange exceptions). Now, the big question is, of course, how do we find $f_X$ for a given random variable $X$? It turns out that the graph of $f_X$ approximately corresponds to the curve formed by the histogram of $X$. To be more precise:

Theorem 2

To find the graph of the probability density function of a random variable $X$:

  1. create a huge number of data points drawn from $X$ (that is, we repeat the experiment a huge number of times and collect the values of $X$, such as the weight of M&Ms)
  2. create a histogram of the data points with a really small bin size $\Delta x$

The graph of $f_X$ at any point $x$ is then formed by the bar height $d_x$ at $x$:

$$f_X(x) \approx d_x$$

The more data points are used in the histogram and the smaller $\Delta x$ is chosen, the better this approximation is.
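The two steps of the theorem can be sketched in a few lines of code. The example below assumes a hypothetical triangular density on $[40,70]$ with its peak at $55$ (chosen only because Python's standard library can sample from it with `random.triangular`; it is not the real M&M density). The bar heights are computed as relative frequency divided by bin size, so that the bar area equals the relative frequency, and are compared with the density at the bin midpoints.

```python
import random

random.seed(1)  # fixed seed for reproducibility


def f_X(x):
    """Hypothetical triangular density on [40, 70] with its peak at 55."""
    if 40.0 <= x <= 70.0:
        return (1.0 - abs(x - 55.0) / 15.0) / 15.0
    return 0.0


a, b = 40.0, 70.0
m = 200_000   # number of data points
dx = 0.5      # bin size
n_bins = int(round((b - a) / dx))

# Step 1: draw m data points. Step 2: count them per bin.
counts = [0] * n_bins
for _ in range(m):
    w = random.triangular(a, b, 55.0)        # arguments: low, high, mode
    i = min(int((w - a) / dx), n_bins - 1)   # clamp in case w equals b
    counts[i] += 1

# Bar height d_i = relative frequency / bin size.
heights = [c / (m * dx) for c in counts]
midpoints = [a + (i + 0.5) * dx for i in range(n_bins)]

max_error = max(abs(d - f_X(x)) for d, x in zip(heights, midpoints))
total_bar_area = sum(d * dx for d in heights)

print("largest gap between bar height and density:", max_error)
print("total area of all bars:", total_bar_area)
```

Increasing `m` and decreasing `dx` together should shrink the largest gap further. Shrinking `dx` alone makes the bars noisier, because fewer points fall into each bin.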

Proof

Please study the proof, as it helps to understand the issues a bit better.

First, consider $m$ data points from $X$, obtained by repeating the experiment $m$ times, and let's create a histogram of bin size $\Delta x$ (figure above left, blue bars). As we have seen in the last chapter, the area of bar $i$ is $d_i\cdot \Delta x$, which is the relative frequency and thus approximately the probability that a data point lands in the bin $[x_i,x_{i+1}]$:

$$\text{blue bar area } i \approx p(X\in [x_i,x_{i+1}])$$

(the more data points we have, the better this approximation is). But from the definition of the probability density function we also know that the area under the curve of $f_X$ between $x_i$ and $x_{i+1}$ is exactly $p(X\in [x_i,x_{i+1}])$ (see figure above left, red area):

$$\text{red area } i = p(X\in [x_i,x_{i+1}])$$

Thus we have

$$\text{blue bar area } i \approx \text{red area } i$$

But because the width of both areas is the same, $\Delta x$, we also find that the height of the areas is about the same:

$$\text{height blue bar } i \approx \text{height red area } i$$

Note that there are many heights in the red area, as it is curved on top. But if we choose $\Delta x$ really small, all heights will approximately be the same, namely $f_X(x_i)$. Thus, we obtain the following result:

$$f_X(x_i) \approx d_i\quad (m\, \text{big}, \Delta x\, \text{small})$$

And this concludes the proof. Note that there is a more elegant but also a bit more abstract proof, which we quickly show below.

Alternative proof

Let $F_X$ be an antiderivative of $f_X$, that is, $F^\prime_X=f_X$. The area of bar $i$ in the histogram is, as above,

$$\begin{array}{lll} \Delta x \cdot d_i &\approx& p(X\in [x_i, x_{i+1}])\\ &=&\int_{x_i}^{x_{i+1}} f_X(x)\, dx\\ &=& F_X(x_{i+1})-F_X(x_i) \end{array}$$

where we used the fundamental theorem of calculus to bring the antiderivative into play. Now, because $x_{i+1}=x_i+\Delta x$, we get

$$\Delta x\cdot d_i \approx F_X(x_i+\Delta x)-F_X(x_i)$$

and thus

$$d_i \approx \frac{F_X(x_i+\Delta x)-F_X(x_i)}{\Delta x} \approx F^\prime_X(x_i)=f_X(x_i)$$

Thus we see again that

$$f_X(x_i)\approx d_i$$
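The key step of the alternative proof is that the difference quotient of $F_X$ approaches $f_X$ as $\Delta x$ shrinks. The sketch below illustrates this with the density from Exercise 1 of this section and its antiderivative $F_X(x)=\frac{3}{4}x-\frac{1}{4}x^3$. The point `x_i = 0.5` and the list of bin sizes are arbitrary choices made for the illustration.

```python
def f_X(x):
    """Density from Exercise 1: 3/4 - 3/4 x^2 on [-1, 1], zero elsewhere."""
    return 0.75 - 0.75 * x * x if -1.0 <= x <= 1.0 else 0.0


def F_X(x):
    """An antiderivative of f_X, valid for x in [-1, 1]."""
    return 0.75 * x - 0.25 * x ** 3


x_i = 0.5  # arbitrary point inside [-1, 1], chosen for illustration

errors = []
for dx in (0.1, 0.01, 0.001):
    # Difference quotient: the idealised bar height over [x_i, x_i + dx].
    quotient = (F_X(x_i + dx) - F_X(x_i)) / dx
    errors.append(abs(quotient - f_X(x_i)))
    print(dx, quotient, f_X(x_i))
```

The gap between the difference quotient and $f_X(x_i)$ shrinks roughly in proportion to $\Delta x$, which is the second $\approx$ in the chain above. The first $\approx$, between the histogram and the exact probability, depends on the number of data points and is not modelled here.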

Have a look at the proof of this theorem, but also play a bit with the sliders in the GeoGebra applet below. Observe how the histogram gets smoother and approximates the probability density function for a large number of points $n$ and a small bin size $\Delta x$. Actually, we would need a lot more points $n$ and a much smaller bin size $\Delta x$ to get a really smooth histogram that overlaps exactly with $f_X$, but you should get the idea.

Open in GeoGebra

Knowing the graph of $f_X$ does not necessarily mean we can find its function equation easily. To find such a function equation we often make an educated guess about $f_X$ and then verify our assumption by comparing the histogram with the graph of $f_X$. But we will not do this here.

Exercise 1
  1. Explain why $\int_{a}^{b} f_X(x)\, dx=1$ for every probability density function $f_X$, where $X$ takes on values in the interval $I=[a,b]$.

  2. Consider a random experiment with a continuous random variable $X$ whose probability density function is

    $$f_X(x)=\begin{cases}\frac{3}{4}-\frac{3}{4}x^2 & x\in [-1,1] \\ 0 & x\not\in [-1,1] \end{cases}$$
    1. Draw the probability density function.

    2. Determine the probability that the observed value of $X$ is between $0.4$ and $0.7$, that is, determine the probability $p(X\in [0.4,0.7])$.

Solution
  1. $\int_{a}^{b} f_X(x)\, dx=p(X\in [a, b])=1$, because $X$ always takes on a value in $[a,b]$.
  2. For the given density:
    1. The graph is the arc of a downward-opening parabola over $[-1,1]$, with its maximum $\frac{3}{4}$ at $x=0$ and zeros at $x=-1$ and $x=1$. Outside of $[-1,1]$ the graph lies on the $x$-axis.

    2. We have $p(X\in [0.4,0.7])=\int_{0.4}^{0.7} f_X(x)\, dx$. On $[-1,1]$, an antiderivative of $f_X$ is

      $$\begin{array}{lll} F(x)&=&\frac{3}{4}x-\frac{3}{4}\cdot\frac{1}{3}x^3\\ &=&\frac{3}{4}x-\frac{1}{4}x^3 \end{array}$$

      and therefore we have

      $$\begin{array}{lll} p(X\in [0.4,0.7])&=&\int_{0.4}^{0.7} f_X(x)\, dx\\ &=&F(0.7)-F(0.4)\\ &=&\frac{3}{4}\cdot 0.7-\frac{1}{4}\cdot 0.7^3-(\frac{3}{4}\cdot 0.4-\frac{1}{4}\cdot 0.4^3)\\ &=& 0.43925-0.284\\ &\approx& \underline{0.155} \end{array}$$
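The result of the exercise can be cross-checked numerically. The sketch below evaluates the antiderivative from the solution and, independently, approximates the integral with a simple midpoint rule. The helper `integrate` and the number of subintervals are choices made for this illustration only.

```python
def f_X(x):
    """Density from Exercise 1: 3/4 - 3/4 x^2 on [-1, 1], zero elsewhere."""
    return 0.75 - 0.75 * x * x if -1.0 <= x <= 1.0 else 0.0


def F(x):
    """Antiderivative from the solution, valid for x in [-1, 1]."""
    return 0.75 * x - 0.25 * x ** 3


def integrate(f, c, d, n=1000):
    """Midpoint-rule approximation of the integral of f from c to d."""
    h = (d - c) / n
    return h * sum(f(c + (k + 0.5) * h) for k in range(n))


p_exact = F(0.7) - F(0.4)               # via the antiderivative
p_numeric = integrate(f_X, 0.4, 0.7)    # via numerical integration
total_area = integrate(f_X, -1.0, 1.0)  # check for part 1 of the exercise

print(p_exact, p_numeric, total_area)
```

Both routes should agree on a value close to $0.155$, and the total area over $[-1,1]$ should come out close to $1$, in line with part 1 of the exercise.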