Mean and standard deviation of data

Consider a list of data, such as the grades of a student:

$$4,\ 6,\ 3.5,\ 3,\ 6$$

The average or mean value of the grades is obtained by forming the sum of the grades, and dividing by the number of grades:

$$m = \frac{4+6+3.5+3+6}{5} = 4.5$$
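The same calculation can be sketched in a few lines of Python. The variable names `grades` and `mean` are chosen here for illustration only:

```python
# Grades of the student from the example above
grades = [4, 6, 3.5, 3, 6]

# Sum of the grades divided by the number of grades
mean = sum(grades) / len(grades)

print(mean)  # 4.5
```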

The average value indicates what the typical value is, or the value that "lies somewhere in the middle". The mean thus marks a "place" or "position" on the number line, and is therefore also called a measure of position. There are several other such measures, e.g. the median or the quantiles, but we will not discuss them here.

Another interesting measure expresses how much the data vary. Another student with the same average score of $4.5$ but with much less variation (e.g. five times the grade $4.5$) clearly shows a different performance. To quantify such variation in the data, we use the standard deviation. Since the standard deviation measures dispersion, it is also called a measure of dispersion. Again, there are several other measures of dispersion.

The formula for calculating the standard deviation is shown below:

$$\begin{array}{lll} s &=& \sqrt{\frac{(4-4.5)^2+(6-4.5)^2+(3.5-4.5)^2+(3-4.5)^2+(6-4.5)^2}{5}}\\ &=& \sqrt{1.6}\\ &=& 1.26 \end{array}$$

This formula looks complicated, but its meaning is simple enough to understand: it measures roughly how far, on average, the data points deviate from the mean (see figure below).

The squared term $(4-4.5)^2$ is the difference between the data point $4$ and the mean $4.5$, squared. We square it so that the result is never negative. So the formula calculates the average of the squared differences between the mean and the data points, and takes the square root of this average.
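The individual steps of the formula (squared differences, their average, then the square root) can be traced in a short Python sketch. The variable names are illustrative and not part of the original text:

```python
import math

grades = [4, 6, 3.5, 3, 6]
m = sum(grades) / len(grades)  # mean, 4.5

# Squared difference between each data point and the mean
squared_diffs = [(x - m) ** 2 for x in grades]

# Average of the squared differences, then the square root of that average
mean_squared_diff = sum(squared_diffs) / len(grades)
s = math.sqrt(mean_squared_diff)

print(squared_diffs)      # [0.25, 2.25, 1.0, 2.25, 2.25]
print(mean_squared_diff)  # 1.6
print(round(s, 2))        # 1.26
```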

Clearly, a standard deviation $s=0$ means that there is no variation at all, and all data points equal exactly the mean $4.5$. The bigger $s$ is, the more the data points deviate from the mean (see figure below).

Some clarifications:

Here is a general definition of the average and the standard deviation:

Definition 1

Consider $n$ data points $x_1, ..., x_n$. The average or mean value of the data points is

$$\begin{array}{lll} m &=& \frac{1}{n}\sum_{i=1}^n x_i \\ &=& \frac{x_1+...+x_n}{n} \end{array}$$

The standard deviation of the data points (from the mean) is

$$\begin{array}{lll} s &=& \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i-m)^2} \\ &=& \sqrt{\frac{(x_1-m)^2+...+(x_n-m)^2}{n}} \end{array}$$

The variance of the data points is the squared standard deviation: $v=s^2$
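Definition 1 translates almost directly into code. The following is a minimal sketch; the function names `mean`, `variance` and `std_dev` are chosen here for illustration. As a cross-check it uses `statistics.pstdev` from the Python standard library, which also divides by $n$. Note that `statistics.stdev` divides by $n-1$ instead and therefore gives a slightly different number than the definition above.

```python
import math
import statistics

def mean(xs):
    """Sum of the data points divided by their number n."""
    return sum(xs) / len(xs)

def variance(xs):
    """Average squared difference from the mean (division by n)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    """Square root of the variance."""
    return math.sqrt(variance(xs))

grades = [4, 6, 3.5, 3, 6]
print(mean(grades), variance(grades), round(std_dev(grades), 2))  # 4.5 1.6 1.26

# The standard library's population standard deviation agrees
print(math.isclose(std_dev(grades), statistics.pstdev(grades)))   # True

# Five identical data points: no variation at all
print(std_dev([4.5] * 5))  # 0.0
```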

Exercise 1
  1. Determine the mean, standard deviation and variance of the data points

    $$30,\ 43,\ 21.2,\ 0$$
  2. Consider $1000$ melons, of which $20\%$ have weight $3.2\,\text{kg}$, $50\%$ have weight $3.5\,\text{kg}$, and the remaining $30\%$ have weight $4.1\,\text{kg}$. Determine the mean weight and the standard deviation.

Solution
  1. m=23.55,s=15.65,v=245m=23.55, s=15.65, v=245

  2. $200$ melons have weight $3.2\,\text{kg}$, $500$ melons have weight $3.5\,\text{kg}$, and $300$ melons have weight $4.1\,\text{kg}$. So we have

    $$\begin{array}{lll} m &=& \frac{200\cdot 3.2 + 500\cdot 3.5 + 300\cdot 4.1}{1000}\\ &=& \frac{200\cdot 3.2}{1000} + \frac{500\cdot 3.5}{1000} + \frac{300\cdot 4.1}{1000}\\ &=& 0.2\cdot 3.2 + 0.5\cdot 3.5 + 0.3\cdot 4.1\\ &=& \underline{3.62\,\text{kg}}\\ s &=& \sqrt{\frac{200\cdot (3.2-3.62)^2+500\cdot (3.5-3.62)^2+300\cdot (4.1-3.62)^2}{1000}}\\ &=& \sqrt{\frac{200\cdot (3.2-3.62)^2}{1000}+\frac{500\cdot (3.5-3.62)^2}{1000}+\frac{300\cdot (4.1-3.62)^2}{1000}}\\ &=& \sqrt{0.2\cdot (3.2-3.62)^2+ 0.5\cdot (3.5-3.62)^2 + 0.3\cdot (4.1-3.62)^2}\\ &=& \underline{0.334\,\text{kg}} \end{array}$$
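The last lines of both calculations use only the fractions $0.2$, $0.5$ and $0.3$. A minimal Python sketch of this shortcut follows; the helper `weighted_mean_std` is a hypothetical name introduced here, and it assumes the fractions sum to $1$. The result is compared with a direct computation over all $1000$ melons.

```python
import math

def weighted_mean_std(values, fractions):
    """Mean and standard deviation from distinct values and their fractions.

    Assumption: `fractions` are non-negative and sum to 1.
    """
    m = sum(p * x for x, p in zip(values, fractions))
    variance = sum(p * (x - m) ** 2 for x, p in zip(values, fractions))
    return m, math.sqrt(variance)

# Melon data from Exercise 1.2 (weights in kg)
m, s = weighted_mean_std([3.2, 3.5, 4.1], [0.2, 0.5, 0.3])
print(round(m, 2), round(s, 3))  # 3.62 0.334

# Cross-check against the explicit list of 1000 melons
melons = [3.2] * 200 + [3.5] * 500 + [4.1] * 300
m_full = sum(melons) / len(melons)
s_full = math.sqrt(sum((x - m_full) ** 2 for x in melons) / len(melons))
print(math.isclose(m, m_full), math.isclose(s, s_full))  # True True
```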

The solution of Exercise 1.2 highlights an interesting insight, which will be important in the next chapter: if many of the data points are equal, we can use the percentages of equal data points to calculate $m$ and $s$ very easily. In fact, we do not even have to know the total number of data points. Here is an example.

Example 1

Assume that $35\%$ of the data points have the value $7.1$, and the remaining $65\%$ of the data points have the value $9$. The mean and the standard deviation of the data points are:

$$m = 0.35\cdot 7.1 + 0.65\cdot 9 = 8.335$$

$$s = \sqrt{0.35\cdot (7.1-8.335)^2 + 0.65\cdot (9-8.335)^2} = 0.90624$$

Do you understand why? The solution below gives the explanation.

Solution

Let us assume that there are $n$ data points (e.g. $n=1000$). There are $0.35n$ data points with value $7.1$ and $0.65n$ data points with value $9$. There are also $0.35n$ squared differences $(7.1-m)^2$ and $0.65n$ squared differences $(9-m)^2$. Thus

$$\begin{array}{lll} m &=& \frac{0.35n \cdot 7.1 + 0.65n\cdot 9}{n}\\ &=& \frac{0.35\cdot n\cdot 7.1}{n} + \frac{0.65\cdot n\cdot 9}{n}\\ &=& 0.35\cdot 7.1+0.65\cdot 9\\ &=& 8.335\\ s &=& \sqrt{\frac{0.35n \cdot (7.1-m)^2 + 0.65n\cdot (9-m)^2}{n}}\\ &=& \sqrt{\frac{0.35\cdot n\cdot (7.1-8.335)^2}{n} + \frac{0.65\cdot n\cdot (9-8.335)^2}{n}}\\ &=& \sqrt{0.35\cdot (7.1-8.335)^2 + 0.65\cdot (9-8.335)^2}\\ &=& 0.906 \end{array}$$
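Since $n$ cancels in the derivation, the result should not depend on the total number of data points. The following Python sketch checks this numerically for a few values of $n$. These values were picked for illustration so that $0.35n$ is a whole number. The sketch uses `statistics.fmean` and `statistics.pstdev` from the standard library, where `pstdev` divides by $n$ as in Definition 1.

```python
import statistics

results = []
for n in (20, 100, 1000):
    count_a = round(0.35 * n)  # number of data points with value 7.1
    count_b = n - count_a      # remaining data points with value 9
    data = [7.1] * count_a + [9] * count_b

    m = statistics.fmean(data)   # mean of the explicit data list
    s = statistics.pstdev(data)  # standard deviation with division by n
    results.append((n, round(m, 3), round(s, 3)))

for row in results:
    print(row)
# (20, 8.335, 0.906)
# (100, 8.335, 0.906)
# (1000, 8.335, 0.906)
```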