The Sample Mean



     The sample mean of a collection of values is defined to be the sum of those values divided by the number of such values. It is their average. For instance, suppose one has the following data: 10, 30, 20, 20, 10, 20, 10, 30, 10, 20. There are ten data points, and the sum is 10+30+20+20+10+20+10+30+10+20=180. Therefore the average, or sample mean, of these data is 180/10, or 18.
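As a quick check, the definition can be sketched in R, the language used for the computer exercises in these notes. The variable names (x, n, total, m) are only illustrative choices.

```r
# The ten data points from the example above
x <- c(10, 30, 20, 20, 10, 20, 10, 30, 10, 20)

n <- length(x)    # number of data points
total <- sum(x)   # sum of the values
m <- total / n    # sample mean, straight from the definition

print(m)
print(mean(x))    # R's built-in mean() should agree with m
```

Computing the mean by hand and comparing it with mean(x) is a simple way to confirm that the built-in function does what the definition says.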
     There is another way to calculate the sample mean. These numbers can be rearranged this way: 10+10+10+10+20+20+20+20+30+30, which is the same as 4*10+4*20+2*30. In other words, why not just count up the number of times the 10 occurred, and just multiply, and do the same for each of the possible values - and then add that up? That is another way to compute the sum in the numerator.
     Also, we need to divide by the number of data points. Let's do that too: the average will be equal to (4*10+4*20+2*30)/10. We can also divide each term by the 10, and we find that the average is (4/10)*10 + (4/10)*20 + (2/10)*30. What is the relative frequency of occurrence of the value ten? It is 4 occurrences of this value divided by 10, or 4/10. What is the relative frequency of occurrence of the value twenty? It is 4 occurrences of this value divided by 10, or 4/10. And finally what is the relative frequency of occurrence of the value thirty? It is two occurrences of this value divided by 10, or 2/10.
     So it is possible to compute the average of some data by first calculating the relative frequency of each possible value, then multiplying each value by its relative frequency, and then adding this all together. In general, suppose the data points are X_i, where i goes from 1 to N. Here, N is the number of data values. Let's call the average or sample mean m, so that m = (X_1 + X_2 + ... + X_N)/N. Let f_x be the relative frequency of the value x. Then the sample mean is also found by taking each distinct value x, calculating x*f_x, and adding them up: m = sum over x of x*f_x.
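The relative-frequency method can be sketched in R for the first data set. This is one possible way to do it, using table() to count occurrences; the names counts, values, and f are illustrative.

```r
x <- c(10, 30, 20, 20, 10, 20, 10, 30, 10, 20)

counts <- table(x)                    # occurrences of each distinct value
values <- as.numeric(names(counts))   # the distinct values x
f <- as.vector(counts) / length(x)    # relative frequency f_x of each value

m <- sum(values * f)                  # add up x * f_x over the distinct values

print(data.frame(value = values, rel_freq = f))
print(m)
```

The result should match mean(x), since both are the same sum arranged in a different order.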
     To do another example, suppose I want to take the average of the values


1,2,3,2,3,4,3,6,1,2,1,4,2,3,3,2,1,7,1,2,3,1,1,1,1,1,3,2,3,4

There are only six different values, 1,2,3,4,6,7, and the final sum we do is going to have six terms, one for each of them. There are 30 data values. The following table summarizes the calculation:

Data value, x	Number of occurrences	Relative frequency, f_x	xf_x
1	10	10/30	1*10/30
2	7	7/30	2*7/30
3	8	8/30	3*8/30
4	3	3/30	4*3/30
6	1	1/30	6*1/30
7	1	1/30	7*1/30

The mean is then 73/30, or approximately 2.4333.
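The table can also be rebuilt in R as a check on the hand calculation. This is a sketch rather than a required method; the data frame and its column names (value, count, rel_freq, term) are assumptions made for illustration.

```r
x <- c(1,2,3,2,3,4,3,6,1,2,1,4,2,3,3,2,1,7,1,2,3,1,1,1,1,1,3,2,3,4)

counts <- table(x)   # number of occurrences of each distinct value

tab <- data.frame(
  value    = as.numeric(names(counts)),
  count    = as.vector(counts),
  rel_freq = as.vector(counts) / length(x)
)
tab$term <- tab$value * tab$rel_freq   # the x * f_x column

print(tab)

m <- sum(tab$term)   # sample mean as the sum of the x * f_x terms
print(m)
```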
When calculating the mean, you will notice that the large values may be balanced by the smaller values. The mean itself is a measure of central tendency, because it gives an idea of where in some sense the data are. For instance, if I have one small data set, 1,4,2,3, and another small data set, 450, 120, 300, 200, then one way of summarizing the location of the two groups relative to each other is by comparing the averages; the mean of the second sample is much larger than the mean of the first sample.
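The comparison of the two small data sets can be tried directly in R. The names a and b are arbitrary labels for the two samples.

```r
a <- c(1, 4, 2, 3)           # first small data set
b <- c(450, 120, 300, 200)   # second small data set

mean_a <- mean(a)
mean_b <- mean(b)

print(mean_a)
print(mean_b)
print(mean_b > mean_a)   # is the second sample located higher than the first?
```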
The mean is sensitive to outlying points. Try this computer exercise:


> x<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1000)

> mean(x)

Almost all the numbers happen to be 1, but the mean is much larger than that. Almost all the numbers are smaller than the average. Try different values for that last element instead of 1000, and see what happens.
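One way to try several values at once is sketched below. It assumes 25 ones followed by a single varying last element; the candidate values in last_values are arbitrary choices for illustration.

```r
ones <- rep(1, 25)   # assumption: 25 ones, followed by one varying element

# Hypothetical candidate values for the last element
last_values <- c(1, 10, 100, 1000, 10000)

# Mean of the data set for each choice of the last element
means <- sapply(last_values, function(v) mean(c(ones, v)))

print(data.frame(last_value = last_values, mean = means))
```

Only one of the 26 numbers changes from row to row, yet the mean moves with it every time.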
     The relative frequency can be thought of as a kind of sample mean. Suppose that I am computing the relative frequency of a certain event. At every experiment, I can define a random quantity whose value is 1 when the event occurs and 0 when it doesn't. If I have done the experiment N times, I have N numbers, each of which is 0 or 1. Adding these together gives the number of times the event has occurred, and dividing by N computes the average of these 0's and 1's. In general, a quantity that takes the value 1 when some event happens and 0 when it doesn't is called an indicator variable. The relative frequency is the average of the indicator variables for the event.
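A small sketch in R may make this concrete. The die rolls below are made-up data for illustration only, and the event of interest is taken to be "the roll is a 6".

```r
# Hypothetical outcomes of N = 8 repetitions of an experiment (die rolls)
rolls <- c(3, 6, 1, 6, 2, 6, 5, 4)

# Indicator variable: 1 when the event "roll is a 6" occurs, 0 otherwise
indicator <- as.numeric(rolls == 6)

count <- sum(indicator)       # number of times the event occurred
rel_freq <- mean(indicator)   # average of the 0's and 1's

print(indicator)
print(count)
print(rel_freq)
```

Here rel_freq equals count divided by the number of rolls, which is exactly the relative frequency of the event.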
     The sample mean gives a kind of typical value "in the middle" of a distribution. But the sample mean might not occur in the actual data; the sample mean might not even be a possible value of the actual data - how would anyone have 2.2 children? And sometimes the average of two things may not be quite what is needed - here is an old joke: two statisticians go hunting. After a long day of tracking and stalking, the first fires at a target and misses 5 feet to the left. The second aims at the now-fleeing target, and fires, but misses 5 feet to the right, as the target disappears into the woods. But the first statistician exults - got him!
     The sample mean is not the only measure of central tendency. The particular average we discussed is sometimes called the arithmetic mean, to distinguish it from other quantities such as a weighted arithmetic mean, or even the geometric or harmonic means (which we won't discuss in this class). Another useful measure of central tendency is called the sample median, which we will discuss later.
