[This text has been prepared by Ian Johnston of Malaspina University-College, in Nanaimo, BC for use in Liberal Studies. The text is in the public domain, released May 2000]
In the previous sections we have talked
about frequency distributions. These
refer, you will recall, to the particular scores in a set of results. Some results are more frequent than others,
and the density of the distribution may vary (e.g., clustered near to the mean,
spread out widely on either side of the mean, bunched at either end of the
values, and so on). There are innumerable
ways in which the scores in a set of results can be distributed.
Of particular interest to us in the
remainder of this module is a frequency distribution in which the dispersion of
the scores is symmetrical about the central mean, that is, in which the
frequency of results above the average matches exactly the frequency of results below the average and in which the
most frequent result falls at the average (see Histogram D in the sample
histograms in Section Two). In such a
distribution, the high point (the most frequent result) will be in the middle
of the distribution, and each side of the high point will be a mirror image of
the other.
For such a distribution, the histogram
would be perfectly symmetrical around the centre; in other words, the tallest
column (i.e., the most frequent value) will occur exactly in the centre of the
diagram and the other columns (frequencies) will fall away equally on either
side of the central value.
Such a perfectly symmetrical distribution is called a normal distribution (in popular
language the shape of this frequency distribution is commonly called a bell
curve).
Notice that a normal distribution may
come with very different dimensions (tall and skinny, short and wide), but the
characteristics mentioned above hold in all cases (the high point, i.e., the
most frequent value, is always in the centre, and the two sides of the curve
are perfectly symmetrical). In other
words, the characteristic bell shape is always present.
Here are a few examples of histograms
illustrating normal distribution. These
histograms illustrate the probability distribution for success in various coin
tosses. The x-axis here indicates the
number of heads in a particular sequence of coin tosses; the y-axis represents
the theoretical frequency of that result in the given number of fair
tosses.
The histograms will have different
sizes and shapes, because the frequency distribution changes with the number of
tosses. But notice that all the
histograms are perfectly symmetrical around the centre (the tallest and
therefore most frequent value).
Remember, once again, that these
diagrams represent probability distributions, or the frequency of results
theoretically calculated. And since the
total of all the probabilities for an event equals 1, the shaded area contained
in all the columns equals 1.
This diagram indicates that in a three coin toss
sequence (or three coins tossed simultaneously) there are four possible
results: 0 heads, 1 head, 2 heads, and 3 heads (the values on the X-Axis). The percentage frequency of these four
possibilities we read off the Y-Axis.
We read the following diagrams in the same way: the number of heads on
the X-Axis, and the percent probability on the Y-Axis. Notice the perfect symmetry in these
distributions.
Notice in the above histogram (for 20
coin tosses) how at the extremes (0, 1, 2, 18, 19, 20) the percent probability
is so small that the value does not show on the graph. Virtually all the results in a 20 coin-toss
sequence will fall between 3 and 17, with the most frequent value in the centre
(at 10). The frequencies on either side
of 10 are perfectly symmetrical (we can see that by the equal heights of 9 and
11, of 8 and 12, of 7 and 13, of 6 and 14, of 5 and 15, of 4 and 16, and of 3 and
17).
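The theoretical frequencies plotted in these coin-toss histograms can be recalculated with a few lines of code. The following is only a sketch, assuming fair coins and independent tosses; the function name heads_distribution is our own choice for illustration.

```python
from math import comb

def heads_distribution(n_tosses):
    """Theoretical probability of 0, 1, ..., n heads in n fair coin tosses."""
    total_outcomes = 2 ** n_tosses
    return [comb(n_tosses, k) / total_outcomes for k in range(n_tosses + 1)]

three = heads_distribution(3)
twenty = heads_distribution(20)

print(three)                    # probabilities of 0, 1, 2, and 3 heads
print(sum(twenty))              # all the probabilities together
print(twenty == twenty[::-1])   # is the distribution a mirror image of itself?
print(sum(twenty[3:18]))        # share of results with 3 to 17 heads
```

Reading the list from either end gives the same numbers, which is the symmetry visible in the histograms, and the combined probability of 3 to 17 heads in 20 tosses comes out above 99.9 percent.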
Notice how in the diagrams above, as
the number of columns increases, the entire shape of the histogram begins to
approximate a curve, with the shaded areas all under the top line. And, in fact, we can readily convert these
histograms (using rectangles) to a curve by joining up the central points on
the top of each column.
What we have when we do this is exactly
the same frequency distribution picture as we had with the columns, except that
we have filled in the gaps between columns.
Now we do not have the body of the columns, but that does not matter,
because the important part of the histogram picture is the line defined by the
top centre points of the columns (which indicates the percent probability of
any particular value along the x-axis).
In such a diagram, the important factor is the area under the curve, for
that graphically presents the total frequencies. Equal areas under such a curve will represent equal frequencies
(more on this later).
When we join up the columns in the
histogram in this way, we produce a particularly useful statistical shape, the normal curve (1).
The normal curve or the normal
distribution is an extremely important statistical concept, as important in
many areas of enquiry as the right-angle triangle is in Euclidean geometry, and
for the remainder of our short study of statistics we shall be dealing only with
this frequency distribution. So
understand clearly what the normal distribution means.
When we say that a particular
population characteristic is normally distributed, we mean the following:
1.
The
normal frequency curve shows that the highest frequency falls in the centre
(i.e., at the mean of the values in the distribution) with an equal and exactly
similar curve on either side of that centre.
Thus, the most frequent value in a normal distribution is the average,
with half the values falling below the average and half above it.
2.
The
normal curve, often called a bell curve, is perfectly symmetrical. Therefore the mean (the arithmetical
average), the mode (the most frequent value), and the median (the middle value)
will coincide at the centre of the curve (the high point). Make sure you understand this point.
3.
The
further away any particular value is from the average (above or below), the
less frequent that value will be (i.e., the frequencies will diminish on either
side of the high central point).
4.
Because
the two halves on either side of the centre are exactly symmetrical, the
frequency of values above the mean will match exactly the frequencies of values
below the mean, provided the distances between the values and mean are
identical. Thus, the frequency of a
value 3 units to the right of the mean will be identical to the frequency of
the value 3 units to the left of the mean.
This is a key idea; please make sure you understand it.
5.
The total
frequency of all values in the population will be contained by the area under
the curve. This is obvious enough,
since the total area under the curve represents all the possible occurrences of
that characteristic.
6.
Various
areas under the curve will therefore indicate the percentage of the total
frequency. For instance, 50 percent of
the area under the curve lies to the left of the mean (i.e., half of all
normally distributed results will fall in this area), and 50 percent of the
area under the curve lies to the right of the mean. Therefore, 50 percent of all scores will lie to the left and 50
percent to the right of the mean. Equal
areas under the curve represent equal numbers in the frequency. Again, please make sure you understand this
important idea.
7.
Normal
curves may have different shapes (i.e., tall and skinny, short and wide, and so
on). What will determine the overall
shape of the symmetrical curve will be the values of the mean and the standard
deviation in the population (these will define the shape in the same way the
centre point and the radius define a circle).
But the general characteristics listed above will remain the same.
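Several of the properties listed above can be checked numerically from the standard formula for the height of the normal curve. The sketch below is for illustration only: the mean of 10 and standard deviation of 3 are arbitrary, and normal_density is a name we have made up.

```python
from math import exp, pi, sqrt

def normal_density(x, mean, sd):
    """Height of the normal curve at x for a given mean and standard deviation."""
    z = (x - mean) / sd
    return exp(-0.5 * z * z) / (sd * sqrt(2 * pi))

mean, sd = 10.0, 3.0   # arbitrary illustrative values

peak = normal_density(mean, mean, sd)
left = normal_density(mean - 3, mean, sd)    # 3 units below the mean
right = normal_density(mean + 3, mean, sd)   # 3 units above the mean

print(abs(left - right) < 1e-12)   # equal heights at equal distances from the mean
print(peak > left)                 # the high point is at the mean
print(normal_density(mean, mean, 1.0) > peak)   # a smaller SD gives a taller, skinnier curve
```

Changing the mean slides the whole curve along the X-axis; changing the standard deviation makes it taller and narrower or shorter and wider, but the symmetry and the central high point remain.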
Please make very sure you understand
each one of the above points, because much of what we do from this point on
assumes that you are quite familiar with the properties of the normal curve.
Normal distributions are particularly
important for a number of reasons (as we shall see), not the least of which is
that many of the important characteristics we wish to study (including all
inherited characteristics) are normally distributed. What that means is that if we gather a very large number of
samples of a particular measurement (e.g., height) and construct a frequency
distribution, the result will be normal, that is, will manifest the
characteristics listed above.
Note carefully that the normal curve is
a theoretical depiction of the distribution of frequencies of the values. It does not tell us that in any particular
series of measurements of a normally distributed item half must lie above and half
below the mean. It indicates that there
is a .5 probability that in any series of values, any particular score will lie
above or below the mean and that the average will fall in the centre of the
distribution. Or, put another way, in
any measurement of a heritable characteristic (height, intelligence, weight,
and so on) 50 percent of the population will be below the arithmetical average
(the mean), because such characteristics are normally distributed. It is not the case that in any distribution
exactly 50 percent of the population will fall below the mean—but that must be
the case if the frequency distribution is a normal curve.
Not all values are normally distributed
(please remember that): for example, the salaries of those working at Malaspina
University-College, the responses to a public opinion questionnaire, levels of
contaminant in the Georgia Strait. But
what makes this particular frequency distribution so important is that a great
many things in our world are normally distributed (e.g., population heights,
mortality rates, stock market fluctuations, yearly temperature averages, girth
of trees, all repeated human measurements of a single natural phenomenon,
heritable characteristics, and so on).
It is an enormously useful and important analytical concept (2).
We have noted above some of the
properties of the normal curve (most frequent value is at the centre, symmetry
about the central value, diminishing frequency with the distance from the
centre). However, there are many more
important features.
You may have noticed that the shape of
the curve in a normal distribution has a clear point on each side where the
slope goes from convex (bulging outward) to concave (bending inwards). If you were walking up the curve you would
notice that at first the slope increases, but at a particular point it would
begin to decrease as you approach the summit.
The point at which this occurs is called the point of inflection.
If one draws a perpendicular line from the
points of inflection, one on either side of the mean, to the base line (the
X-axis) then the distance from that point to the value of the mean on the
X-axis (in the centre) is equal to the standard deviation. Make sure you understand this very important
property of the normal curve.
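This property can be checked by "walking up" a curve numerically and noting where the slope is steepest. The sketch below assumes an illustrative curve with a mean of 68 and a standard deviation of 4 (our choice, not a required value) and uses the NormalDist class from Python's standard statistics module.

```python
from statistics import NormalDist

# Illustrative values only: mean 68, standard deviation 4.
curve = NormalDist(mu=68, sigma=4)
step = 0.01

# Walk up the left-hand side of the curve, from 3 SD below the mean to the mean.
n_steps = 1200
xs = [56 + i * step for i in range(n_steps)]

# Approximate the slope over each small step (a finite difference).
slopes = [(curve.pdf(x + step) - curve.pdf(x)) / step for x in xs]

# The slope stops increasing where it is steepest: the point of inflection.
steepest_x = xs[slopes.index(max(slopes))]

print(round(steepest_x, 1))                # should be close to 64, i.e. 68 - 4
print(round(curve.mean - steepest_x, 1))   # should be close to the SD of 4
```

The distance from the steepest point to the mean comes out at (very nearly) one standard deviation, as the text describes.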
Note that these two perpendicular lines
drawn from the points of inflection on either side of the mean divide the area
under the curve further, so that we now have four separate areas, as follows
(see diagram on the next page):
1.
The area
between the mean and one standard deviation above the mean (Area A);
2.
The area
between the mean and one standard deviation below the mean (Area B);
3.
The area
to the right of one standard deviation above the mean (Area C);
4.
The area
to the left of one standard deviation below the mean (Area D).
Since the normal curve is perfectly
symmetrical, Area A will equal Area B, and Area C will equal Area D. And the total of A, B, C, and D will equal
the total area under the curve (i.e., the entire population). Since the curve never quite touches the
X-axis at either end, there may be a value beyond the tails (a highly
improbable value), but its frequency will be so low that we can virtually
ignore it.
Mathematical calculations indicate that
in any normal distribution, no matter
what its height or width, about 68 percent of all the observations fall
within one standard deviation from the mean (i.e., in Areas A and B
combined). Thus, 34 percent will lie
between the mean and 1 standard deviation above the mean (in Area A), and 34
percent between the mean and 1 standard deviation below the mean (in Area
B). Hence, in a normal distribution 32
percent of the observations will fall outside 1 standard deviation, 16 percent
on either side (i.e., 16 percent of the population will fall in Area C and 16
percent in area D).
We may express this, more
appropriately, in the language of probability, as follows: in any normal
distribution, there is approximately a .68 probability that a particular value
will fall within 1 standard deviation (SD) of the mean; there is approximately
a .34 probability that a particular value will lie between the mean and 1 SD
above the mean (in Area A) and approximately a .34 probability that a
particular value will lie between the mean and 1 SD below the mean (in Area
B). Similarly, there is approximately a
.16 probability that a particular value will lie higher than 1 SD from the mean
(in Area C), and approximately a .16 probability that a particular value will
lie lower than 1 SD below the mean (in Area D).
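The four areas can be computed directly rather than taken on trust. This sketch uses the standard normal curve (mean 0, standard deviation 1) from Python's statistics module; because the percentages are the same for every normal curve, that particular choice does not matter. The variable names are ours.

```python
from statistics import NormalDist

standard = NormalDist()   # mean 0, standard deviation 1

area_a = standard.cdf(1) - standard.cdf(0)    # mean to 1 SD above
area_b = standard.cdf(0) - standard.cdf(-1)   # mean to 1 SD below
area_c = 1 - standard.cdf(1)                  # beyond 1 SD above
area_d = standard.cdf(-1)                     # beyond 1 SD below

for name, area in (("A", area_a), ("B", area_b), ("C", area_c), ("D", area_d)):
    print(name, round(area, 4))

print(round(area_a + area_b + area_c + area_d, 4))   # the whole area under the curve
```

Areas A and B each come to about .34 and Areas C and D to about .16, matching the rounded probabilities given above.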
The diagram below illustrates the areas
under the normal curve for one and two standard deviations above and below the
mean (i.e., this is the same as the previous diagram, except that the vertical
lines indicating two standard deviations from the mean have been added to it,
thus creating six areas under the curve).
The vertical lines represent the mean
(at the centre), and distances of 1 and 2 standard deviations on either side of
the mean. As before, Area A and Area B
are equal, each defined by the mean and 1 standard deviation on either side of
it. Each of these areas (A and B)
contains approximately 34 percent of all the values in a normal distribution.
Area C and Area D, which are also
equal, are defined by the vertical lines representing 1 and 2 standard
deviations from the mean (on either side).
Each of these areas will contain approximately 13.5 percent of all the
values in a normal distribution.
Areas E and F, at the extreme ends
of the curve, are defined as the areas marked off by the vertical lines
representing 2 standard deviations and the tail ends of the curve. Each of these areas will contain approximately 2.5 percent
of all the values in a normal distribution (i.e., in a normal distribution, 5
percent of the population will be beyond 2 standard deviations: 2.5 above the
mean, and 2.5 below the mean).
If we continued to draw standard
deviation vertical lines to mark off three standard deviations from the mean
(not shown on the diagram), we would have two very small areas at the extreme
tips of the curve to indicate the values lying more than three standard
deviations from the mean. Together these two areas
contain approximately .3 percent of all the values in the normal distribution.
The same information given in the above
paragraphs in terms of percentages can be restated in the language of
probability as follows:
1.
In any
normal distribution, there is a .34 probability that any particular value will
fall between the mean and 1 standard deviation above the mean (in Area A), a
.34 probability that any particular value will fall between the mean and 1
standard deviation below the mean (Area B); furthermore, there is a .135
probability that any particular value will fall between 1 and 2 standard
deviations above the mean (Area C) and a .135 probability that any particular
value will fall between 1 and 2 standard deviations below the mean (Area
D). Finally, there is a .475
probability that any particular value will fall within 2 standard deviations
above the mean (somewhere in Areas A and C) and a .475 probability that any
particular value will fall within 2 standard deviations below the mean
(somewhere within Areas B and D).
2.
Further
analysis of the mathematics of normal curves reveals that the area contained by
the perpendicular lines representing 3 standard deviations from the mean
contains 99.7 percent of the area under the curve and thus represents 99.7
percent of all the scores in the data set.
In other words, there is a 99.7 percent chance (or p = .997) that in any normal distribution, any particular value
will fall within 3 standard deviations from the mean (3).
3.
Thus, the
areas beyond three standard deviations contain only .30 percent of the total
area. This means that in a normally
distributed characteristic, the probability of a value lying more than three
standard deviations from the mean is .003, or .0015 at the top end (above the
mean) and .0015 at the bottom end (below the mean). Thus, it is very rare indeed (but not impossible) for an observed
value in a normal distribution to occur more than 3 standard deviations from
the mean.
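The figures for two and three standard deviations can be recomputed in the same spirit. The following sketch again relies on Python's statistics.NormalDist; the helper names within and band are our own.

```python
from statistics import NormalDist

standard = NormalDist()   # mean 0, standard deviation 1

def within(k):
    """Share of the area lying within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

def band(low, high):
    """Share of the area between low and high SD on ONE side of the mean."""
    return standard.cdf(high) - standard.cdf(low)

for k in (1, 2, 3):
    # inside k SD, and outside k SD (both tails together)
    print(k, round(within(k), 4), round(1 - within(k), 4))

print(round(band(1, 2), 4))            # between 1 and 2 SD on one side
print(round(1 - standard.cdf(2), 4))   # beyond 2 SD on one side
```

The computed values are slightly different from the rounded ones used in this module (for example, about 2.3 percent rather than 2.5 percent beyond 2 standard deviations on each side), which is why the text calls its figures approximate.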
This mathematical information about a
normal curve provides enormously valuable information. For if we know that a population is normally
distributed (i.e., that the frequency distribution in the population follows a
normal curve), then if we know the mean of that curve and the standard
deviation, we know the probabilities of any particular value falling within
specified areas of the curve. We can
thus make some important predictions about that population.
For instance, suppose we know that the
height of men in a population (say, in Prince George) is normally distributed,
that the mean height (from a sample we collect) is 68 in., and the standard
deviation is 4 in. We then know the
probabilities for the distribution of heights in Prince George, as follows:
Approximately 34 percent of the men
will be between 68 in. (the mean) and 72 in. (1 SD above the mean, 68 + 4);
approximately 34 percent will be between 68 in. (the mean) and 64 in. (1 SD
below the mean, 68 - 4); approximately 13.5 percent will be between 72 in. and
76 in. (between 1 SD above the mean and 2 SD above the mean); and approximately
13.5 percent will be between 64 in. and 60 in. (between 1 SD and 2 SD below the
mean); and approximately 2.5 percent will be between 76 in. and 80 in. (between
2 and 3 SD above the mean); and approximately 2.5 percent will be between 60
in. and 56 in. (between 2 SD and 3 SD below the mean).
Thus, if a child of yours informs you
that she is getting married to some man from Prince George, you already know
some important things about your prospective son-in-law, even though you have
never met.
There is a .34 probability that his
height will be between 68 in. and 72 in.; there is a .34 probability that his
height will be between 68 in. and 64 in.; or, putting these two together, that
there is a .68 probability that his height is between 64 in. and 72 in.
We could obviously continue this
analysis to take into account all the percentage frequencies indicated by the
normal curve.
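As a sketch of how such predictions might be computed, the following uses the mean of 68 in. and standard deviation of 4 in. from the Prince George example; the function name share_between is ours, and the results are the exact theoretical values rather than the rounded ones used in the text.

```python
from statistics import NormalDist

# Heights from the example: mean 68 in., standard deviation 4 in.
heights = NormalDist(mu=68, sigma=4)

def share_between(low, high):
    """Probability that a height falls between low and high (in inches)."""
    return heights.cdf(high) - heights.cdf(low)

print(round(share_between(68, 72), 3))   # mean to 1 SD above
print(round(share_between(64, 72), 3))   # within 1 SD of the mean
print(round(share_between(72, 76), 3))   # between 1 and 2 SD above
print(round(1 - heights.cdf(76), 3))     # taller than 76 in.
```

The same function answers questions the rounded table cannot, such as the probability of a height between 70 in. and 71 in.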
Now, this mathematical analysis of the
normal curve holds for the frequencies of any value which is normally
distributed. Once we know the mean and
the standard deviation, we are able to predict the probability of the value for
any particular member of the population.
And this process is possible, to repeat the point, for any measurable
factor whose frequencies are normally distributed (e.g., mortality rates, some
test scores, volume of wood in trees, and so on). Thus, once we know that a characteristic is normally distributed,
and what the values are for the mean and the standard deviation, we are in a position
to draw a number of conclusions about the probable distribution of the entire
population.
It is vitally important for an initial
understanding of statistics to grasp the point that the features of the normal
curve apply to all distribution frequencies of normally distributed items. Normal curves may have many different
heights and widths, but in all cases, these characteristics apply:
1.
The mean,
median, and mode coincide at the high point of the curve and divide the results
into two equal and perfectly symmetrical halves.
2.
Of all
the scores in a perfectly normal distribution, approximately 34 percent will
lie between the mean and 1 Standard Deviation above the mean, and approximately
34 percent will lie between the mean and 1 Standard Deviation below the mean.
3.
Of all
the scores in a perfectly normal distribution, approximately 95 percent will
fall between the lines representing 2 Standard Deviations from the mean (i.e.,
about 27 percent of all scores will fall between 1 and 2 standard deviations,
with 13.5 percent on either side of the curve).
4.
Of all
the scores, approximately 99 percent will lie between the lines indicating 3
standard deviations from the mean (i.e., approximately 5 percent of the sample
will fall between 2 and 3 standard deviations, or approximately 2.5 percent on
either side of the mean).
Note that these characteristics hold
for any normal distribution regardless of the height or width of the normal
curve. Thus, once we know that the
frequencies of a particular mathematical measurement are normally distributed,
we know that the above groupings of the results should occur in any very large
sample.
1.
The
duration times of a certain brand of battery are normally distributed, with a
mean of 80 hours and a standard deviation of 10 hours. As a marketing gimmick, the manufacturer
decides to guarantee to replace any battery which fails prior to a certain
time. Approximately how long a guarantee
should the company provide so that no more than 2.5 percent of the batteries
fail prior to the guaranteed time?
2.
You have
a contract to make one thousand uniforms for the Canadian navy. The heights of sailors are normally
distributed, with a mean of 69 inches and a standard deviation of 2 inches. What percentage of the uniforms will have to
fit sailors shorter than 67 inches?
What percentage will have to be suitable for sailors taller than 73
inches?
3.
Let us
assume the results from all large tests are normally distributed. In the final results for Subject A, the mean
percentage score is 80 and the Standard Deviation 5. In Subject B, the mean percentage score is 70 and the Standard
Deviation 2.5. Suppose you score 75
percent in both courses. What
percentage of students received results better than you in Subject A and in
Subject B? What is your percentile rank
in each subject?
It is important to grasp the point that the
bell-like shape of a normal distribution only occurs with a great many samples from
normally distributed data. In fact, in
any quality normally distributed (e.g., any heritable quality, like height), as
Bernoulli’s Theorem tells us, the frequency distribution of the results will
get closer and closer to the shape of a normal distribution as we increase the
number of measurements (i.e., data in the sample).
To follow this point more clearly, consider the
following diagrams. They represent the
frequency distributions of random numbers taken from a population of numbers
which is known to be normally distributed (Excel generated the numbers and
produced the charts). In this case, the
mean of the total population is 10 and the standard deviation 3 (chosen
arbitrarily).
The first diagram illustrates the frequency
distribution for a sample of 100 numbers.
You will notice that it does not look very bell-like. The second diagram illustrates the frequency
distribution for a sample of 1000 numbers.
You can see that the characteristic shape of the normal distribution is
beginning to emerge.
The final two diagrams illustrate the frequency
distributions for samples of 2000 and 3000 numbers respectively. Clearly, the final diagram, although still
not a perfect bell curve, approximates much more closely than any of the others
the characteristic shape of the normal distribution. A larger sample (say, 10,000) would look even closer to the
symmetrical bell shape.
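The same experiment can be sketched in code instead of a spreadsheet. The example below draws samples from a normal population with mean 10 and standard deviation 3 (the values used above) and reports, for each sample size, the largest gap between the observed share of results in any one-unit-wide column and the share the normal curve predicts. The function name worst_gap, the column width, and the seed are our own choices, and the exact numbers printed depend on the random numbers drawn.

```python
import random
from statistics import NormalDist

population = NormalDist(mu=10, sigma=3)   # values used in the text

def worst_gap(sample_size, seed=0):
    """Largest gap between observed and theoretical share in any 1-unit column."""
    rng = random.Random(seed)
    sample = [rng.gauss(population.mean, population.stdev)
              for _ in range(sample_size)]
    gaps = []
    for low in range(-5, 25):   # columns [low, low + 1), covering 5 SD either side
        observed = sum(low <= x < low + 1 for x in sample) / sample_size
        expected = population.cdf(low + 1) - population.cdf(low)
        gaps.append(abs(observed - expected))
    return max(gaps)

for n in (100, 1000, 2000, 3000, 10000):
    print(n, round(worst_gap(n), 4))
```

The gap will not necessarily shrink at every step, since each sample is random, but it tends to get smaller as the sample grows, which is the numerical counterpart of the histogram looking more and more bell-like.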
When we are dealing with random number generation from
a population which is not normally distributed but which is uniformly random,
then increasing the size of the sample is not going to bring the histogram
any closer to a bell shape.
Below, for example, are histograms for 1000 and
for 2000 numbers between 1 and 400 randomly generated, but this time from a
population which is not normally distributed.
Notice that there is no emerging bell curve shape as one increases the
number of samples from 1000 to 2000.
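One simple numerical check, offered here only as a sketch, is to ask what share of a sample lies within one standard deviation of its own mean. For normally distributed data that share is about 68 percent; for whole numbers drawn uniformly between 1 and 400 it stays near 58 percent however large the sample. The function name share_within_one_sd and the seed are our own choices.

```python
import random
from statistics import mean, pstdev

def share_within_one_sd(values):
    """Share of the values lying within 1 standard deviation of their mean."""
    centre, spread = mean(values), pstdev(values)
    return sum(abs(v - centre) <= spread for v in values) / len(values)

rng = random.Random(0)   # fixed seed so the run can be repeated
for n in (1000, 2000):
    sample = [rng.randint(1, 400) for _ in range(n)]   # uniform, not normal
    print(n, round(share_within_one_sd(sample), 3))
```

Doubling the sample does not move the share toward 68 percent, because the underlying population is not normal.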
It is particularly important that you
take away from this section and the previous sections a clear sense of the
meanings of the following key terms: mean, standard deviation, z-score (positive and negative), normal
distribution, normal curve.
In addition, you must retain a clear
sense that knowing the standard deviation and the mean of a certain normal
curve enables one to ascertain the probability that certain results will fall
within a certain distance of the mean.
Furthermore, from now on we assume that
students are all familiar with the concept that the area under the normal curve
indicates the theoretical distribution of frequencies in any normally
distributed data. Various areas under
the curve represent the various probabilities that any one score will fall
within the designated area. Thus, the
smaller the area for any group of scores, the smaller the probability that any
score in that group will occur. The
tails of the curve (beyond 3 standard deviations) contain very small areas, and
thus the probabilities of scores within those areas are very low (less than
.01).
As a rough guide, remember that the
majority (approximately 68 percent) of all scores in a normal distribution
should fall within 1 standard deviation of the mean (or between a z-score of +1 and -1); almost all (approximately 95
percent of the scores) should fall within 2 standard deviations of the mean
(or between a z-score of +2 and -2);
and the probability of a score falling within 3 standard deviations of
the mean is approximately 100 percent.
This does not mean that it
is impossible for a score in a normal distribution to fall further than 3 SD
from the mean, simply that such a result is very rare (the value of p is close to 0).
Remember, too, that these characteristics
refer only to data which is normally distributed. These figures do not apply in other sorts of distributions (in
which the shape of the frequency curve will be different).
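Since the z-score does much of the work in this summary, here is a minimal sketch of it, assuming the usual definition (the distance of a value from the mean, measured in standard deviations). The heights used are illustrative, with a mean of 68 in. and a standard deviation of 4 in.

```python
def z_score(value, mean, sd):
    """Standard deviations by which value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

print(z_score(76, 68, 4))   # 2.0: two standard deviations above the mean
print(z_score(64, 68, 4))   # -1.0: one standard deviation below the mean
```

A z-score between -1 and +1 therefore places a value in the central 68 percent of a normal distribution, and one between -2 and +2 in the central 95 percent.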
You will understand very little of what
comes in the next sections if you have not grasped clearly the above
information.
1.
The
manufacturer does not want to return more than 2.5 percent of his
batteries. Since the lifetime of the
batteries is normally distributed, we know that approximately 95 percent of them will fall within
2 standard deviations of the mean, that is, between 80 + 2SD and 80 - 2SD, or 80
+ 20 and 80 - 20, or between 100 hr and 60 hr.
Thus, 5 percent of the population of batteries will fall outside this
range, 2.5 percent above and 2.5 percent below. We are not worried about the batteries above this range, because
owners are not going to complain about batteries lasting longer; the area of
the population we are concerned with is the 2.5 percent below 2 standard
deviations (i.e., below 60 hr).
Therefore, the manufacturer should set his guarantee at 60 hr.
2.
Sailors
shorter than 67 inches fall into an area of the normal curve from the lower
extremity to the line marking 1 SD below the mean (since the mean is 69 in. and
the Standard Deviation 2 in.). In a
normal distribution, the area to the left of 1SD below the mean is
approximately 16 percent of the total population. Similarly, sailors taller than 73 in. fall into the area more than 2 SD to
the right of the mean. In a normal
distribution, the area more than 2 SD to the right of the mean is equal to 2.5
percent of the total population.
3.
In
Subject A your score of 75 is 5 marks below the mean (of 80). Since the Standard Deviation is 5, this is equivalent to 1 Standard Deviation
below the mean (or a z-score of
-1). Since the marks are normally
distributed, the percentage of students getting better marks than you includes
the entire population to the right of one Standard Deviation below the mean, or
84 percent. In Subject B your mark of
75 is 5 marks above the mean (or a z-score
of +2, since the Standard Deviation is 2.5). Thus, the students who did better than you are those in the area
to the right of two Standard Deviations above the mean, or 2.5 percent. The percentile rank is the percentage of
students who fared worse than you.
Thus, in the first test, you have a percentile score of 16; in the
second test you have a percentile score of 97.5.
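The three answers can be cross-checked against the exact areas under the normal curve. The following is a sketch using Python's statistics.NormalDist with the means and standard deviations given in the exercises; the variable names are ours.

```python
from statistics import NormalDist

# 1. Batteries: mean 80 hr, SD 10 hr. Lifetime below which 2.5 percent fail.
batteries = NormalDist(mu=80, sigma=10)
guarantee = batteries.inv_cdf(0.025)

# 2. Sailors: mean 69 in., SD 2 in.
sailors = NormalDist(mu=69, sigma=2)
shorter_than_67 = sailors.cdf(67)
taller_than_73 = 1 - sailors.cdf(73)

# 3. Test results: share of students scoring better than 75 in each subject.
subject_a = NormalDist(mu=80, sigma=5)
subject_b = NormalDist(mu=70, sigma=2.5)
better_a = 1 - subject_a.cdf(75)
better_b = 1 - subject_b.cdf(75)

print(round(guarantee, 1))
print(round(shorter_than_67, 3), round(taller_than_73, 3))
print(round(better_a, 3), round(better_b, 3))
print(round(100 * (1 - better_a), 1), round(100 * (1 - better_b), 1))   # percentile ranks
```

The exact figures differ a little from the rounded answers above (a guarantee of about 60.4 hr rather than 60 hr, and about 2.3 percent rather than 2.5 percent beyond 2 standard deviations), because the module works with approximate percentages.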
(1)
The adjective normal does not mean "usual" or "customary"
(although such a distribution is, in fact, quite common), but comes from
"normative," meaning ideal.
(2)
The credit for first recognizing and developing the properties of the normal
curve is generally given to the English mathematician Abraham de Moivre, 1667
to 1754, an acquaintance of Newton's and a member of the Royal Society, who
used as his statistical laboratory the London coffee houses where all sorts of
gambling went on. The basic principle underlying
Normal Distribution is that any data which are influenced by many small and
unrelated random effects (like, for example, weight) are going to be normally
distributed (at least to a very near approximation). This principle is called the Central Limit Theorem. See Appendix E for an illustration of how
combining independent random effects produces a normal distribution.
(3) These percentage figures are
approximate. The more exact figures are
as follows: the area between the mean and one standard deviation contains 34.13
percent of all results on either side of the mean; the area between the mean
and two standard deviations contains 47.72 percent of all results on either
side of the mean; the area between the mean and three standard deviations
contains 49.87 percent of all results on either side of the mean. For a complete layout of the areas under the
normal curve at different standard deviations see Table A in Appendix B. For the purpose of our exercises we will use
the approximate values given above, except where noted.