I’ll Give You a Definite Maybe
An Introductory Handbook on Probability, Statistics, and Excel
[This handbook has been prepared by Ian Johnston of Malaspina University-College, Nanaimo, BC, for Liberal Studies students. The text here is in the public domain, released May 2000]
In the last section we examined how, on
the basis of a relatively small sample, we can, with differing levels of
confidence, draw conclusions about the population in general. In this section we examine, as our last
section on statistics, the very important business of testing performance
claims about an entire population, by sampling that population and drawing
inferences from the sample about the claim.
Suppose for example, a manufacturer
makes a certain claim about a product (e.g., that it will last for X years
without trouble, that it will effectively cure headaches, that it will live up
to clear specifications, and so on).
How do we test such a claim from a small sample, and how certain should
we be of our results? When are we entitled
to dismiss that manufacturer's claim and when should we accept it?
Or again, suppose we want to compare
two populations to ascertain whether there is a real difference between them
(e.g., the skills of males and females in a particular area, or the effects of
different medicines in the treatment of a particular disease, or the
effectiveness of different methods for educating elementary students in
reading, or the marking standards of two or more different Liberal Studies
instructors). In such inquiries, we
will need to take a sample of each population and then compare the
results. On the basis of an analysis of
each sample and the conclusions we reach about the populations the samples come
from, we should be able to determine within certain confidence levels whether
the difference between the populations is real (i.e., males do not perform as
well as females in this area or vice versa, there is a real difference between
medicine A and Medicine B in the treatment of this disease, or one method of
teaching reading is clearly better than the other) or whether there is no real
difference between the two (or more) things being compared.
The above remarks introduce the point
that an important part of statistics is testing the claims made about a
population or about two allegedly different populations. These claims typically take one of the two
following forms:
1.
This
product will live up to these specifications (e.g., the brand of light bulb
will last for 300 hr., this car will get 55 miles per gallon, this pill will
improve your memory, and so on).
2.
Group A
is significantly different from Group B in respect to X (e.g. the urban poor
suffer more from drug addiction than more affluent suburbanites; women handle
stress better than men; children who learn to read through phonics are better
readers than children who use other methods in reading instruction, Liberal
Studies Instructor X's grades are higher than Liberal Studies Instructor Y's
grades, and so on).
A claim of this type obviously might be formed negatively, in the following
form: There is no difference between Group A and Group B in respect to
attribute X (e.g., the claim that women are not as good as men in mathematics
is wrong: men and women are equal in this respect).
The method of testing such claims
begins by collecting the relevant samples.
For instance, in testing a claim of the first form above, we take a random
sample of the product (more than 30 items), and in the case of a claim of the second form
above we take a random sample from each of the two populations being compared.
The next stage involves forming two
hypotheses, as follows:
1.
The Null Hypothesis is, in the case of
the first type of claim, the assertion that there is no difference between the sample we
have collected and the general population established by the claim (for
examples see below). In the case of the
second type of claim, the Null Hypothesis is that there is no difference
between the two populations in the study (e.g., between men and women, between
the urban poor and the suburbanites, between Liberal Studies instructors, and
so on).
2.
The Alternative Hypothesis is, in respect
to the first sort of claim, the assertion that the sample we have taken belongs
to a population different from that described in the original claim (thus the
claim is suspect). In respect to the
second type of claim, the alternative hypothesis is that the samples indicate
that the two populations under scrutiny are, indeed, different.
The testing of the original claim is an
attempt to refute the Null Hypothesis. If we can refute the Null Hypothesis,
then we will affirm the Alternative Hypothesis. If, however, we fail to refute the Null Hypothesis, then we
accept it.
Note the basis of this method. If we discover that, in the first type of
claim, the sample comes from the population described in the claim, or, in the
second type of claim, that the samples we have collected probably come from the
same population, then the Null Hypothesis is upheld (no difference); if our
samples do not satisfy the requirement that they probably come from the same
population, then the Null Hypothesis is refuted, and we can uphold the
Alternative Hypothesis.
Let us consider a test of a claim of
the first type. Suppose, in an attempt
to justify their demands for higher wages on the ground of increased
productivity, the employees in a factory report that, on the average, the
workers complete an individual task in 13 minutes. As a general manager, what can you conclude from a study of 400 workers
which shows an average completion time of 14.25 minutes, with a standard
deviation of 10 minutes?
The Null Hypothesis here is that there
is no difference between the claim made by the workers and the results of the
survey. In other words, they are
claiming that, although the average completion time in the sample is higher
than their claim, the difference between that sample average and their claim is
not significant. The Alternative
Hypothesis is that there is a difference and that, therefore, the workers'
claim is not valid (in other words, the Alternative Hypothesis maintains that
the difference between the sample average and the workers’ claim is
significant). So the question becomes
the following: What is the probability that the results in the sample come from
the same population as that established in the workers' claim?
We begin by calculating the standard
error of the sample. Remember that this
will tell us the standard deviation in the normal curve of the averages of all
the samples we could take. The standard
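error is given by the following formula:
$$SE = \frac{s}{\sqrt{n}}$$
where $s$ is the standard deviation of the sample and $n$ is the number of items in the sample. Thus,
$$SE = \frac{10}{\sqrt{400}} = \frac{10}{20} = 0.5$$
The standard error here is 10 divided
by 20, or 0.5 minutes.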
Now, the workers claim that the average
length for an individual task is 13 min.
The sample gave the result for the same task of 14.25 min. Thus, the difference here between the sample
mean and the mean in the claim is 1.25 minutes. So how significant is a difference of 1.25 minutes?
The standard error of 0.5 is the
standard deviation in the curve of sample means. So the difference of 1.25 minutes is equivalent to 1.25 divided
by 0.5, or 2.5 standard errors, or, alternatively put, the difference of 1.25
minutes has a z-score of 2.5 in the
normal curve of sample means.
A z-score
of 2.5 lies beyond two standard deviations.
And we know that approximately 95 percent of all results normally
distributed will occur within 2 standard deviations of the mean (i.e., for z-scores between -2 and +2). Furthermore, we know that of the 5 percent
of the results which lie beyond 2 standard deviations, 2.5 percent will be above the
mean and 2.5 percent will be below the mean.
Thus, in this case, we know that the results of the sample indicate that
the sample falls in an area with a probability of less than 2.5 percent (p < 0.025), that is, the area to the
right of the mean more than 2 standard deviations away from the mean.
This means, in effect, that there is less than
a 2.5 percent chance (p < .025) that our sample comes from the same
distribution as that described in the workers' claim. Therefore, we can say with more than 97.5
percent certainty that the sample we took is not from the same population as
that established by the workers' claim and that, therefore, the Null Hypothesis
is not valid and that the Alternative Hypothesis is upheld. Thus, we conclude that the workers' claim is
false.
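As a check on the arithmetic above, the short Python sketch below recomputes the standard error and the z-score for the workers' example and then looks up the exact area in the upper tail of the normal curve. The function name `one_sample_z` is a label chosen here for illustration, not something from the handbook, and the sketch assumes that the sample means are normally distributed.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(claimed_mean, sample_mean, sample_sd, n):
    """Return (standard error, z-score) for a sample mean against a claimed mean."""
    standard_error = sample_sd / sqrt(n)
    z = (sample_mean - claimed_mean) / standard_error
    return standard_error, z


# Figures from the workers' example: claim 13 min, sample of 400, mean 14.25, SD 10.
se, z = one_sample_z(claimed_mean=13, sample_mean=14.25, sample_sd=10, n=400)

# Area of the standard normal curve lying above this z-score (upper tail only).
upper_tail = 1 - NormalDist().cdf(z)

print(f"standard error = {se:.2f} min")
print(f"z-score        = {z:.2f}")
print(f"upper-tail p   = {upper_tail:.4f}")
```

The exact tail area for a z-score of 2.5 is roughly 0.006, which is smaller than the 2.5 percent ceiling given by the rule of thumb for 2 standard deviations. The conclusion about the workers' claim is the same either way.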
A dealer selling batteries makes the
claim that the average life of her product is 50 hours. We take a random sample of the product (100
batteries) and discover that the average life of this sample is 49 hr, with a
standard deviation of 5 hr. Does this
entitle us to dismiss the dealer's claim as bogus or does it have no effect on
the dealer's claim?
The Null Hypothesis here is that the
sample we have taken comes from the same population of batteries as those
described in the dealer's claim (i.e., those with a life expectancy of 50
hr). The Alternative Hypothesis is that
our sample reveals that the population of batteries from which we took the sample
is different from the one described in the dealer's claim (and that therefore
the dealer's claim about her batteries is false).
Now, as we saw in the last section, we
can calculate the standard error of the mean (the standard deviation of the
sample divided by the square root of the number of items in the sample), or 5 divided by
the square root of 100, or 5 divided by 10, or 0.5 hr.
So now we can evaluate the dealer's
claim. She states that the mean life of
her batteries is 50 hr. How confident
can we be that the result we obtained (49 hr) refutes that claim?
To estimate that probability we need to
find out the distance an average of 49 hr is from the mean time stated by the
dealer. That value is 1 hr below the
dealer's stated mean, or 1 unit to the left of the dealer's mean on the normal
curve (or -1 hr).
What is the z-score for this -1 hr value?
Well, the z-score is the distance from the mean in standard deviation
units. So to obtain that, we divide -1
by the standard error. If we divide -1
by .5, we get a z-score of -2.
But we know from our study of the
normal curve that a z-score of -2 or
lower (that is, a score falling 2 or more standard deviations to the left of
the mean) marks off an area indicating approximately 2.5 percent at the lower
end from the rest of the curve. Thus,
the mean of our sample falls in the tail end of the normal curve representing
all the sample means, in an area representing frequency values of 2.5 percent.
Therefore, there is a 2.5 percent
probability (or p = .025) that the
Null Hypothesis is correct and that the dealer's claim is true. There is a 97.5 percent probability (p = .975) that the Alternative
Hypothesis is correct and that the dealer's claim is false.
Or, alternatively put, the risk of
dismissing the Null Hypothesis is 2.5 percent (p = 0.025). If we do
dismiss the Null Hypothesis (and thus the dealer's claim), we have a .025
chance of being wrong.
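The arithmetic of the battery example can be set out in one line. The symbols are shorthand assumed here for convenience ($\bar{x}$ for the sample mean, $\mu_0$ for the mean in the dealer's claim, $s$ for the sample standard deviation, $n$ for the sample size); the handbook itself gives the steps only in words.

```latex
\[
z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
  = \frac{49 - 50}{5 / \sqrt{100}}
  = \frac{-1}{0.5}
  = -2
\]
```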
Is that a risk worth taking? That will depend on how certain we want to
be before making the decision. In other
words, we will need to set a level of significance.
The level of risk we are willing to set
in order to pass judgement or withhold judgement on the Null Hypothesis depends
upon the level of significance we
wish to set (i.e., the size of the risk we are prepared to take or how
stringent we want our test to be).
Statisticians have set arbitrary limits of .05 or .01: that is, a
significance level of .05 for rejecting the Null Hypothesis is not as stringent
as a significance level of .01. The level
of significance will usually be set out in the specifications of the test. The application of these levels of
significance will become clearer in the examples below.
The level of significance we set
indicates how careful we wish to be in making a judgement about the Null
Hypothesis. If the limit is .05, that
means that if there is only a 5 percent or less chance of being wrong (or if we
are 95 percent or more certain), then we shall accept that as decisive. In such a case, our result above about the
batteries (a probability of .025 that the Null Hypothesis is correct) is
significant, and we will dismiss the dealer's claim (since .025 is less than
.05, and the significance level indicates the cut off point, below which we
should not accept a claim with such a low probability).
If, however, we have set a level of .01
probability as our significance point, that means we will not reject a claim
unless the probability of its being correct is 1 percent or lower (or, alternatively put, until we
are 99 percent or more certain that it is wrong). In
that case, the result of our test of the batteries (p = .025) is above our
significance level and we would therefore not dismiss the claim.
Notice what we are saying here. The analysis of the sample reveals that
there is only a .025 probability that the population from which our sample was
drawn is the same population as that described in the dealer's claim (batteries
with an average life of 50 hr). If we
are confident enough at a probability of .95 (or 95 percent) then we will
accept this result as significant and assert that our sample establishes that
we are .95 confident that the dealer's claim is erroneous.
If, however, the level of significance
is set at .01, then we will reject the Alternative Hypothesis and accept the
Null Hypothesis (the dealer's claim), because our result of .025 is not below
the level of significance we have set (a much more stringent requirement than
the earlier figure of .05).
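The decision rule described in the last few paragraphs is simple enough to write down as a few lines of Python. This is only a sketch: the function name `decide` and the wording of the verdicts are choices made here, and the probability of .025 is the approximate figure the text uses for the battery example.

```python
def decide(p_value, significance_level):
    """Reject the Null Hypothesis only if p falls below the significance level."""
    if p_value < significance_level:
        return "reject the Null Hypothesis"
    return "do not reject the Null Hypothesis"


# Battery example: the text's approximate tail probability is .025.
battery_p = 0.025

for level in (0.05, 0.01):
    print(f"significance level {level}: {decide(battery_p, level)}")
```

The same sample therefore leads to opposite verdicts at the two levels, which is why the level has to be fixed before the test is carried out.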
Note very carefully that if we decide
that we are not going to take the risk of dismissing the dealer's claim because
we are not sufficiently certain, we have, in effect, accepted the Null
Hypothesis and rejected the Alternative Hypothesis. But that does not mean that we have clearly "proved"
the dealer's claim. After all, we have
established that the probability that the dealer's claim is right is only .025,
or 25 cases out of 1000. We would, as
prudent statisticians, accept the dealer's claim but reserve judgement on any
exact determination of the question (i.e., cover our backsides).
F. Type I and Type II Errors
In any test of claims like the
battery or the workers' productivity examples above, there is generally some
risk that we may be rejecting a true hypothesis; that is, the probability that
we may be wrong is greater than 0.
Rejecting a Null Hypothesis when it is true is called a Type I
Error. On the other hand, we may
decline the risk and accept the Null Hypothesis when it is, in fact,
false. This is called a Type II Error.
The level of significance we set in
deciding whether to accept or reject the Null Hypothesis depends upon which of
these two errors we most wish to avoid.
If there are very serious consequences in a Type I error, then we should
seek to minimize the risk, by setting the level of significance at .01. Such a stringent level means that we are
more likely to accept the Null Hypothesis than we are at the .05 level, since
the relevant z-score will have to be
more than 3 rather than more than 2 (or, using the exact figures, more than
2.58 rather than more than 1.96).
But the less risk we are willing to
take (in order to minimize Type I errors) the more we are likely to fall into
Type II errors. Setting very strict
limits for rejecting the Null Hypothesis will increase the chances that we are
accepting one that is false.
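The trade-off can be made concrete with a small calculation. The sketch below reuses the battery figures from the earlier example and asks how often we would keep the Null Hypothesis if the batteries really did fall short of the claim. The "true" mean of 49 hr is an invented value used only for illustration, and the function name `type_ii_risk` is likewise an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_risk(claimed_mean, true_mean, sd, n, significance_level):
    """Chance of keeping the Null Hypothesis when the true mean is lower than claimed.

    One-tail (lower) test: we reject only if the sample mean falls below a cut-off.
    """
    standard_error = sd / sqrt(n)
    # z-score that leaves `significance_level` of the curve in the lower tail.
    critical_z = NormalDist().inv_cdf(significance_level)
    cutoff = claimed_mean + critical_z * standard_error
    # Sample means are centred on the true mean; any sample mean above the
    # cut-off leads us to keep the (false) Null Hypothesis.
    return 1 - NormalDist(mu=true_mean, sigma=standard_error).cdf(cutoff)


# Battery figures from the text (claim 50 hr, SD 5 hr, n = 100).
# The true mean of 49 hr is a made-up value chosen only for illustration.
risk_at_05 = type_ii_risk(50, 49, 5, 100, 0.05)
risk_at_01 = type_ii_risk(50, 49, 5, 100, 0.01)

print(f"Type I risk .05 -> Type II risk {risk_at_05:.2f}")
print(f"Type I risk .01 -> Type II risk {risk_at_01:.2f}")
```

Under these assumptions, tightening the significance level from .05 to .01 raises the chance of a Type II error from roughly 0.36 to roughly 0.63.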
To use an educational analogy: if we
relax our standards of admission, we run the risk of admitting students who are
not academically capable of the particular course of study we administer (Type
II error); however, if we tighten up our entrance standards, we run the risk of
rejecting students who are in fact academically suitable (Type I error). We will decide between these two courses of
entry policy on the basis of which mistake will have the more serious
consequences.
Or, to use another example, suppose in
a courtroom use of a forensic test of DNA, we are using a probability study to
determine the guilt or innocence of an accused by comparing two tissue samples,
one from the body of the victim and one from the defendant's body. In such a case, acceptance of the Null
Hypothesis would confirm the lack of the difference between the two samples
(and thus help lead to a conviction); a rejection of the Null Hypothesis and an
acceptance of the Alternative Hypothesis will lead us to acquit, because the
samples come from different populations (i.e., the defendant's DNA does not
match that found at the scene of the crime).
If we are keen to give the defendant
the full benefit of the doubt (or if we are the defending lawyer), then we will
want to work with a level of significance as generous as possible (e.g., 0.05),
a number that will make it easier for us to reject the Null Hypothesis. Of course, if we do this, we may be
rejecting a hypothesis which is, in fact true (Type I error), making it easier
for a guilty person to get off. If, on
the other hand, we are keen to convict, we want to confirm the Null Hypothesis,
and thus we will set the strictest acceptable level of significance
(0.01). This will, of course, increase
the chances that we may be convicting an innocent person by accepting the Null
Hypothesis, when it is, in fact, false (Type II error).
This business of Type I and Type II
errors should remind us that the sorts of statistical tests we are applying do
not indeed "prove" anything once and for all. The tests are, in effect, a technical device
to determine whether a specific claim (a hypothesis) meets a given standard (a
level of certainty). There will always
be some risk, however slight, that the conclusion we draw from a statistical
test of a particular hypothesis is wrong.
Thus, demonstrating that a hypothesis
has passed a particularly stringent statistical test does not prove the
hypothesis beyond all doubt. It does,
however, indicate to researchers that there may very well be something in the
claim. Similarly, the rejection of a
hypothesis does not finally "disprove" it. Statistical analysis can only indicate that the claim has failed
to meet a given level of certainty.
This point also brings out how, by
apparently manipulating statistics, one can seem both to "prove"
and to "disprove" a particular claim in a single test. For it is clear that at the .05 level of
significance we may be able to reject the Null Hypothesis and accept the
Alternative Hypothesis, while at the same time at a .01 level of significance
we will have to reserve judgement and accept the Null Hypothesis.
In other words, interpreting the
conclusions of such a statistical analysis requires us to know the significance
level at which the claim is made and to be very careful about accepting
statistical results without such knowledge.
It will be clear from the above example
that once we know the z-score for the
difference between the sample mean and the mean established in the claim, we
know whether or not to reject the Null Hypothesis. A z-score of 2 or
less tells us that the sample mean lies within the central 95 percent of the overall
population of sample means. A z-score of less than 3 places the sample mean within
approximately 99 percent of that population.
Conversely, a z-score of more than 2 tells us that there is a less than .05
probability (5 percent) that this result indicates that the sample mean comes from
the general population. And a z-score of more than 3 indicates less
than a .01 probability (or 1 chance in 100) that the sample mean comes from the
general population.
You will recall that these figures for
the z-score are working
approximations. The accurate figures
(from the table introduced in the last chapter) are as follows:
For a .05 level of significance, the
exact z-score which marks the cut-off
point is 1.64. For a .01 level of
significance the exact z-score which
marks the cut off point is 2.33. Any
result higher than these z-scores
(which one we select will depend upon the level of significance we set) will
require us to reject the Null Hypothesis.
The z-scores appropriate to
other levels of significance may be determined by consulting the table.
These exact z-scores mean the following.
A z-score of +2.33 marks the
point separating 99 percent of the distribution from the 1 percent in the tail
of the curve to the right of that score (a z-score
of -2.33 similarly marks the point separating 99 percent of the distribution
from the 1 percent in the tail of the curve to the left of that score, that is,
below the mean). A z-score of +1.64 indicates the point in the normal curve separating
95 percent of the distribution from the
5 percent at the extreme right-hand end; thus, 95 percent of the distribution
has a lower z-score than +1.64.
These particular z-scores (1.64 and 2.33) are useful when we are interested only in
whether our sample mean is different in one direction from the mean in the
claim (bigger or smaller but not both together). For example, in an investigation of the dealer's claim, we were
not interested in whether the batteries last, on average, longer than the
dealer claims. We wanted to know
whether the batteries failed to live up to the dealer's specifications. In other words, we were interested only in
one half of the distribution curve of the sample means, the lower half. We wanted to know, with 95 percent
certainty, whether that claim was true or not.
To get such 95 percent certainty, we
needed to locate the line which separates the area representing 95 percent of
the normal curve from the lower 5 percent.
That line is given by the z-score
of -1.64. If we wanted 99 percent
certainty, we would need the z-score which
indicates the line separating the lowest 1 percent from the rest of the
distribution; that z-score is -2.33.
Similarly, in dealing with the workers'
claim about more productivity, the manager is not interested in whether or not
the workers do the job in less time than they claim. He wants to find out the point at which he can be 95 percent
certain that the mean of the sample is greater than the mean established in the
claim. Once again, the relevant point
for 95 percent certainty is given by a z-score
of 1.64 (and 99 percent certainty by a z-score
of 2.33).
Such tests, in which we are interested
only in one direction in the curve, are called one-tail tests. In them,
the Alternative Hypothesis will involve a statement with the phrase "is
less than" or "is more than," but not both.
In some tests, however, we are
concerned with whether the sample mean is above or below the mean in the claim
(i.e., both directions at once). And
thus we have to be concerned about both ends of the distribution, both those
above and those below the mean in the claim.
Such tests are called two-tail
tests. They involve an Alternative
Hypothesis with a phrase "is greater or less than" or "is
different from."
In such a test, the z-scores which establish the
significance limits are, for .05 probability, 1.96, and for .01 probability
2.58. The z-score of 1.96 leaves 2.5 percent of the distribution at each end
(or a .025 probability at the left end and a .025 probability at the right
end). A z-score of 2.58 leaves 0.5 percent of the distribution at each end
(or a .005 probability at the left end and a .005 probability at the right end
of the curve).
The mathematical procedures for
analyzing one-tail and two-tail tests are the same. The difference comes in the particular z-scores we use to establish different levels of significance.
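The four cut-off values quoted in this section (1.64, 2.33, 1.96, and 2.58) can be recovered from the normal curve directly rather than from a printed table. The sketch below uses `NormalDist` from Python's standard `statistics` module; the function name `critical_z` and its `tails` parameter are labels chosen here for illustration.

```python
from statistics import NormalDist


def critical_z(significance_level, tails):
    """Positive z-score cut-off for a one-tail (tails=1) or two-tail (tails=2) test."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tail test splits the significance level evenly between the two ends.
    area_in_one_tail = significance_level / tails
    return NormalDist().inv_cdf(1 - area_in_one_tail)


for level in (0.05, 0.01):
    one = critical_z(level, tails=1)
    two = critical_z(level, tails=2)
    print(f"level {level}: one-tail z = {one:.2f}, two-tail z = {two:.2f}")
```

For a one-tail test in the lower half of the curve, the same cut-offs apply with a negative sign.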
A report prepared by the economic
research branch of a large Canadian bank maintains that the average annual
family income in northern BC is $8432.
What do you conclude about the validity of this figure if a simple random
sample of 400 families in northern BC showed an average income of $8574, with a
standard deviation of $2000? Use a .05
level of significance. Would your
conclusions be any different at a .01 level of significance?
A manufacturer of fishing line advertises
that his product has an average tensile strength of 30 pounds. We took a sample of 100 sections of the
line and tested them. The average
tensile strength of this sample was 28 pounds, with a standard deviation of 12
pounds. Does this enable us to dismiss
the manufacturer's claims? Answer this
question for a significance level of .05 and .01. Think about whether the more appropriate test will be one-tailed
or two-tailed.
Another important aspect of hypothesis
testing in statistics is examining a hypothesis of the second type we discussed
above, one which makes a claim about two apparently different populations. For example, suppose someone makes the claim
that one group (e.g., women) is better at a certain task than another group
(e.g., men), or that Medication A is better at treating a certain disease than
Medication B, or that one population group (e.g., those on welfare) drinks more
than another group (e.g., wage earners).
These sorts of claims are made all the time (sometimes in a negative
form, such as the claim that good nutrition has no effect on children's success
in school—that is, that children with poor nutrition fare just as well at
school as children with good nutrition, and so on). Often these claims form the basis for popular opinions and thus
shape social policy. How can we
evaluate these assertions to ascertain whether or not there is any truth to
them?
There are a number of ways of
evaluating such claims. Here we are
concerned with only one, the z-test. In this test we use the z-score
(which, as we know, is a measurement in units of standard deviation), as we did
before, to identify the critical regions of the normal distribution.
Suppose, for instance, we have a memory
drug which we believe will help students perform better on examinations. We wish to test whether or not this drug is
indeed effective.
We begin, as usual, by collecting two
random samples, each of 100 students, all of whom are taking the same
examination. We give all students a
pill, but one group receives the memory pill and the other a useless sugar pill
(i.e., a placebo). All 200 students
think they have received the memory pill.
Thus, their psychological expectations are similar.
We collect the results of the examination,
and tabulate a summary as follows:
Group A (memory pill): Mean Score: 62.8; Standard Deviation: 10 marks
Group B (placebo): Mean Score: 60; Standard Deviation: 9 marks
The difference in the Mean Scores
between Group A and Group B is 2.8 marks, in favour of Group A (the memory-pill
group). Is this result significant or
could it be simply a chance difference resulting from these two samples?
Well, to answer this question, we
begin, as usual, by formulating a Null Hypothesis and an Alternative Hypothesis. In this case the Null Hypothesis is that
there is no difference, that both groups represent the same population (i.e.,
that the calculated difference between the sample means is insignificant, and
the Memory Pill is therefore ineffective).
The Alternative Hypothesis is that there is a real difference between
the two groups, that they belong to two different populations, those who
received the memory pill and those who did not, and that therefore the Memory
Pill did have a significant effect.
The first step is to calculate the
standard error for each group. The
standard error is the standard deviation of the sample divided by the square
root of the number in the sample. Since
there are 100 in each sample, the square root in each calculation of the
Standard Error will be 10. Thus the
Standard Error for Group A is 1 mark and the SE for Group B is 0.9 marks.
Now for a conceptual leap. If we took a
number of paired tests, as we did above (with one memory pill group and one
placebo group) we would always have two means to compare (one for Group A and
one for Group B). And by subtracting
one from the other, every pair of samples would give us a figure for the
difference between the two sample means.
If the Null Hypothesis is true, if,
that is, both the memory-pill group and the placebo-group come from the same
population, then we would expect the average difference between sample means to
be 0 (we would expect this from any collection of paired samples from a common
general population). Sometimes the
memory group sample would have a higher mean score, and sometimes the placebo
group would have the higher mean score.
If I consistently established the difference between the means by
subtracting the mean for the placebo group from the mean for the memory pill
group, I would end up with a collection of positive and negative numbers. However, if these groups indeed come from
the same population, the average of those numbers should be 0 (no
difference).
Now, if we imagine all the possible
numbers for this difference between the sample means, those figures would be
normally distributed. If the population
is the same (i.e., if the Null Hypothesis is true) then, as mentioned above,
the most frequent result should be 0 (no difference between the two means),
with decreasing frequencies in either direction, one indicating the frequency
of cases where the placebo group mean was higher and the other indicating the
frequency of cases where the memory-pill group's mean was higher.
Remember that this normal curve is a
theoretical representation of what will be the case if the two populations are
the same (if there is no real difference between Group A and Group B), that is,
if the Null Hypothesis is valid.
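This thought experiment is easy to imitate on a computer. The sketch below repeatedly draws two samples of 100 from one and the same population and records the difference between their means. The population figures (mean 60 marks, standard deviation 10 marks), the number of repetitions, and the fixed random seed are all assumptions made for illustration, not values taken from the study described here.

```python
import random
from statistics import fmean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable


def sample_mean(size, mu, sigma):
    """Mean of one random sample drawn from a normal population."""
    return fmean(rng.gauss(mu, sigma) for _ in range(size))


# Assumed common population: mean 60 marks, SD 10 marks (illustrative values).
differences = [
    sample_mean(100, 60, 10) - sample_mean(100, 60, 10)
    for _ in range(2000)
]

mean_difference = fmean(differences)
spread_of_differences = stdev(differences)

print(f"average difference between sample means: {mean_difference:.2f}")
print(f"standard deviation of the differences:   {spread_of_differences:.2f}")
```

When both samples come from the same population, the average difference should come out close to 0, and the standard deviation of the differences should be roughly 1.4 marks for these assumed figures.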
Now, if we could calculate the standard
deviation of this normal curve we would know the relationship between the size
of the difference between the two sample means and its probability (as we do
with any normal distribution).
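Well, we can calculate that standard
deviation of the normal curve representing the distribution of differences
between the sample means. It will be the standard error of the difference
between the sample means. This is a
combination of their separate standard errors (1): the square root of the sum of their squares.
$$SE_{\text{Diff}} = \sqrt{SE_A^2 + SE_B^2} = \sqrt{1^2 + 0.9^2} = \sqrt{1.81} \approx 1.3$$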
This SE (Diff) figure of 1.3 marks
tells us that in the normal distribution curve indicating the frequency of the
values for the differences between the sample means, there is approximately a
.68 probability that the difference will fall between 0 (the mean difference)
and 1.3 marks on either side, and approximately a .95 probability that such a
difference will fall within 2 standard
deviations of 0 (the mean difference), that is, between -2.6 and +2.6.
Now the difference we observed between
the two sample means is 2.8. The z-score for this difference is 2.8 divided
by 1.3, or 2.15. In other words, this
value falls between 2 standard deviations and 3 standard deviations.
Whether this result of a z-score of 2.15 is significant,
therefore, will depend on the level of significance we set. As before, we know that z-scores between -2 and +2 include approximately 95
percent of all possible scores, or that there is a .95 probability that the
result in a normally distributed frequency will fall within a z-score of +2 and -2, or, alternatively,
that there is a .05 probability that any result in a normal distribution will
fall beyond a z-score of 2 on either
side of the mean.
In this case, the figure we obtained
for the difference between the means is a z-score
of 2.15. Since that is clearly more
than 2 and since we know the probability of getting a result in the region with
a z-score of more than 2 is .05, we
can conclude with 95 percent certainty that the two samples we have been
studying come from different populations.
Should we then accept or reject the
Null Hypothesis? What we affirm will,
as before, depend upon how much risk we are prepared to take, in other words,
on the significance level we set. If we
decide that a significance level of .05 is acceptable, then we shall reject the
Null Hypothesis, conclude that these two samples do, indeed, come from two
different populations, and that the difference between the two is real and
significant. Therefore, we conclude
that the memory pill does have a significant effect.
On the other hand, if we set a more
stringent significance level of .01, then we cannot reject the Null Hypothesis.
The z-score we calculated (2.15)
falls well within the limit for a z-score
at the .01 level of significance (2.58).
If we want to be 99 percent sure of our conclusions, we will have to
agree that the samples do not establish a difference between the memory-pill
group and the placebo group (2).
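The two decisions just described can be reproduced in a few lines. This is a minimal sketch in Python, assuming a two-tailed test and using the figures from the example (an observed difference of 2.8 and a standard error of 1.3). The variable names are illustrative and are not part of the handbook's Excel procedure.

```python
from statistics import NormalDist

observed_difference = 2.8  # difference between the two sample means
standard_error = 1.3       # standard error of the difference

z = observed_difference / standard_error

# Probability of a result at least this far from 0, in either direction,
# if the Null Hypothesis is true (a two-tailed figure).
p_two_tail = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, two-tailed probability = {p_two_tail:.3f}")
for alpha in (0.05, 0.01):
    decision = "reject" if p_two_tail < alpha else "cannot reject"
    print(f"At the {alpha} level: {decision} the Null Hypothesis")
```

The two-tailed probability falls between .01 and .05, which is why the same data lead to rejection at the first level and not at the second.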
K. Self Test on a z-Test
1. Following exactly the same
method as that demonstrated above, resolve the following question at the .05
significance level.
To assess the impact of windowless schools on the psychological development of
school children, an anxiety test was given to a class of 40 children in a
windowless school. The same test was
given to a similar class of 30 students in a school with windows. The results of the test are as follows:
                       Windowless School    School with Windows
Number in sample:      40                   30
Mean score:            117                  112
Standard deviation:    10                   12
If we set ourselves a significance level of .05, can we conclude that there is a
real difference in the anxiety levels of the two populations of students?
2. A study was undertaken to
see whether blond-haired men had a more active dating life than black-haired
men. In a random sample of 100
blond-haired men and 100 black-haired men, the following information was
collected and calculated:
                          Blond-Haired Men    Black-Haired Men
Dates per month (mean):   7.5 dates           6 dates
Standard deviation:       4 dates             3 dates
On the basis of this information, and at a significance level of 0.01, examine
whether there is a real difference in the dating frequencies of the two
groups. Do blonds have more fun?
3. Recently
at Malaspina University-College in Building 355 the loud complaint was heard
echoing down a corridor that Liberal Studies instructors have very different
marking standards and results. To test
this claim, we collected 30 sample grades on main seminar essays from each of two
Liberal Studies instructors on the same team.
The results are as follows:
                           Instructor A    Instructor B
Number of papers:          30              30
Mean grade (out of 100):   78.17           78.37
Standard deviation:        8.40            9.63
On the basis of this information, would you uphold or reject the complaint? Indicate the confidence level of your
conclusion. Note that these figures are
based on hard data from LBST 302 last semester, so your result is an analysis
of what really goes on.
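If you wish to check your arithmetic on questions of this kind, the hand calculation can be sketched in Python. The function below is an illustration only, not part of the handbook's method. It assumes the procedure described in note (1): the squared standard errors of the two sample means (each taken as the standard deviation squared, divided by the sample size) are added together, and the square root of the total is the standard error of the difference. The function name and the figures in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Return (z, two-tailed probability) from summary statistics."""
    # Squared standard error of each sample mean is sd^2 / n; their sum
    # is the variance of the difference between the two means.
    se_difference = sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / se_difference
    p_two_tail = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tail

# Hypothetical figures, not taken from the questions above.
z, p = two_sample_z(mean1=52.0, sd1=8.0, n1=64, mean2=49.0, sd2=6.0, n2=36)
print(f"z = {z:.2f}, two-tailed probability = {p:.3f}")
```

Substituting the figures from any of the three questions gives a z-score that can be compared with the one worked out by hand.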
In practice, conducting a z-test is easier than the procedure outlined above, because we can
get Excel to carry out all the mathematics for us. All we have to do is collect the data on the two samples, request
a z-test analysis from the Excel
statistics options, and examine the table of the results. The following paragraphs describe the
procedure.
1. First enter the data for your two samples, one in Column A and the other in
Column B. The number of entries in each column does not have to be the same,
but there must be more than 30 entries in each column.
2. Then, using the Descriptive Statistics tool from the Data Analysis option (on
the Tools Menu), generate the table of Descriptive Statistics for each sample.
Note carefully the figure for the Variance of each sample.
3. Then from the Data Analysis option, select the last item: z-Test: Two Sample
for Means. When you get the dialogue box, in the Variable 1 Range box, indicate
the range of the first sample in Column A (e.g., $A$1:$A$35). In Variable 2
Range, indicate the range of the second sample in Column B (e.g., $B$1:$B$48).
In the Hypothesized Mean Difference box, enter the figure 0. Since the Null
Hypothesis, which we are attempting to refute, says that both samples come from
the same population, we are hypothesizing that the difference between the means
for the two populations is 0.
4. In the Variable 1 Variance (known) box enter the figure for the Variance for
the sample in the A Column (you will find this figure in the Descriptive
Statistics box you generated in the second step described above). In the
Variable 2 Variance (known) box enter the corresponding figure for the Variance
of the second sample.
5. In the Alpha box the number 0.05 should already appear. Leave this alone for
the moment (if it is empty or shows a number different from 0.05, then enter the
number 0.05). The Alpha figure indicates the significance level for this test.
A figure of 0.05 states that you want to be 95 percent certain of the result or,
alternatively put, that you want the probability of being wrong to be .05 or
lower.
6. In the Output Range box type the address of the cell where you want the
Output Table to appear (or alternatively, with the line active in the Output
Range box, click the mouse on an empty cell). The Output Table will take up
three columns and twelve horizontal rows.
7. Then click on OK. After a couple of seconds, a table should appear in the
place designated by the Output entry. This table has the heading: z-Test: Two
Sample for Means. You will need to widen the left hand column of the table in
order to read the names of the items. In the table there are figures for the
following items: Mean, Known Variance, Observations, Hypothesized Mean
Difference, z, P(Z<=z) one-tail, z Critical one-tail, P(Z<=z) two-tail, z
Critical two-tail.
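For readers who would like to see what lies behind the table, the steps above can be sketched outside Excel. The following Python function is an illustration under stated assumptions, not Excel's own code. It uses the sample variance of each column as the "known" variance, as the steps above direct, and it reports the one-tail probability as the area beyond the absolute value of z. The function name and the two columns of marks are hypothetical, generated only so that the sketch can be run.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def z_test_two_sample(sample1, sample2, hypothesized_diff=0.0, alpha=0.05):
    """Return a dict holding the same items as the Excel output table."""
    var1, var2 = variance(sample1), variance(sample2)  # sample variances
    n1, n2 = len(sample1), len(sample2)
    se_difference = sqrt(var1 / n1 + var2 / n2)
    z = (mean(sample1) - mean(sample2) - hypothesized_diff) / se_difference
    std_normal = NormalDist()
    p_one_tail = 1 - std_normal.cdf(abs(z))
    return {
        "Mean": (mean(sample1), mean(sample2)),
        "Known Variance": (var1, var2),
        "Observations": (n1, n2),
        "Hypothesized Mean Difference": hypothesized_diff,
        "z": z,
        "P(Z<=z) one-tail": p_one_tail,
        "z Critical one-tail": std_normal.inv_cdf(1 - alpha),
        "P(Z<=z) two-tail": 2 * p_one_tail,
        "z Critical two-tail": std_normal.inv_cdf(1 - alpha / 2),
    }

# Hypothetical marks standing in for Column A and Column B.
random.seed(1)
column_a = [random.gauss(70, 10) for _ in range(35)]
column_b = [random.gauss(65, 12) for _ in range(48)]

table = z_test_two_sample(column_a, column_b)
for item, value in table.items():
    print(f"{item}: {value}")
```

The items are listed in the same order as in the Excel table, so the two can be compared line by line.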
The Mean figure gives the
arithmetical average for each sample.
It should be the same as the figure for the Mean in your Descriptive
Statistics chart generated earlier. The Known Variance similarly gives the Variance
for each sample and corresponds to the Variance figures generated earlier
(these are the figures you entered into the z-Test
dialogue box). The Observations figure is the number of items in each sample. The Hypothesized
Mean Difference should be 0 (the figure you entered in the z-Test dialogue box earlier).
The z figure indicates in standard deviation units how far from the mean the figure
for the difference between your two samples is located. Remember that the normal curve for all the
differences between all the possible pairs of samples from the population has a
mean of 0. Your two samples did not
have the same mean; thus the difference between them falls away from the mean
of 0 in this normal distribution. The z figure tells you how far away the difference falls.
Following the z figure there are four
lines, two concerning one-tail and two concerning two-tail testing. You will use one or the other of these pairs
of figures, not both. The one you use will
depend upon the nature of your Alternative Hypothesis.
If your Alternative Hypothesis makes a claim about the direction of the difference
between the two populations, then you need the one-tail figures. For instance, if your Alternative Hypothesis is a claim like
“Women drink more alcoholic beverages than men” or “Instructor A gives higher
marks than Instructor B” or “People who smoke more than a pack of cigarettes a
day have more heart attacks than people who do not smoke any cigarettes,” then
you will need the one-tail figures.
Since your Alternative Hypothesis asserts that one of the populations will
have a higher value than the other, you are interested only in one end of
the distribution curve.
However, if your Alternative Hypothesis simply asserts that there will be a
significant difference between the populations (without asserting which will be
higher or lower), then you need the two-tail figures. For example, you will need a two-tailed test for any Alternative
Hypothesis like the following: “There is a difference in the amounts of alcohol
men and women drink,” “Instructor A and Instructor B mark at different
standards,” “Older students’ marks in Liberal Studies are different from
younger students’ marks.” And so
on. Note that the interpretation of the
figures is the same, no matter which of the two you are conducting, but the
figures will be different.
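The relation between the one-tail and two-tail probabilities can be sketched briefly in Python. This is an illustration only: the z figure of 1.8 is hypothetical, and the one-tail figure assumes that the observed difference lies in the direction the Alternative Hypothesis predicts.

```python
from statistics import NormalDist

def tail_probabilities(z):
    """Return (one-tail, two-tail) probabilities for a z figure."""
    one_tail = 1 - NormalDist().cdf(abs(z))  # area beyond z at one end
    two_tail = 2 * one_tail                  # the same area at both ends
    return one_tail, two_tail

one_tail, two_tail = tail_probabilities(1.8)  # 1.8 is an illustrative z
print(f"one-tail: {one_tail:.3f}, two-tail: {two_tail:.3f}")
```

With this z figure the one-tail probability is below .05 and the two-tail probability is above it, so the choice of Alternative Hypothesis can decide the outcome of the test. This is one reason to settle on the hypothesis before looking at the results.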
The P(Z<=z) figure indicates the probability of obtaining a difference between
the sample means at least as large as the one observed if the Null Hypothesis
were correct (that is, if there were no difference between the populations). Thus a figure here of, say,
0.03 would indicate that, if the two populations were the same, a difference
this large would turn up in only 3 percent of all possible pairs of samples. The smaller this figure, the harder it is
to reconcile the sample results with the Null Hypothesis.
Whether or not the figure for P(Z<=z) enables you to reject the
Null Hypothesis will depend upon the significance level (Alpha) you set. If your level is .05, then a figure of .03
(smaller than the specification) indicates that the Null Hypothesis should be
rejected and the Alternative Hypothesis affirmed. However, if the level you have set is .01, then a
result for P(Z<=z) of .03 (which is higher than the specification) means that
you cannot reject the Null Hypothesis.
Another quick way of determining whether to affirm or reject the Null
Hypothesis is to examine the z-Critical figure. This number indicates the value above which
the z figure is too high for one to accept the Null Hypothesis. So to determine whether or not one should
affirm or reject the Null Hypothesis, simply compare the z figure with the z-Critical
figure. If the z figure is less than the z-Critical
figure, one affirms the Null Hypothesis; if the z figure is greater than the z-Critical
figure, then one rejects the Null Hypothesis.
If you change the Alpha value in a z-test
of this sort, you will notice that the z-Critical
values will change. For instance, if
you go back and start the z-test over
again, but this time in the dialogue box enter a value for Alpha of 0.01
(rather than 0.05), then you are demanding a confidence level of 99 percent or,
alternatively, you want a result in which there is only a 1 percent chance (p = 0.01) of your being wrong.
If you do that and generate a second z-Test:
Two Sample for Means table, you will notice that all the values in that table
are the same as for the first table, except for the z-Critical values, which have increased. What that means, of course, as you should understand by now, is
that if I want to be more confident of my result, I have to widen the interval
within which I judge results to be consistent with the Null Hypothesis.
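The way the z-Critical values move with Alpha can be sketched in Python. This is an illustration, assuming the critical values are read from the standard normal curve. The function name is hypothetical, and the z figure of 2.15 is the one from the memory-pill example, treated here as a two-tailed test.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Return (one-tail, two-tail) z-Critical values for a given Alpha."""
    std_normal = NormalDist()
    return std_normal.inv_cdf(1 - alpha), std_normal.inv_cdf(1 - alpha / 2)

z = 2.15  # the z figure from the memory-pill example
for alpha in (0.05, 0.01):
    one_tail, two_tail = z_critical(alpha)
    verdict = "reject" if abs(z) > two_tail else "cannot reject"
    print(f"Alpha {alpha}: one-tail {one_tail:.2f}, "
          f"two-tail {two_tail:.2f} -> {verdict} the Null Hypothesis")
```

Only the critical values depend on Alpha. The z figure itself is fixed by the samples, which is why the rest of the Excel table does not change.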
Note that whenever you state a conclusion to a z-test, you must obviously indicate the
confidence level you used. As we have
discussed in the text, a Null Hypothesis which you reject at a .05 value for
Alpha (a confidence of 95 percent), you may have to accept at a .01 value for
Alpha (a confidence of 99 percent).
(1) We will not go into the mathematical calculations which justify this formula
for the standard error of the normal curve of the differences between the means.
The principle of the formula,
which is derived from the Central Limit Theorem, is that if we are combining
two normally distributed characteristics, then the variance in the resulting
distribution will be the sum of the variances in the two original frequency
distributions. We add up the two
squared standard errors to get the variance of the new distribution, and then
we take the square root of that total to get the standard deviation of the new
combined frequency distribution.
(2) If we want to
become more confident of our results than in the above example, we can narrow
the confidence interval by increasing the size of the sample (thus lowering the
value of the Standard Error). However,
as we discussed in the last section, in Section N (p. 69), to achieve a useful
reduction of the Standard Error, we will need a very large increase in the
sample size (to reduce the SE by 50 percent, we must quadruple the sample
size).
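The point about quadrupling the sample can be seen in a short Python sketch. It assumes the usual formula for the standard error of a sample mean, the standard deviation divided by the square root of the sample size. The standard deviation of 10 and the sample sizes are hypothetical.

```python
from math import sqrt

def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / sqrt(n)

sd = 10.0  # hypothetical standard deviation
for n in (25, 100, 400):
    print(f"n = {n:3d}: SE = {standard_error(sd, n):.2f}")
```

Each fourfold increase in the sample size cuts the standard error in half.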