I’ll Give You a Definite Maybe
An Introductory Handbook on Probability, Statistics, and Excel

[This handbook has been prepared by Ian Johnston of Malaspina University-College, Nanaimo, BC, for Liberal Studies students. The text here is in the public domain, released May 2000]

Section Seven: Comparing Samples, Tests of Significance

A. Introduction

In the last section we examined how, on the basis of a relatively small sample, we can, with differing levels of confidence, draw conclusions about the population in general. In this section we examine, as our last section on statistics, the very important business of testing performance claims about an entire population, by sampling that population and drawing inferences from the sample about the claim.

Suppose for example, a manufacturer makes a certain claim about a product (e.g., that it will last for X years without trouble, that it will effectively cure headaches, that it will live up to clear specifications, and so on). How do we test such a claim from a small sample, and how certain should we be of our results? When are we entitled to dismiss that manufacturer's claim and when should we accept it?

Or again, suppose we want to compare two populations to ascertain whether there is a real difference between them (e.g., the skills of males and females in a particular area, or the effects of different medicines in the treatment of a particular disease, or the effectiveness of different methods for educating elementary students in reading, or the marking standards of two or more different Liberal Studies instructors). In such inquiries, we will need to take a sample of each population and then compare the results. On the basis of an analysis of each sample and the conclusions we reach about the populations the samples come from, we should be able to determine within certain confidence levels whether the difference between the populations is real (i.e., males do not perform as well as females in this area or vice versa, there is a real difference between medicine A and Medicine B in the treatment of this disease, or one method of teaching reading is clearly better than the other) or whether there is no real difference between the two (or more) things being compared.

B. The Null and the Alternative Hypothesis

The above remarks introduce the point that an important part of statistics is testing the claims made about a population or about two allegedly different populations. These claims typically take one of the two following forms:

1. This product will live up to these specifications (e.g., the brand of light bulb will last for 300 hr., this car will get 55 miles per gallon, this pill will improve your memory, and so on).

2. Group A is significantly different from Group B in respect to X (e.g., the urban poor suffer more from drug addiction than more affluent suburbanites; women handle stress better than men; children who learn to read through phonics are better readers than children who use other methods in reading instruction; Liberal Studies Instructor X's grades are higher than Liberal Studies Instructor Y's grades; and so on).

A claim of this type obviously might be formed negatively, in the following form: There is no difference between Group A and Group B in respect to attribute X (e.g., the claim that women are not as good as men in mathematics is wrong: men and women are equal in this respect).

The method of testing such claims begins by collecting the relevant samples. For instance, in testing a claim of the first form above, we take a random sample of the product (more than 30), and in the case of a claim of the second form above we take a random sample of each of the two populations being compared.

The next stage involves forming two hypotheses, as follows:

1. The Null Hypothesis is, in the case of the first type of claim, the assertion that there is no difference between the sample we have collected and the general population established by the claim (for examples see below). In the case of the second type of claim, the Null Hypothesis is that there is no difference between the two populations in the study (e.g., between men and women, between the urban poor and the suburbanites, between Liberal Studies instructors, and so on).

2. The Alternative Hypothesis is, in respect to the first sort of claim, the assertion that the sample we have taken belongs to a population different from that described in the original claim (thus the claim is suspect). In respect to the second type of claim, the alternative hypothesis is that the samples indicate that the two populations under scrutiny are, indeed, different.

The testing of the original claim is an attempt to refute the Null Hypothesis. If we can refute the Null Hypothesis, then we will affirm the Alternative Hypothesis. If, however, we fail to refute the Null Hypothesis, then we accept it.

Note the basis of this method. If we discover that, in the first type of claim, the sample comes from the population described in the claim, or, in the second type of claim, that the samples we have collected probably come from the same population, then the Null Hypothesis is upheld (no difference); if our samples do not satisfy the requirement that they probably come from the same population, then the Null Hypothesis is refuted, and we can uphold the Alternative Hypothesis.

C. An Example of a Performance Claim

Let us consider a test of a claim of the first type. Suppose, in an attempt to justify their demands for higher wages on the ground of increased productivity, the employees in a factory report that, on the average, the workers complete an individual task in 13 minutes. As a general manager, what can you conclude from a study of 400 workers which shows an average completion time of 14.25 minutes, with a standard deviation of 10 minutes?

The Null Hypothesis here is that there is no difference between the claim made by the workers and the results of the survey. In other words, they are claiming that, although the average completion time in the sample is higher than their claim, the difference between that sample average and their claim is not significant. The Alternative Hypothesis is that there is a difference and that, therefore, the workers' claim is not valid (in other words, the Alternative Hypothesis maintains that the difference between the sample average and the workers’ claim is significant). So the question becomes the following: What is the probability that the results in the sample come from the same population as that established in the workers' claim?

We begin by calculating the standard error of the sample. Remember that this will tell us the standard deviation in the normal curve of the averages of all the samples we could take. The standard error is the standard deviation of the sample divided by the square root of the number in the sample:

$$SE = \frac{s}{\sqrt{n}}$$

Thus,

$$SE = \frac{10}{\sqrt{400}} = \frac{10}{20} = 0.5$$

The standard error here is 10 divided by 20 or 0.5 minutes.

Now, the workers claim that the average length for an individual task is 13 min. The sample gave the result for the same task of 14.25 min. Thus, the difference here between the sample mean and the mean in the claim is 1.25 minutes. So how significant is a difference of 1.25 minutes?

The standard error of 0.5 is the standard deviation in the curve of sample means. So the difference of 1.25 minutes is equivalent to 1.25 divided by 0.5, or 2.5 standard errors, or, alternatively put, the difference of 1.25 minutes has a z-score of 2.5 in the normal curve of sample means.

A z-score of 2.5 lies beyond two standard deviations. And we know that approximately 95 percent of all normally distributed results will occur within 2 standard deviations of the mean (i.e., for z-scores between -2 and +2). Furthermore, we know that of the 5 percent of the results which lie beyond 2 standard deviations, 2.5 percent will be above the mean and 2.5 percent will be below the mean. Thus, in this case, we know that the sample falls in an area with only a 2.5 percent probability (p = 0.025), that is, the area to the right of the mean more than 2 standard deviations away from the mean.

This means, in effect, that there is only a 2.5 percent chance (p = .025) that our sample comes from the same distribution as that described in the workers' claim. Therefore, we can say with approximately 97.5 percent certainty that the sample we took is not from the same population as that established by the workers' claim and that, therefore, the Null Hypothesis is not valid and that the Alternative Hypothesis is upheld. Thus, we conclude that the workers' claim is false.
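The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the variable names and the helper `upper_tail` are our own, and the tail area is computed from the standard library's error function rather than read from a table. Note that the 2.5 percent figure used above is the rule-of-thumb area beyond a z-score of 2; the area beyond 2.5 itself is smaller still, which only strengthens the conclusion.

```python
import math

def upper_tail(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

claimed_mean = 13.0   # minutes, the workers' claim
sample_mean = 14.25   # minutes, from the study of 400 workers
sample_sd = 10.0      # minutes
n = 400

standard_error = sample_sd / math.sqrt(n)          # 10 / 20 = 0.5
z = (sample_mean - claimed_mean) / standard_error  # 1.25 / 0.5 = 2.5
p_one_tail = upper_tail(z)                         # area beyond z = 2.5

print(f"SE = {standard_error}, z = {z}, upper-tail area = {p_one_tail:.4f}")
```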

D. A Second Example: Assault on Batteries

A dealer selling batteries makes the claim that the average life of her product is 50 hours. We take a random sample of the product (100 batteries) and discover that the average life of this sample is 49 hr, with a standard deviation of 5 hr. Does this entitle us to dismiss the dealer's claim as bogus or does it have no effect on the dealer's claim?

The Null Hypothesis here is that the sample we have taken comes from the same population of batteries as those described in the dealer's claim (i.e., those with a life expectancy of 50 hr). The Alternative Hypothesis is that our sample reveals that the population of batteries from which we took the sample is different from the one described in the dealer's claim (and that therefore the dealer's claim about her batteries is false).

Now, as we saw in the last section, we can calculate the standard error of the mean (the standard deviation of the sample divided by the square root of the number of items in the sample), or 5 divided by the square root of 100, or 5 divided by 10, or 0.5 hr.

So now we can evaluate the dealer's claim. She states that the mean life of her batteries is 50 hr. How confident can we be that the result we obtained (49 hr) refutes that claim?

To estimate that probability we need to find out the distance an average of 49 hr is from the mean time stated by the dealer. That value is 1 hr below the dealer's stated mean, or 1 unit to the left of the dealer's mean on the normal curve (or -1 hr).

What is the z-score for this -1 hr value? Well, the z-score is the distance from the mean in standard deviation units. So to obtain that, we divide -1 by the standard error. If we divide -1 by .5, we get a z-score of -2.

But we know from our study of the normal curve that a z-score of -2 or lower (that is, a score falling 2 or more standard deviations to the left of the mean) marks off an area indicating approximately 2.5 percent at the lower end from the rest of the curve. Thus, the mean of our sample falls in the tail end of the normal curve representing all the sample means, in an area representing frequency values of 2.5 percent.

Therefore, there is a 2.5 percent probability (or p = .025) that the Null Hypothesis is correct and that the dealer's claim is true. There is a 97.5 percent probability (p = .975) that the Alternative Hypothesis is correct and that the dealer's claim is false.

Or, alternatively put, the risk of dismissing the Null Hypothesis is 2.5 percent (p = 0.025). If we do dismiss the Null Hypothesis (and thus the dealer's claim), we have a .025 chance of being wrong.

Is that a risk worth taking? That will depend on how certain we want to be before making the decision. In other words, we will need to set a level of significance.

E. Level of Significance

The level of risk we are willing to set in order to pass judgement or withhold judgement on the Null Hypothesis depends upon the level of significance we wish to set (i.e., the size of the risk we are prepared to take or how stringent we want our test to be). Statisticians have set arbitrary limits of .05 or .01: that is, a significance level of .05 for rejecting the Null Hypothesis is not as stringent as a significance level of .01. The level of significance will usually be set out in the specifications of the test. The application of these levels of significance will become clearer in the examples below.

The level of significance we set indicates how careful we wish to be in making a judgement about the Null Hypothesis. If the limit is .05, that means that if there is only a 5 percent or less chance of being wrong (or if we are 95 percent or more certain), then we shall accept that as decisive. In such a case, our result above about the batteries (a probability of .025 that the Null Hypothesis is correct) is significant, and we will dismiss the dealer's claim (since .025 is less than .05, and the significance level indicates the cut off point, below which we should not accept a claim with such a low probability).

If, however, we have set a level of .01 probability as our significance point, that means we will not dismiss a claim unless its probability is 1 percent or lower (or, alternatively put, until we are 99 percent or more certain that it is wrong). In that case, the result of our test of the batteries (p = .025) is above our significance level and we would therefore not dismiss the claim.

Notice what we are saying here. The analysis of the sample reveals that there is only a .025 probability that the population from which our sample was drawn is the same population as that described in the dealer's claim (batteries with an average life of 50 hr). If we are confident enough at a probability of .95 (or 95 percent) then we will accept this result as significant and assert that our sample establishes that we are .95 confident that the dealer's claim is erroneous.

If, however, the level of significance is set at .01, then we will reject the Alternative Hypothesis and accept the Null Hypothesis (the dealer's claim), because our result of .025 is not below the level of significance we have set (a much more stringent requirement than the earlier figure of .05).
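The battery example and the two significance levels can be put together in a short Python sketch. The function name `decide` and its wording are our own illustration, not a standard routine; the tail area comes from the standard library's `statistics.NormalDist`, so it is slightly more exact than the rounded .025 used in the text.

```python
import math
from statistics import NormalDist

claimed_mean, sample_mean, sample_sd, n = 50.0, 49.0, 5.0, 100

standard_error = sample_sd / math.sqrt(n)           # 5 / 10 = 0.5
z = (sample_mean - claimed_mean) / standard_error   # -1 / 0.5 = -2.0
p_lower_tail = NormalDist().cdf(z)                  # area to the left of z

def decide(p_value, significance_level):
    """Reject the Null Hypothesis only when p falls below the chosen level."""
    if p_value < significance_level:
        return "reject the Null Hypothesis"
    return "do not reject the Null Hypothesis"

for level in (0.05, 0.01):
    print(f"level {level}: p = {p_lower_tail:.4f} -> {decide(p_lower_tail, level)}")
```

The same sample therefore leads to opposite decisions at the two levels, which is the point of setting the level before looking at the result.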

Note very carefully that if we decide that we are not going to take the risk of dismissing the dealer's claim because we are not sufficiently certain, we have, in effect, accepted the Null Hypothesis and rejected the Alternative Hypothesis. But that does not mean that we have clearly "proved" the dealer's claim. After all, we have established that the probability that the dealer's claim is right is only .025, or 25 cases out of 1000. We would, as prudent statisticians, accept the dealer's claim but reserve judgement on any exact determination of the question (i.e., cover our backsides).

F. Type I and Type II Errors

In any test of claims like the battery or the workers' productivity examples above there is generally some risk that we may be rejecting a true hypothesis; that is, the probability that we may be wrong is greater than 0. Rejecting a Null Hypothesis when it is true is called a Type I Error. On the other hand, we may decline the risk and accept the Null Hypothesis when it is, in fact, false. This is called a Type II Error.

The level of significance we set in deciding whether to accept or reject the Null Hypothesis depends upon which of these two errors we most wish to avoid. If there are very serious consequences in a Type I error, then we should seek to minimize the risk, by setting the level of significance at .01. Such a stringent level means that we are more likely to accept the Null Hypothesis than we are at the .05 level, since the relevant z-score will have to be more than 3 rather than more than 2 (or, using the exact figures, more than 2.58 rather than more than 1.96).

But the less risk we are willing to take (in order to minimize Type I errors) the more we are likely to fall into Type II errors. Setting very strict limits for rejecting the Null Hypothesis will increase the chances that we are accepting one that is false.
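To see this trade-off in numbers, we can return to the battery example and suppose, purely for illustration, that the batteries really average 49 hr. That "true" mean is our assumption, not something the example tells us. The sketch below (the function name `type_ii_risk` is our own) then asks how often a sample of 100 would fail to reject the dealer's claim of 50 hr at each significance level in a one-tail test.

```python
from statistics import NormalDist

standard_normal = NormalDist()
claimed_mean = 50.0       # dealer's claim (the Null Hypothesis)
standard_error = 0.5      # from the battery example
assumed_true_mean = 49.0  # hypothetical value, assumed only for illustration

def type_ii_risk(significance_level):
    """Chance of keeping a false Null Hypothesis in a lower-tail test."""
    # Sample means below this cut-off lead us to reject the claim.
    cutoff = claimed_mean + standard_normal.inv_cdf(significance_level) * standard_error
    # If the true mean is assumed_true_mean, how often does a sample mean
    # land above the cut-off, so that we wrongly keep the claim?
    sampling_distribution = NormalDist(assumed_true_mean, standard_error)
    return 1 - sampling_distribution.cdf(cutoff)

for level in (0.05, 0.01):
    print(f"Type I risk {level}: Type II risk = {type_ii_risk(level):.2f}")
```

Under this assumption the stricter .01 level carries a noticeably larger Type II risk than the .05 level.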

To use an educational analogy: if we relax our standards of admission, we run the risk of admitting students who are not academically capable of the particular course of study we administer (Type II error); however, if we tighten up our entrance standards, we run the risk of rejecting students who are in fact academically suitable (Type I error). We will decide between these two courses of entry policy on the basis of which mistake will have the more serious consequences.

Or, to use another example, suppose in a courtroom use of a forensic test of DNA, we are using a probability study to determine the guilt or innocence of an accused by comparing two tissue samples, one from the body of the victim and one from the defendant's body. In such a case, acceptance of the Null Hypothesis would confirm the lack of the difference between the two samples (and thus help lead to a conviction); a rejection of the Null Hypothesis and an acceptance of the Alternative Hypothesis will lead us to acquit, because the samples come from different populations (i.e., the defendant's DNA does not match that found at the scene of the crime).

If we are keen to give the defendant the full benefit of the doubt (or if we are the defending lawyer), then we will want to work with a level of significance as generous as possible (e.g., 0.05), a number that will make it easier for us to reject the Null Hypothesis. Of course, if we do this, we may be rejecting a hypothesis which is, in fact, true (Type I error), making it easier for a guilty person to get off. If, on the other hand, we are keen to convict, we want to confirm the Null Hypothesis, and thus we will set the strictest acceptable level of significance (0.01). This will, of course, increase the chances that we may be convicting an innocent person by accepting the Null Hypothesis, when it is, in fact, false (Type II error).

This business of Type I and Type II errors should remind us that the sorts of statistical tests we are applying do not indeed "prove" anything once and for all. The tests are, in effect, a technical device to determine whether a specific claim (a hypothesis) meets a given standard (a level of certainty). There will always be some risk, however slight, that the conclusion we draw from a statistical test of a particular hypothesis is wrong.

Thus, demonstrating that a hypothesis has passed a particularly stringent statistical test does not prove the hypothesis beyond all doubt. It does, however, indicate to researchers that there may very well be something in the claim. Similarly, the rejection of a hypothesis does not finally "disprove" it. Statistical analysis can only indicate that the claim has failed to meet a given level of certainty.

This point also brings out how, by apparently manipulating statistics, one can seem both to "prove" and to "disprove" a particular claim in a single test. For it is clear that at the .05 level of significance we may be able to reject the Null Hypothesis and accept the Alternative Hypothesis, while at the same time at a .01 level of significance we will have to reserve judgement and accept the Null Hypothesis.

In other words, interpreting the conclusions of such a statistical analysis requires us to know the significance level at which the claim is made and to be very careful about accepting statistical results without such knowledge.

G. A Note on the z-Score: One Tail and Two Tail Tests of Significance

It will be clear from the above example that once we know the z-score for the difference between the sample mean and the mean established in the claim, we know whether or not to reject the Null Hypothesis. For a z-score of 2 or less tells us that the sample mean lies within 95 percent of the overall population of sample means. A z-score of less than 3 gives us approximately 99 percent certainty.

Conversely, a z-score of more than 2 tells us that there is a less than .05 probability (or 5 chances in 100) that the sample mean comes from the general population. And a z-score of more than 3 indicates less than a .01 probability (or 1 chance in 100) that the sample mean comes from the general population.

You will recall that these figures for the z-score are working approximations. The accurate figures (from the table introduced in the last chapter) are as follows:

For a .05 level of significance, the exact z-score which marks the cut-off point is 1.64. For a .01 level of significance the exact z-score which marks the cut off point is 2.33. Any result higher than these z-scores (which one we select will depend upon the level of significance we set) will require us to reject the Null Hypothesis. The z-scores appropriate to other levels of significance may be determined by consulting the table.

These exact z-scores mean the following. A z-score of +2.33 marks the point separating 99 percent of the distribution from the 1 percent in the tail of the curve to the right of that score (a z-score of -2.33 similarly marks the point separating 99 percent of the distribution from the 1 percent in the tail of the curve to the left of that score, that is, below the mean). A z-score of +1.64 indicates the point in the normal curve separating 95 percent of the distribution from the 5 percent at the extreme right-hand end; thus, 95 percent of the distribution has a lower z-score than +1.64.
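These areas can be confirmed without a printed table. The following sketch uses Python's standard `statistics.NormalDist`; the helper name `area_within` is our own. It reports the share of a normal distribution lying within 2 and 3 standard deviations of the mean, and the share lying below the one-tail cut-off points of 1.64 and 2.33.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def area_within(z):
    """Share of a normal distribution lying between -z and +z."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

for z in (2, 3):
    print(f"within +/-{z} standard deviations: {area_within(z):.4f}")

for z in (1.64, 2.33):
    print(f"below a z-score of +{z}: {standard_normal.cdf(z):.4f}")
```

The working figures of "about 95 percent" and "about 99 percent" for z-scores of 2 and 3 are therefore slightly conservative approximations.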

These particular z-scores (1.64 and 2.33) are useful when we are interested only in whether our sample mean is different in one direction from the mean in the claim (bigger or smaller but not both together). For example, in an investigation of the dealer's claim, we were not interested in whether the batteries last, on average, longer than the dealer claims. We wanted to know whether the batteries failed to live up to the dealer's specifications. In other words, we were interested only in one half of the distribution curve of the sample means, the lower half. We wanted to know, with 95 percent certainty, whether that claim was true or not.

To get such 95 percent certainty, we needed to locate the line which separates the lowest 5 percent of the normal curve from the remaining 95 percent. That line is given by the z-score of -1.64 (that is, 1.64 standard errors below the mean). If we wanted 99 percent certainty, we would need the z-score which indicates the line separating the lowest 1 percent from the rest of the distribution; that z-score is -2.33.

Similarly, in dealing with the workers' claim about more productivity, the manager is not interested in whether or not the workers do the job in less time than they claim. He wants to find out the point at which he can be 95 percent certain that the mean of the sample is greater than the mean established in the claim. Once again, the relevant point for 95 percent certainty is given by a z-score of 1.64 (and 99 percent certainty by a z-score of 2.33).

Such tests, in which we are interested only in one direction in the curve, are called one-tail tests. In them, the Alternative Hypothesis will involve a statement with the phrase "is less than" or "is more than," but not both.

In some tests, however, we are concerned with whether the sample mean is above or below the mean in the claim (i.e., both directions at once). And thus we have to be concerned about both ends of the distribution, both those above and those below the mean in the claim. Such tests are called two-tail tests. They involve an Alternative Hypothesis with a phrase "is greater or less than" or "is different from."

In such a test, the z-scores which establish the significance limits are, for .05 probability, 1.96, and for .01 probability 2.58. The z-score of 1.96 leaves 2.5 percent of the distribution at each end (or a .025 probability at the left end and a .025 probability at the right end). A z-score of 2.58 leaves 0.5 percent of the distribution at each end (or a .005 probability at the left end and a .005 probability at the right end of the curve).

The mathematical procedures for analyzing one-tail and two-tail tests are the same. The difference comes in the particular z-scores we use to establish different levels of significance.
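Since the only difference lies in the cut-off z-scores, it may help to see where those four numbers come from. The sketch below uses `statistics.NormalDist().inv_cdf` from the Python standard library; the function name `critical_z` and its `tails` parameter are our own illustration.

```python
from statistics import NormalDist

def critical_z(significance_level, tails):
    """Cut-off z-score for a one-tail (tails=1) or two-tail (tails=2) test."""
    # A two-tail test splits the significance level between the two ends.
    tail_area = significance_level / tails
    return NormalDist().inv_cdf(1 - tail_area)

for level in (0.05, 0.01):
    one_tail = critical_z(level, 1)
    two_tail = critical_z(level, 2)
    print(f"level {level}: one-tail z = {one_tail:.2f}, two-tail z = {two_tail:.2f}")
```

For a one-tail test in the lower direction, as with the batteries, the same figure is used with a negative sign.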

H. Self-Test on a Population Claim (a Two-Tail Test)

A report prepared by the economic research branch of a large Canadian bank maintains that the average annual family income in northern BC is $8432. What do you conclude about the validity of this figure if a simple random sample of 400 families in northern BC showed an average income of $8574, with a standard deviation of $2000? Use a .05 level of significance. Would your conclusions be any different at a .01 level of significance?

I. Self-Test on Testing a Specification Claim

A manufacturer of fishing line advertises that his product has an average tensile strength of 30 pounds. We took a sample of 100 sections of the line and tested them. The average tensile strength of this sample was 28 pounds, with a standard deviation of 12 pounds. Does this enable us to dismiss the manufacturer's claims? Answer this question for significance levels of .05 and .01. Think about whether the more appropriate test will be one-tailed or two-tailed.

J. Comparing Two Sample Means: a z-Test

Another important aspect of hypothesis testing in statistics is examining a hypothesis of the second type we discussed above, one which makes a claim about two apparently different populations. For example, suppose someone makes the claim that one group (e.g., women) is better at a certain task than another group (e.g., men), or that Medication A is better at treating a certain disease than Medication B, or that one population group (e.g., those on welfare) drinks more than another group (e.g., wage earners). These sorts of claims are made all the time (sometimes in a negative form, such as the claim that good nutrition has no effect on children's success in school, that is, that children with poor nutrition fare just as well at school as children with good nutrition, and so on). Often these claims form the basis for popular opinions and thus shape social policy. How can we evaluate these assertions to ascertain whether or not there is any truth to them?

There are a number of ways of evaluating such claims. Here we are concerned with only one, the z-test. In this test we use the z-score (which, as we know, is a measurement in units of standard deviation), as we did before, to identify the critical regions of the normal distribution.

Suppose, for instance, we have a memory drug which we believe will help students perform better on examinations. We wish to test whether or not this drug is indeed effective.

We begin, as usual, by collecting two random samples, each of 100 students, all of whom are taking the same examination. We give all students a pill, but one group receives the memory pill, the other a useless sugar pill (i.e., a placebo). All 200 students think they have received the memory pill. Thus, their psychological expectations are similar.

We collect the results of the examination, and tabulate a summary as follows:

Group A (memory pill): Mean Score: 62.8; Standard Deviation: 10 marks

Group B (placebo): Mean Score: 60; Standard Deviation: 9 marks

The difference in the Mean Scores between Group A and Group B is 2.8 marks, in favour of Group A (the memory-pill group). Is this result significant or could it be simply a chance difference resulting from these two samples?

Well, to answer this question, we begin, as usual, by formulating a Null Hypothesis and an Alternative Hypothesis. In this case the Null Hypothesis is that there is no difference, that both groups represent the same population (i.e., that the calculated difference between the sample means is insignificant, and the Memory Pill is therefore ineffective). The Alternative Hypothesis is that there is a real difference between the two groups, that they belong to two different populations, those who received the memory pill and those who did not, and that therefore the Memory Pill did have a significant effect.

The first step is to calculate the standard error for each group. The standard error is the standard deviation of the sample divided by the square root of the number in the sample. Since there are 100 in each sample, the square root in each calculation of the Standard Error will be 10. Thus the Standard Error for Group A is 1 mark and the SE for Group B is 0.9 marks.

Now for a conceptual leap. If we took a number of paired tests, as we did above (with one memory pill group and one placebo group) we would always have two means to compare (one for Group A and one for Group B). And by subtracting one from the other, every pair of samples would give us a figure for the difference between the two sample means.

If the Null Hypothesis is true, if, that is, both the memory-pill group and the placebo group come from the same population, then we would expect the average difference between sample means to be 0 (we would expect this from any collection of paired samples from a common general population). Sometimes the memory group sample would have a higher mean score, and sometimes the placebo group would have the higher mean score. If I consistently established the difference between the means by subtracting the mean for the placebo group from the mean for the memory pill group, I would end up with a collection of positive and negative numbers. However, if these groups indeed come from the same population, the average of those numbers should be 0 (no difference).

Now, if we imagine all the possible numbers for this difference between the sample means, those figures would be normally distributed. If the population is the same (i.e., if the Null Hypothesis is true) then, as mentioned above, the most frequent result should be 0 (no difference between the two means), with decreasing frequencies in either direction, one indicating the frequency of cases where the placebo group mean was higher and the other indicating the frequency of cases where the drug pill group's mark was higher.

Remember that this normal curve is a theoretical representation of what will be the case if the two populations are the same (if there is no real difference between Group A and Group B), that is, if the Null Hypothesis is valid.
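This theoretical curve can be imitated by simulation. In the sketch below we assume, for illustration only, that both groups are drawn from normally distributed marks with the same mean of 60 (so the Null Hypothesis is true by construction), with standard deviations of 10 and 9 borrowed from the two samples above. We then repeat the paired experiment 2000 times and look at the differences between the sample means. The helper `sample_mean`, the seed, and the number of repetitions are arbitrary choices of ours.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def sample_mean(population_mean, population_sd, n):
    """Mean of one random sample of n marks from an assumed normal population."""
    marks = [random.gauss(population_mean, population_sd) for _ in range(n)]
    return statistics.fmean(marks)

# Each entry: (mean of a "pill" sample) minus (mean of a "placebo" sample),
# where both samples share the same underlying mean of 60.
differences = [
    sample_mean(60, 10, 100) - sample_mean(60, 9, 100)
    for _ in range(2000)
]

print(f"average difference: {statistics.fmean(differences):.2f}")
print(f"standard deviation of the differences: {statistics.stdev(differences):.2f}")
```

The differences pile up around 0, with positive and negative values about equally common, and their spread is the quantity the next paragraphs set out to calculate directly.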

Now, if we could calculate the standard deviation of this normal curve we would know the relationship between the size of the difference between the two sample means and its probability (as we do with any normal distribution).

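Well, we can calculate that standard deviation of the normal curve representing the distribution of differences between the sample means. It will be the standard error of the difference between the sample means. This is a combination of their separate standard errors: the square root of the sum of their squares, here the square root of (1 squared plus 0.9 squared), which is the square root of 1.81, or approximately 1.3 marks (1).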

This SE (Diff) figure of 1.3 marks tells us that in the normal distribution curve indicating the frequency of the values for the differences between the sample means, there is approximately a .68 probability that the difference will fall within 1 standard deviation of 0 (the mean difference), or between -1.3 and +1.3 marks, and approximately a .95 probability that such a difference will fall within 2 standard deviations of 0, or between -2.6 and +2.6 marks.

Now the difference we observed between the two sample means is 2.8. The z-score for this difference is 2.8 divided by 1.3, or 2.15. In other words, this value falls between 2 standard deviations and 3 standard deviations.

Whether this result of a z-score of 2.15 is significant, therefore, will depend on the level of significance we set. As before, we know that the z-score of 2 includes approximately 95 percent of all possible scores, or that there is a .95 probability that a result in a normally distributed frequency will fall between a z-score of -2 and a z-score of +2, or, alternatively, that there is a .05 probability that any result in a normal distribution will fall beyond a z-score of 2 on either side of the mean.

In this case, the figure we obtained for the difference between the means is a z-score of 2.15. Since that is clearly more than 2 and since we know the probability of getting a result in the region with a z-score of more than 2 is .05, we can conclude with 95 percent certainty that the two samples we have been studying come from different populations.

Should we then accept or reject the Null Hypothesis? What we affirm will, as before, depend upon how much risk we are prepared to take, in other words, on the significance level we set. If we decide that a significance level of .05 is acceptable, then we shall reject the Null Hypothesis, conclude that these two samples do, indeed, come from two different populations, and that the difference between the two is real and significant. Therefore, we conclude that the memory pill does have a significant effect.

On the other hand, if we set a more stringent confidence level of .01, then we cannot reject the Null Hypothesis. The calculated z-score (2.15) falls within the limit for a z-score at the .01 level (2.58). If we want to be 99 percent sure of our conclusions, we cannot claim to have shown a real difference between the memory-pill group and the placebo group (2).
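The decision reached in this example can be checked with a short calculation. The following is a minimal sketch in Python, not part of the original handbook. The variable and function names are illustrative, and the sketch uses the more exact two-tail critical values (about 1.96 and 2.58) rather than the rounded value of 2 used in the text.

```python
from statistics import NormalDist

observed_difference = 2.8   # difference between the two sample means (from the text)
se_diff = 1.3               # standard error of the difference (from the text)

z = observed_difference / se_diff

def critical_z_two_tail(alpha):
    """z value that cuts off alpha/2 in each tail of the standard normal curve."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.05, 0.01):
    critical = critical_z_two_tail(alpha)
    decision = "reject" if abs(z) > critical else "do not reject"
    print(f"alpha={alpha}: z={z:.2f}, critical z={critical:.2f} -> {decision} the Null Hypothesis")
```

Run as written, the sketch should report a rejection at .05 and no rejection at .01, which matches the reasoning above.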

K. Self Test on a z-Test

1. Following exactly the same method as that demonstrated above, resolve the following question at the .05 significance level.

To assess the impact of windowless schools on the psychological development of school children, an anxiety test was given to a class of 40 children in a windowless school. The same test was given to a similar class of 30 students in a school with windows. The results of the test are as follows:

                       Windowless School     School with Windows

Number in sample:      40                    30

Mean Score:            117                   112

Standard Deviation:    10                    12

If we set ourselves a confidence level of .05, can we conclude that there is a real difference in the anxiety levels of the two populations of students?

2. A study was undertaken to see whether blond-haired men had a more active dating life than black-haired men. In a random sample of 100 blond-haired men and 100 black-haired men, the following information was collected and calculated:

                          Blond-Haired Men      Black-Haired Men

Dates per month (mean):   7.5 dates             6 dates

Standard Deviation:       4 dates               3 dates

On the basis of this information, and at a confidence level of 0.01, examine whether there is a real difference in the dating frequencies of the two groups. Do blonds have more fun?

3. Recently at Malaspina University-College in Building 355 the loud complaint was heard echoing down a corridor that Liberal Studies instructors have very different marking standards and results. To test this claim, we collected 30 sample grades on main seminar essays from two Liberal Studies Instructors on the same team. The results are as follows:

                       Instructor A      Instructor B

Number of papers:      30                30

Mean grade (100):      78.17             78.37

Standard Deviation:    8.40              9.63

On the basis of this information, would you uphold or reject the complaint? Indicate the confidence level of your conclusion. Note that these figures are based on hard data from LBST 302 last semester, so your result is an analysis of what really goes on.
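For readers who want to check their hand calculations on these questions, the method can be sketched in a few lines of Python. This is only an illustrative sketch. The function name `z_from_summary` is invented here, and it assumes that the standard error of each sample mean is the standard deviation divided by the square root of the sample size, and that the two squared standard errors are added and the square root taken to give SE (Diff). The figures used are those of question 1.

```python
from math import sqrt

def z_from_summary(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Return (SE of the difference, z-score) for two independent samples."""
    se_1_squared = sd_1**2 / n_1      # squared standard error of the first mean
    se_2_squared = sd_2**2 / n_2      # squared standard error of the second mean
    se_diff = sqrt(se_1_squared + se_2_squared)
    z = (mean_1 - mean_2) / se_diff   # Null Hypothesis: true difference is 0
    return se_diff, z

# Figures from question 1 (windowless school vs. school with windows)
se_diff, z = z_from_summary(117, 10, 40, 112, 12, 30)
print(f"SE (Diff) = {se_diff:.2f}, z = {z:.2f}")
```

The same function can be called with the figures from questions 2 and 3. The resulting z figure must still be compared with the limit for the chosen level, as in the worked example.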

L. Conducting a z-Test: Two Sample Mean in Excel

In practice, conducting a z-test is easier than the procedure outlined above, because we can get Excel to carry out all the mathematics for us. All we have to do is collect the data on the two samples, request a z-test analysis from the Excel statistics options, and examine the table of the results. The following paragraphs describe the procedure.

1. First enter in the data for your two samples, one in Column A and the other in Column B. The number of entries in each column does not have to be the same, but there must be more than 30 entries in each column.

2. Then, using the Descriptive Statistics tool from the Data Analysis option (on the Tools Menu), generate the table of Descriptive Statistics for each sample. Note carefully the figure for the Variance of each sample.

3. Then from the Data Analysis option, select the last item: z-Test: Two Sample for Means. When you get the dialogue box, in the Variable 1 Range box, indicate the range of the first sample in Column A (e.g., $A$1:$A$35). In Variable 2 Range, indicate the range of the second sample in Column B (e.g., $B$1:$B$48). In the Hypothesized Mean Difference box, enter the figure 0. Since the Null Hypothesis, which we are attempting to refute, says that both samples come from the same population, we are hypothesizing that the difference between the means for the two populations is 0.

4. In the Variable 1 Variance (known) box enter the figure for the Variance for the sample in the A Column (you will find this figure in the Descriptive Statistics box you generated in the second step described above). In the Variable 2 Variance (known) box enter the corresponding figure for the Variance of the second sample.

5. In the Alpha box the number 0.05 should already appear. Leave this alone for the moment (if it is empty or shows a number different from 0.05, then enter the number 0.05). The Alpha figure indicates the Confidence Level for this test. A figure of 0.05 states that you want to be 95 percent certain of the result or, alternatively put, that you want the probability of being wrong to be .05 or lower.

6. In the Output Range type the number of the cell where you want the Output Table to appear (or alternatively, with the line active in the Output Range box, click the mouse on an empty cell). The Output Range table will take up three columns and twelve horizontal rows.

7. Then click on OK. After a couple of seconds, a table should appear in the place designated by the Output entry. This table has the heading: z-Test: Two Sample for Means. You will need to widen the left hand column of the table in order to read the names of the items. In the table there are figures for the following items: Mean, Known Variance, Observations, Hypothesized Mean Difference, z, P(Z<=z) one-tail, z Critical one-tail, P(Z<=z) two-tail, z Critical two-tail.

The Mean figure gives the arithmetical average for each sample. It should be the same as the figure for the Mean in your Descriptive Statistics chart generated earlier. The Known Variance similarly gives the Variance for each sample and corresponds to the Variance figures generated earlier (these are the figures you entered into the z-Test dialogue box). The Observations figure is the number of items in each sample. The Hypothesized Mean Difference should be 0 (the figure you entered in the z-Test dialogue box earlier).

The z figure indicates in standard deviation units how far from the mean the figure for the difference between your two samples is located. Remember that the normal curve for all the differences between all the possible pairs of samples from the population has a mean of 0. Your two samples did not have the same mean; thus they fall away from the mean in the normal distribution. The z figure tells you how far away the difference falls.
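To see where the z figure comes from, the calculation can be reproduced outside Excel. The following Python sketch is an illustration only: the two samples are invented (35 and 48 randomly generated marks, matching the example ranges above), the function name `z_two_sample` is hypothetical, and the sketch assumes that the variances entered in the dialogue box are the sample variances of the two columns.

```python
import random
from math import sqrt
from statistics import mean, variance

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical marks for two samples (invented for illustration only).
sample_a = [random.gauss(70, 8) for _ in range(35)]
sample_b = [random.gauss(66, 9) for _ in range(48)]

def z_two_sample(a, b, hypothesized_difference=0.0):
    """z figure for the difference between two sample means."""
    # Variance of the difference = sum of the two squared standard errors.
    se_diff = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b) - hypothesized_difference) / se_diff

z = z_two_sample(sample_a, sample_b)
print(f"Mean A = {mean(sample_a):.2f}, Mean B = {mean(sample_b):.2f}, z = {z:.2f}")
```

A z figure of 0 would mean the two sample means are identical. The larger the figure in either direction, the further the observed difference lies from the mean of 0.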

Following the z figure there are four lines, two concerning one-tail and two concerning two-tail testing. You will use one or the other of these pairs of figures, not both. The one you use will depend upon the nature of your Alternative Hypothesis.

If your Alternative Hypothesis makes a claim about a particular difference between the two populations, then you need the one-tail figures. For instance, if you have an Alternative Hypothesis like “Women drink more alcoholic beverages than men” or “Instructor A gives higher marks than Instructor B” or “People who smoke more than a pack of cigarettes a day have more heart attacks than people who do not smoke any cigarettes,” then you will need the one-tail figures. Since your Alternative Hypothesis asserts that one of the populations will have a higher value than the other, you are interested only in one end of the distribution curve.

However, if your Alternative Hypothesis simply asserts that there will be a significant difference between the populations (without asserting which will be higher or lower), then you need the two-tail figures. For example, you will need a two-tailed test for any Alternative Hypothesis like the following: “There is a difference in the amounts of alcohol men and women drink,” “Instructor A and Instructor B mark at different standards,” “Older students’ marks in Liberal Studies are different from younger students’ marks.” And so on. Note that the interpretation of the figures is the same, no matter which of the two tests you are conducting, but the figures will be different.

The P(Z<=z) figure indicates the probability of getting a difference at least as large as the one you observed if the Null Hypothesis is correct, that is, if there is no difference between the populations. Thus a figure here of, say, 0.03 would indicate that, if the two populations really are the same, only about 3 percent of all the possible pairs of samples would show a difference this large. The smaller the figure, the harder it is to maintain the Null Hypothesis.

Whether or not the figure for P(Z<=z) enables you to confirm or dismiss the Null Hypothesis will depend upon the confidence level you set. If your level is .05, then a figure of .03 (smaller than the specification) indicates that the Null Hypothesis should be rejected and the Alternative Hypothesis affirmed. However, if the Confidence Level you have set is .01, then a result for P(Z<=z) of .03 (which is higher than the specification) enables you to affirm the Null Hypothesis.
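The one-tail and two-tail P figures can also be approximated outside Excel. The sketch below is a hedged illustration in Python, assuming that these figures are the areas of the standard normal curve lying beyond the z figure (in one tail, or in both tails together). The function name `tail_probabilities` is invented, and the z figure of 2.15 is borrowed from the memory-pill example earlier in this section.

```python
from statistics import NormalDist

def tail_probabilities(z):
    """Return (one-tail, two-tail) areas beyond a z figure on the standard normal curve."""
    one_tail = 1 - NormalDist().cdf(abs(z))
    return one_tail, 2 * one_tail

one_tail_p, two_tail_p = tail_probabilities(2.15)
print(f"one-tail P = {one_tail_p:.4f}, two-tail P = {two_tail_p:.4f}")

# Compare the two-tail figure with each Alpha in turn.
for alpha in (0.05, 0.01):
    verdict = "reject" if two_tail_p < alpha else "do not reject"
    print(f"Alpha {alpha}: {verdict} the Null Hypothesis")
```

With these inputs the two-tail figure is roughly .03, so the comparison with .05 and with .01 follows the pattern described in the paragraph above.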

Another quick way of determining whether to affirm or reject the Null Hypothesis is to examine the z Critical figure. This number indicates the value above which the z figure is too high for one to accept the Null Hypothesis. So to determine whether one should affirm or reject the Null Hypothesis, simply compare the z figure with the z Critical figure. If the z figure is less than the z Critical figure, one affirms the Null Hypothesis; if the z figure is greater than the z Critical figure, then one rejects the Null Hypothesis.

If you change the Confidence Level in a z-test of this sort, you will notice that the z Critical values will change. For instance, if you go back and start the z-test over again, but this time in the dialogue box enter a value for Alpha of 0.01 (rather than 0.05), then you are demanding a Confidence Level of 99 percent or, alternatively, you want a result in which there is only a 1 percent chance (p = 0.01) of your being wrong.

If you do that and generate a second z-Test: Two Sample for Means table, you will notice that all the values in that table are the same as for the first table, except for the z Critical values, which have increased. What that means, of course, as you should understand by now, is that if I want to be more confident of my result, I have to widen the interval within which I judge results to be acceptable.
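The way the z Critical values move with Alpha can be illustrated with a short Python sketch. It is not Excel's own calculation, only an approximation using the standard normal curve, and the function name `z_critical` is invented for the purpose.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Return (one-tail, two-tail) critical z values for a given Alpha."""
    standard_normal = NormalDist()
    one_tail = standard_normal.inv_cdf(1 - alpha)      # all of Alpha in one tail
    two_tail = standard_normal.inv_cdf(1 - alpha / 2)  # Alpha split between both tails
    return one_tail, two_tail

for alpha in (0.05, 0.01):
    one_tail, two_tail = z_critical(alpha)
    print(f"Alpha {alpha}: z Critical one-tail = {one_tail:.2f}, two-tail = {two_tail:.2f}")
```

Lowering Alpha from 0.05 to 0.01 raises both critical values, which is the widening of the acceptable interval described above.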

Note that whenever you state a conclusion to a z-test, you must obviously indicate the confidence level you used. As we have discussed in the text, a Null Hypothesis which you reject at a .05 value for Alpha (a confidence of 95 percent), you may have to accept at a .01 value for Alpha (a confidence of 99 percent).

Notes to Section Seven

(1) We will not go into the mathematical calculations which justify this formula for the standard error of the normal curve of the differences between the means. The principle of the formula, which is derived from the Central Limit Theorem, is that if we are combining two normally distributed characteristics, then the variance in the resulting distribution will be the sum of the variances in the two original frequency distributions. We add up the two squared standard errors to get the variance of the new distribution, and then we take the square root of that total to get the standard deviation of the new combined frequency distribution.
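The principle in this note can be checked informally by simulation. The Python sketch below is an illustration under stated assumptions, not a proof: the two standard errors (0.9 and 1.2) are invented values, and the two sample means are assumed to vary independently around the same true mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical standard errors for two sample means (illustrative values only).
se_a, se_b = 0.9, 1.2

# Draw many pairs of sample means around the same true mean and record the differences.
diffs = [random.gauss(0, se_a) - random.gauss(0, se_b) for _ in range(200_000)]

observed_sd = statistics.pstdev(diffs)
predicted_sd = (se_a**2 + se_b**2) ** 0.5  # square root of the summed variances

print(f"observed = {observed_sd:.2f}, predicted = {predicted_sd:.2f}")
```

The spread of the simulated differences should come out very close to the square root of the sum of the two squared standard errors.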

(2) If we want to become more confident of our results than in the above example, we can narrow the confidence interval by increasing the size of the sample (thus lowering the value of the Standard Error). However, as we discussed in the last section, in Section N (p. 69), to achieve a useful reduction of the Standard Error, we will need a very large increase in the sample size (to reduce the SE by 50 percent, we must quadruple the sample size).
