[This handbook has been prepared by Ian Johnston of Malaspina University-College for students in Liberal Studies. This text is in the public domain, released May 2000]
One important job of statistics is the comparison
between two different sets of information about apparently different things
between which there may be a connection.
For example, we might want to compare how information about people’s
income compares with the information about their education, in order to explore
the claim that the more education one has, the greater the probability of a
better income. Or, alternatively, we
might wish to compare information about poverty with information about crime,
to see if there is a possible link between the two, or, again, compare
information about cigarette smoking habits with information about particular
health problems (e.g., heart disease).
These sorts of studies involve comparing two
variables (e.g., income and crime, smoking habits and health) in order to see
if there might be some connection and perhaps even a suggestion of cause. As a cigarette smoking habit rises, do
health problems also rise? As income
decreases, does the frequency of crime increase? As people grow older to they become less or more tolerant of
others?
What we are looking for in dealing with such
questions statistically is what is called a correlation, an apparent connection between the different values in
one set of data and in a second set of data, so that as the values in the first
set increases, the values in the second set also tend to increase (or, in the
case of a negative correlation, to decrease).
Here’s a simple example. In an English course, once the instructor has marked a set of
class essays, she has a record of the marks, one for each student. Then she assigns a second essay and marks
it. Now, she has a record of the marks
for two essays written by the same students.
The instructor might wish to know whether the
students who did well in the first essay also tended to do quite well in the
second essay, whether the students who did well in the first essay tended to
badly in the second essay, or whether there was no apparent connection between
the results of the first essay and the results of the second essay (such
information would be a useful test of the value of the first essay as a
predictor of success in the second). In
other words, she might like to see whether there is any significant correlation
between the two sets of marks.
Common sense suggests to us that some sets of
results should be quite closely correlated.
For instance, we would expect the results students received in English
111 to be positively correlated to the results in English 112 (so that students
who did well in the first should tend to do well in the second; students who
fare poorly in the first course should tend to fare poorly in the second). Similarly, we would expect people who
exercise regularly to have a healthier cardiovascular system than those who do
not or those who are wealthy to spend more on consumer goods, and so on.
But in some cases our common sense expectations
may be wrong, and often we may be interested, not just in whether there is a
correlation or not, but in how strong that correlation might be.
In considering correlation we recognize three
distinct possibilities. The two sets of
variables (that is, the two sets of data we are comparing, like exercise and
cardiovascular health) may show a positive correlation. That is, as the values in the first set of
data rise, the values in the second set of data tend to rise as well (i.e., those
who exercise regularly have a greater cardiovascular health). Or the two sets of data may display a
negative correlation. That is, those
who score higher on the first set of results tend to get lower scores on the
second set of results (e.g., the number of alcoholic drinks one has consumed
and one’s ability to carry out a manual dexterity test). Or, finally, there may be no correlation:
the two sets of data do not appear to have any relationship.
Correlation is an extremely important analytical
tool which enables us to begin to sort out claims about important connections,
which may or may not be true: the amount of smoking and the incidence of lung
cancer, HIV infection and the onset of AIDS, the age of a car and its value,
television programming of playoff games and attendance at lectures, poverty and
crime, IQ tests and income levels, intelligence and heredity, age and
mechanical skills, and so on. People
make claims about such matters all the time.
The principle of correlation enables us to investigate such claims in
order to understand whether they are true or not and, if true, just what the
strength of that relationship might be.
In the following section we will work our way
through a few examples, using Excel to do all the difficult work. At this point make sure you understand just what
the concept of correlation (positive, negative, zero) means.
Indicate for each of the pairs of variables listed
below what common sense tells you to expect in the way of an overall
correlation between the two sets (positive, negative, perfect, zero):
1. height
in centimetres and height in inches;
2. levels of income and amount of spending on consumer goods;
3. the age of a car and its value;
4. two scores thrown on two dice simultaneously;
5. snowstorms and attendance at college classes;
6. ability to see in the dark and the amount of carrots eaten;
7. daily cigarette consumption and cardiovascular health;
8. heights and weights of elementary school children;
9. weekly consumption of calories in food and drink and body weight;
10. age and physical strength in senior citizens.
Suppose we wish to explore the claim that people’s
political participation increases with the number of years they spend in
school. We think this claim is true, but
we wish to substantiate it. In other
words, we wish to see whether there is a correlation between the participation
in the political process and years of formal education.
First, we collect data from each person in the
study. We devise a test to measure a
person’s participation in the political process; we have people provide the
appropriate information on this variable, and we ask them the tell us their
years of formal schooling. This
collection yields the following information:
Respondent |
Political
Participation Score |
Years of Formal
Schooling |
A |
1 |
5 |
B |
2 |
8 |
C |
2 |
7 |
D |
3 |
9 |
E |
3 |
10 |
F |
4 |
11 |
G |
5 |
11 |
H |
6 |
11 |
I |
5 |
12 |
J |
6 |
12 |
K |
7 |
13 |
L |
8 |
14 |
M |
8 |
15 |
N |
9 |
15 |
O |
7 |
13 |
P |
10 |
16 |
Q |
9 |
16 |
R |
8 |
14 |
S |
9 |
13 |
T |
9 |
14 |
From a glance at the figures, we might conclude that
there seems to be some relationship, for the higher scores in the middle column
appear to be accompanied, in general, by higher scores in the right-hand
column. However, we want to be sure
about this point and not simply rely upon visual impressions from such a table.
One standard way visually to express the
correlation between two sets of variables is to draw a diagram in which each
result is plotted on a standard X-Y graph.
The X-Axis represents the value of one variable and the Y-Axis the value
of the other variable. Each score in
the set (in our example from Respondent A to Respondent T) is plotted on the
graph, so that a distinct point is located for each member in the data set
according to the two numerical values associated with each respondent.
For example, for the above data, we can construct
a graph in which the value for the Years in School is plotted on the X-Axis and
the value for Political Participation is plotted on the Y-Axis. Thus, we can locate each respondent’s
position exactly on the graph. We will
come up with twenty points.
The following chart (created by Excel) illustrates
the results. Notice that there are 20
points plotted and that each corresponds to a particular Respondent’s scores in
the two columns of the table above.
Notice that two people have 16 years of formal
schooling (two values on the vertical line drawn through 16 on the
X-Axis). Four respondents are shown on
the graph with a Political Participation Score of 9, two with a Political
Participation Score of 8, two with a Political Participation Score of 2, and so
on. We call such an illustration a scatter diagram or scatter plot. Make sure you
are familiar with what this graph represents and how to read it. Do not proceed until you feel very
comfortable with this visual presentation of information.
Look for a moment at the overall shape of all the
points in the chart. It seems to be
characterized by a generally linear cluster which rises upward and to the
right. This shape suggests strongly
that as the Y values rise (i.e., go vertically upward), the X values increase
also (i.e., move to the right). This
point is not true for every single point in the cluster, of course, but as an
observation about the overall general shape of all the points, that claim seems
to hold. Thus, these results seem to
show a positive correlation: the higher the value for the years of formal
schooling, the higher the score on the Political Participation test—that seems to
be the general trend of the entire range of respondents (even if some of the
plotted points do not follow that overall trend).
One way to
emphasize the overall orientation of all the plotted points in the scatter
diagram is to draw through the plotted data a straight line which comes as
close as possible to all of the plotted points. Such a line is called the
line of best fit (or the regression
line). If we do that for the above
graph (and Excel will draw such a line for us), we can recognize immediately
the overall orientation of the results as sloping upward and to the right.
Notice that this chart presents the same data as
the last chart, except that there is a plotted straight line (of square points)
through the points plotted from the data we collected from the respondents (the
diamond shaped points). This line,
which is derived from the plotted points, is the straight line which comes closest
to all the points which correspond to the measured data. It is called the line of best fit or the regression line.
We can see from this line of best fit that it very
clearly slopes upward and to the right.
This confirms our sense from the overall shape of the cluster that it
indicates a positive correlation: an upward slope clearly demonstrates that as
the Y values increase the X values tend to increase.
We can use the regression line, once we have drawn
it, as a general predictor. That is,
with the line in place, we can use it to read off values for cases not included
in the study. For example, if we want
to know what this data reveals about the probable political participation score
someone with 14 years of formal schooling, we can read directly up from the
value 14 on the X-axis, see where it meets the regression line, and read
horizontally across to the Y-axis. That
value in this chart is (about) 7.8.
Thus, on the basis of this study, we can offer a prediction that someone
with 14 years of formal schooling will probably have about 7.8 as a political
participation score. Obviously, this is
not a sure-fire guarantee of such a score; it is, however, something of an
educated guess.
Using the regression line in this manner obviously
yields only approximate results, and it may in some cases be misleading if we
rely on it too heavily (you notice, for example, that most of the plotted
points, which represent the hard data, the actual people in the study, do not
fall directly on the line). It is, at
best, a rough guide in a study of the sort we have been considering.
A Common Misuse of the Regression
Line It
is often very misleading to use the
regression line to make predictions outside the range of the data. For instance, I can extrapolate (extend)
the straight line, once I have drawn it, and then make predictions well
beyond the plotted points (say, about the political participation of someone
with 21 years of schooling). This is
often an illegitimate procedure leading to very wrong conclusions (as we
shall see from a couple of examples). |
We can draw in a Regression Line by hand, visually
estimating where it should come. This
is obviously a somewhat subjective procedure, and different people drawing the
line will come with slightly different results. Excel, however, will plot the line for us very quickly (as it did
for the above diagram). We will be
reviewing that procedure later.
Please remember always that a positive correlation
does not mean that there will no exceptions to the overall trend. There may be one or more results which do
not fit the overall trend of the entire collection of data. Correlation deals with the general tendency
of the entire collection of data.
Here is another set of measurements. This time the middle column represents the
student’s scores in percentages for class participation, and the column on the
right represents the same score out of 20.
Name of Student |
Participation Mark
(100) |
Participation Mark
(20) |
AB |
65 |
13 |
BC |
90 |
18 |
CD |
83 |
16.6 |
DE |
74 |
14.8 |
EF |
84 |
16.8 |
FG |
78 |
15.6 |
GH |
60 |
12 |
HI |
72 |
14.4 |
IJ |
82 |
16.4 |
JK |
94 |
18.8 |
KL |
50 |
10 |
LM |
75 |
15 |
MN |
58 |
11.6 |
If we plot these points on a scatter diagram as
before, putting the Participation mark (100) on the X-axis and the
Participation mark (20) on the Y-axis, we get the following diagram.
Notice here that the general shape of the plotted
points moves upward and to the right—indicating once more a positive
correlation. In this example, there is
an added observation: the data we plotted forms a perfect straight line, with
every plotted point from the data we collected falling exactly on the same
straight line. Clearly, if we wanted to
draw a line of best fit here, it would coincide exactly with all the data.
Such a result produces what we call a perfect positive correlation: every
increase in the value of the X-axis brings about an exactly corresponding
increase in the Y-axis value. We would
expect this result, of course, because the values are basically the same
measurement. A similar result would
occur if, for example, we plotted the weight of ten students in grams against
the weight of the same students in pounds or a range of temperature readings in
Celsius against the same readings in Fahrenheit.
Here is another pair of sets of variables. The first indicates the distance travelled
in a new prototype car through the desert at an even speed. The second column indicates the amount of
gasoline left in the car’s gas tank at each of the inspection points.
Place |
Distance
from Home (Miles) |
Gasoline Left in the Tank (Gallons) |
|
|
|
Home |
0 |
15 |
Podunk |
50 |
14 |
Buzzard Breath |
150 |
12 |
Paradise Valley |
275 |
9.5 |
Last Hope |
350 |
8 |
Deadman’s Gulch |
425 |
6.5 |
No More Gas |
540 |
4.2 |
Last Chance |
680 |
1.4 |
Hope Gone |
700 |
1 |
Miles from Nowhere |
750 |
0 |
Once again, we can illustrate these values in a
scatterplot diagram to observe the nature of the correlation.
Here we see again that we are dealing with a
perfect correlation, because all the plotted points fall on the same straight
line. But this time the line slopes
downward and to the right. This
indicates that the highest values on the Y Axis are associated with the lowest
values on the X-Axis, or that as one value increases the other decreases (as
the distance increases the gasoline left in the tank decreases). This example illustrates a perfect negative correlation.
Just as in the first example (about political
participation scores) we can have a positive correlation which is not exact, so
we can find a negative correlation which is not exact. In such a case the plotted points will not
all fall on the same straight line, but the general shape of the cluster (and
the regression line we draw through them) will slope downward and to the right.
For example, if we plotted a set of figures
indicating the amount of alcohol consumed against a second set of figures
indicating success in a simple test of physical dexterity, we would expect to
get a negative correlation: the more drinks consumed, the lower the score in
the physical dexterity test. Here is an
example of such a scatter plot.
The shape of this data in the scatter diagram
indicates what we would expect, a negative correlation. As the values on the X-Axis (the number of
drinks) increases, the value on the Y-Axis (the physical dexterity score)
decreases.
When there is no correlation at all between the
two sets of variables, then we will have trouble recognizing an upward or
downward overall shape to the plotted data, and the line of best fit will fall
somewhere in a horizontal position among the plotted points.
For example, here are two sets of variables: the
first lists the number of letters in the name of geographical location in
Canada, and the second lists its longitude (approximately). We want to know if there is a correlation
between how many letters there are in a Canadian place name and its
geographical location. Our common sense
suggests that there should not be a correlation, but we just want to check.
Name |
Number of Letters in Name |
Longitude |
|
|
|
Prince Rupert |
12 |
130 |
Courtenay |
9 |
125 |
Kirkland Lake |
12 |
80 |
Brandon |
7 |
100 |
Glace Bay |
8 |
60 |
Clarenville |
11 |
54 |
Dundas |
6 |
80 |
Lac Ste Jean |
10 |
72 |
Biggar |
6 |
108 |
Pickle Crow |
10 |
90 |
Williams Lake |
12 |
122 |
Yarmouth |
8 |
66 |
Sherbrooke |
10 |
72 |
Oba |
3 |
84 |
Calgary |
7 |
114 |
If we plot these points on an X-Y graph to produce
a scatter diagram, we can then inspect the distribution and draw in a line of
best fit. The plotted data looks as
follows:
Try to estimate where you would draw the
regression line through these plotted points.
Notice that one cannot so easily produce one like those we have already
drawn, that is, with a distinct slope indicating a positive or negative
correlation. And, thus, based on the
very small sample we chose, from a visual inspection of the plots, there does
not seem to any relationship between the number of letters in a Canadian place
name and its longitude.
The scatterplot diagrams and the regression line
give us a general visual idea of what the correlation is between two sets of
variables. In many cases, however, we
require a more accurate measure, so that we can compare one correlation with
another. In other words, knowing that a
correlation exists is valuable; more valuable, however, is to know the size of
that correlation. Such a mathematical
measure of the correlation between two sets of variables is called the Correlation Coefficient. It is most commonly symbolized by the letter
r.
The following paragraph outlines how the
Correlation Coefficient is calculated.
However, this method is something Excel will do for us, once the data is
entered. The method is listed here for
interest:
1.
First we
transform the scores in each of the sets of data into z-scores. Remember that a z-score is a measure of how far any
particular score is from the mean of the entire set and that the units of z-scores are standard deviations (i.e.,
a z-score of 2.5 means that this
particular value is 2.5 standard deviations above the mean; a z-score of -1.8 means that this value
falls 1.8 standard deviations below the mean)
2.
We then
multiply together the corresponding z-scores
in each list (i.e., the z-score for
one measurement is multiplied by the z-score
for the corresponding measurement from the other set of data.
3.
Then, we
add up all the results produced by Step 2 above.
4.
Finally,
we divide the figure obtained in Step 3 by the total number of pairs of scores
(i.e., get the average of figures derived in Step 2.
This process will always produce a number between
-1.00 and +1.00. This number is called the correlation coefficient (symbolized
by the letter r). If r
= -1, then the correlation is a perfect negative correlation; if r = 1, then the correlation is a perfect
positive correlation. If r = 0, then the correlation is zero (no
relationship between the variables).
Here’s an exercise to demonstrate a practical
application of correlation and to show how Excel will do all the mathematics
for us.
Suppose, as teachers or employers, we are
interested in whether a particular diagnostic test we have been using is a good
predictor of success. In other words,
does the mark on an entry-level test give us reliable information about how
well a particular person will do in, say, an academic program? One way of answering this question is to use
correlation.
Here is a chart of data we collected on ten
subjects (not a large enough sample, but it will enable us to go through the
procedures for calculating correlation).
Each one wrote the same diagnostic examination to get into a course, and
each one wrote the same final examination eight months later.
Name |
Diagnostic
Score (100) |
Final
Mark (100) |
|
|
|
AB |
76 |
80 |
BC |
85 |
76 |
CD |
86 |
62 |
DE |
75 |
84 |
EF |
84 |
87 |
FG |
64 |
70 |
GH |
85 |
83 |
HI |
89 |
84 |
IJ |
71 |
70 |
JK |
87 |
80 |
Enter this data onto a new Excel worksheet,
putting the results of the diagnostic test in Column A (from A1 to A10) and the
results on the Final Mark in Column B (from B1 to B10). To get Excel to conduct a correlation
analysis on the data, carry out the following steps.
1.
First,
point and click on the Tools option on the top line, and from the drop-down
menu select Data Analysis, which is normally the last item on the menu. Note that if Data Analysis does not appear
you will have to add that to the Tools option (using the process explained in a
previous section). When the Data Analysis
menu appears left click the mouse on the option Correlation.
2.
You will
then get a box asking you for the Input range and the Output range. The Input range is the group of cells you
want in the analysis (in this example the ten cells in Column A and the ten
cells in Column B). Enter this
information in the Input Range box, not forgetting the dollar signs which
indicate a range: $A$1:$B$10; this command is telling Excel to include in the
analysis the rectangular block of cells defined by A1 and B10 at the corners.
3.
Then,
click the left out button in the circle to the left of the label Output Range
(unless there is a dark dot in the circle already to indicate that that option
has been selected). Then click the left
mouse pointer in the Output Range box (the long horizontal blank rectangle). Once you have clicked the mouse button in that
box, enter the name of the cell where you want the data to appear (once again
the top left corner of the table you are about to generate). Let’s choose cell C1. So in the Output Range box type in
$C$1. Then click on OK.
4.
After a
moment, on the worksheet a small table will appear with its top left hand
corner in Cell C1. This table indicates
the correlation between the columns. In
the case of our example, the correlation between Column 1 (Diagnostic Score)
and Column 2 (Final Mark) is 0.31 (or r
= 0.31). Thus, in this case the
mathematical calculation carried out by Excel reveals a small positive
correlation between the results on our diagnostic test and the results on the
final mark.
5.
Notice in
the small table we generated in this exercise that the correlation of Column 1
and Column 1 is 1 (i.e., perfect), as we would expect, since we are comparing
that column with itself. The same is
true for the correlation between Column 2 and Column 2.
Now that we have a mathematical value for our
correlation coefficient (r = 0.31),
how are we to interpret that? Does this
mean that our diagnostic test is a good one, that we can rely on it?
As mentioned earlier, the strongest positive
correlation is 1.0, and the closer our r
value is to 1.0 the stronger the correlation between the two sets of values we
are analyzing. The closer to 0 our r value, the weaker the
correlation. The following general
categories indicate a quick way of interpreting our calculated r value:
0.0 to 0.2 Very weak to negligible
correlation
0.2 to 0.4 Weak, low
correlation (not very significant)
0.4 to 0.7 Moderate
correlation
0.7 to 0.9 Strong,
high correlation
0.9 to 1.0 Very strong
correlation
Our result of 0.31 is thus very weak or low; as an
evaluative tool our diagnostic test is not all that useful, since the results
on it do not reveal very much about the future results in the final marks. There is some positive correlation, but not
enough to make the diagnostic reliable as a predictor (say, for counselling
students or alerting the teacher to potential problem students).
One useful rule of thumb for estimating the
importance of the r value is to
calculate the square the correlation coefficient (i.e., calculate r2). This squared result will give us a rough
percentage for the amount of variation in the final result which is directly
attributable to the other variable.
For example, suppose our diagnostic test was a
good measure of writing ability. The
correlation between that writing test and the final marks is 0.31. If we square this value, we get the value
0.096, or 96 in 1000 or 9.6 percent. On
the basis of this, we can claim that 9.6 percent of the students’ success in
the course is attributable to the writing skills they had at the very start of
it (i.e., what we measured in the our diagnostic test). We would then have to conclude that success
is this course is not very dependent upon the writing skills the students have
upon entry.
Suppose now we administer another diagnostic test
at the very start of the course, this time measuring reading ability. And we discover when the course is over that
the correlation between the results of that diagnostic test and the final marks
for the course is 0.8 (r = 0.8). If we square 0.8, we get 0.64 or 64
percent. We can then claim that 64
percent of the students’ success in the course is directly attributable to the
reading skills they possessed (and which we measured) when they started. Obviously, this conclusion assumes that the
diagnostic test is an excellent measure of the students’ reading ability.
It will be apparent that correlation analysis is a
very useful tool for investigating all sorts of claims. We may often think that one particular
skill, quality, or value has a direct and important effect upon another (i.e.,
years in school and income at age forty, frequency of smoking and incidence of
heart disease, success in a particular test and success in something different,
and so on). Correlation analysis
enables us to test such claims and to provide some quantifiable measure to
them.
Here is another common example of the usefulness
of correlation. Suppose we are
interested in dealing with a question of heredity: Is there a relationship
between the height of mothers and the heights of their sons and, if so, how
strong is it? Well, we carry out a
series of measurements of the heights of mothers and sons and conduct a
correlation analysis of the results.
Suppose we discover that the correlation coefficient in our study is 0.7
(i.e., r = 0.7). If we square this figure, we get 0.49, or 49
percent.
This means that 49 percent of the variation in the
sons’ heights can be attributed to the heights of their mothers. This correlation is moderate and indicates that
predicting the height of a son on the basis of his mother’s height is
moderately successful. The remaining 51
percent of the factors influencing the sons’ height come from elsewhere (e.g.,
the fathers, environment, food, and so on).
Suppose further that we now conduct a correlation
study with the same sons, but this time analyzing the relationship between
their heights and their fathers’ heights.
We find that r = 0.8. If we square this we get 0.64. We can thus estimate that 64 percent of the
sons’ variation in height is attributable to their fathers’ heights.
You might be drawn into some confusion here: if 64
percent of the sons’ heights is attributable to the fathers’ height and 49
percent to the mothers’ heights, then we have more than 100 percent. How can that be? Well, in some cases the heights of the mother and the father act
together in determining the height of the child. Statistically speaking, the heights of the fathers and the
heights of the mothers are also correlated, since people tend to marry people
closer to their own height—or taller women tend to marry taller men and vice
versa, and shorter women tend to marry shorter men, and vice versa. So the total effect is more complex than
just adding together the two separate percentage contributions.
We want to track the progress of a class of ten
students in a particular class of English 111, in order to ascertain if our
ways of predicting success in the course are useful. We have a record of their Grade XII marks, and we administer a
diagnostic reading test in the first class.
At the end of the semester, we have the final marks of the
students. We now wish to know which of
the two assessments—the Grade XII mark or the Reading Test Mark—provides a
better guide to the students’ success in the English 111 class. We will assume for the sake of this exercise
that the sample is an accurate one (even though it is too small).
Here is the raw data, presented in tabular
form. Copy this information onto a
fresh Excel worksheet, putting the names in Column A, the Grade XII results in
Column B, the Diagnostic Test Marks in Column C, and the Final Marks in Column
D.
Name |
Grade XII Mark (100) |
Diagnostic Reading Test Mark (100) |
Final Mark (100) |
AB |
75 |
75 |
75 |
BC |
72 |
69 |
76 |
CD |
82 |
76 |
83 |
DE |
78 |
77 |
65 |
EF |
86 |
79 |
85 |
FG |
76 |
65 |
79 |
GH |
86 |
82 |
65 |
HI |
89 |
78 |
75 |
IJ |
83 |
70 |
80 |
JK |
65 |
71 |
70 |
When you have entered the information, select the
three numbered columns (do not include any cells with words in them). Go to the Data Analysis option on the Tools
menu, select from that Data Analysis menu the item Correlation (note, once
again, that if the Data Analysis option is not on the Tools menu you have to
add it in).
When you get the Correlation menu, enter in the
Input Range the block of cells you wish to analyze (i.e., from B2 to D11, if
you have used Column A for the names and Row 1 for the titles). Do not forget to put in the dollar signs.
Then click the mouse pointer in the circle to the
left of the Output Range label (unless there is a black dot in it already), and
click the left mouse button in the Output Range box. Then enter the name of cell where you want the top left corner of
the correlation table to appear (e.g., $A$13).
Then click OK.
After a second or two, the Correlation Table
should appear giving you the correlation between all the different pairs of
data. We are interested in the
correlation between Column B (the first column in the Table) and Column D (the
third column in the table) and between Column C (the second column in the
Table) and Column D. Which of these two
is the better predictor of success according to this study. How reliable is it? For the answers see the paragraphs at the
end of this section.
Using this procedure we can examine the correlations
between several sets of data entered on an Excel worksheet. For example, if we have all the marks for
all the Liberal Studies assignments entered on a single sheet, together with
the final mark, we can simultaneously calculate the correlation between the
marks for any particular set of assignments and the final result. This would, among other things, enable us to
estimate which assignments were the best and worst predictors of success in the
program.
It is very important to grasp the point that a
correlation, even a very strong correlation, does not enable us immediately to
make a conclusion about causation. If,
for example, we find a very high correlation between the number of years people
spend in post-secondary education and their income at age 40, we may make some
predictions about income at age 40 based on years of post-secondary education
or we may urge people to stay in university because the more they study at
university the higher their eventual income will tend to be, but we should be
aware that the correlation, in itself, is no proof of these assertions..
This is a vitally important principle: correlation
is not necessarily a proof of causation.
It indicates a relationship which may
be based on cause and effect, but it may not be. If A is a major cause of B, then we can expect that variations in
A will cause changes in B (i.e., there will be a correlation). The reverse, however is not necessarily
true. If X and Y are correlated, we
cannot automatically assume that X is the cause of Y.
This issue is an important point of contention in
the political disputes about AIDS.
There is a very high correlation between HIV infection and the
occurrence of AIDS, and thus many researchers have from the start assumed that
HIV is the principal cause of AIDS. On
the basis of this assumption, most of the research money for AIDS has gone into
investigating HIV. Yet it is still not
clear what causes AIDS, and some people (especially those suffering from AIDS)
have argued very strongly that investigating HIV instead of AIDS is a mistake:
the real cause is something else, something which requires much more money
going directly into research on AIDS rather than into HIV. The social, political, medical, and
financial consequences of this argument are substantial.
Suppose we do find a significant positive
correlation between two variables, X and Y (for example between provincial
government expenditures on education and the average income of citizens), we
recognize that there may be four different reasons why this relationship
exists:
1.
First, X
may indeed cause Y. That is, the fact
that the provincial government spends more on education is indeed the reason
that more citizens stay in university longer and get higher paying jobs when
they graduate. Thus, the average income
of the province increases.
2.
Secondly,
however, the result may be just chance.
This is quite possible if we are sloppy about our sampling (we haven’t
talked about that). But if we make many
observations and sample the population correctly and repeatedly, this reason
for the correlation becomes very unlikely (the more the tests the less likely
that the results occur by chance).
3.
Thirdly,
there may be some third factor we have not taken into account which produces
the variation in both X and Y and which is the real cause of the
correlation. This conclusion seems to
be emerging in the analysis of AIDS and HIV.[1]
4.
Finally,
there may be a causal connection which is responsible for the correlation, but
we may have put it the wrong way around.
We might conclude from the high correlation between the government’s
expenditures on post-secondary education and average family income that the
expenditures on education are causing that increase in income. But it may be the case the higher family
incomes are caused by something else; they are providing the government with
increased tax revenue, which the province (for any number of reasons) has
decided to spend on post-secondary education (thus making education more
accessible and affordable and, hence, more popular).
Does this ambiguity in interpretation mean, then,
that correlation studies are of limited value?
By no means. It simply reminds
us that we have to be careful about the conclusions we draw from a correlation
study. Remember this important point:
every time a particular factor (A) is a cause of something else (B) there will
be a correlation between them. Thus, we
can use correlation as a negative test.
If there is no correlation between two variables, we can assume that one
of them is not a cause of the other. If
there is a correlation we have strong evidence that one of them may be the cause of the other and that
we should, therefore, investigate further.
Here are the numerical results of a hypothetical
study in which various skills measured independently are to be analyzed for a
possible correlation with annual earnings among a group of salesmen in similar
industries. Enter these figures onto an
Excel worksheet, and generate a correlation table for all the data. Which of the four measured qualities has the
strongest positive correlation with the annual salary?
When you have discovered the quality which appears
to have the strongest correlation, produce a scatterplot diagram to illustrate
the relationship between this quality and the annual salary. Generate a regression line for the
scatterplot.
Name of Salesman |
Public Speaking |
Writing Ability |
Math Skills |
General Knowledge |
Annual Earnings |
AB |
76 |
64 |
53 |
77 |
80,000 |
BC |
56 |
64 |
89 |
75 |
60,000 |
CD |
77 |
70 |
60 |
70 |
72,000 |
DE |
90 |
95 |
62 |
82 |
92,000 |
EF |
68 |
68 |
88 |
68 |
65,000 |
FG |
83 |
85 |
61 |
85 |
90,000 |
GH |
61 |
60 |
79 |
85 |
54,000 |
HI |
79 |
73 |
82 |
68 |
80,000 |
IJ |
64 |
83 |
59 |
81 |
65,000 |
JK |
88 |
62 |
89 |
66 |
79,000 |
When you have created a chart in Excel, you can,
as mentioned above, print the chart directly from Excel. However, in many cases you may want to
transfer the chart into WordPerfect or Word document, so that the illustration
appears in the middle of your own text (as in this module).
This procedure is fairly easy to do, so long as
both Excel and the word processing program are in Windows. Here is an outline of the simplest
procedure.
Create or call up the chart in Excel. Make sure the chart has all the shadings and
headings and borders and legends you want.
Once the chart has been formatted to your satisfaction, select the
entire chart by the method outlined earlier (so that the outer perimeter of the
chart is demarcated by small squares.
You can alter elements of the chart once it is in your word processing
document, but the options are more limited and sometimes more difficult to
carry out than in Excel. So make sure
the chart looks just as you want it to in the word processing document before
you transfer it.
Warning Before you start trying to move
material (e.g., charts) from one document to another, make sure you save the
Excel worksheet which contains the chart.
It is not uncommon for mistakes to occur the first couple of attempts,
and you will save yourself considerable time if you can go back to the
original Excel chart and start the transfer again, rather than having to
generate the chart all over again. |
Then, with the chart selected, point the mouse
arrow onto the Edit option at the
top of the screen, click the left mouse button, and from the drop-down menu
select Copy. Once you do that, you will notice that the
chart now has a flashing dotted line around it (indicating that it has been
selected for copying).
Now, place the mouse arrow directly on the button
with a negative (minus) sign in the very top left of your Excel window (just
above and to the left of the word File).
From the drop-down menu, select Minimize. Your Excel document will then disappear, and
a small Excel icon will appear on the bottom left of your screen, with the name of the Excel file containing the chart
under the icon. Whenever you want to
return to the Excel chart, all you have to do is click the mouse on this Excel
icon.
Now call up the word processing program you want
from the Windows Applications menu.
Once in the right program, call up the document in which you wish to
place the chart, and move through the document until you have the exact place
where you wish the chart to appear.
Place the cursor in the exact place.
Then from the main menu at the top, select Edit,
and from the drop-down menu select Paste. The chart should now appear, with the upper
left hand corner in the spot where you placed the cursor. Note that if you have asked for the chart to
be placed on a page where there is insufficient room for it, the Paste command
will place the diagram on the next page (i.e., the diagram will not go over a
page break).
Sometimes in the transfer from Excel to the word
processing program, minor formatting problems may arise. Inspect the chart carefully once you have
transferred it. If everything is as it
should be, fine; if not, then you may need to edit the chart further in the
word processing program. To put the
diagram into the edit mode, double click the left mouse button with the mouse
pointer on the diagram. In the edit
mode you can make adjustments to the chart.
For instance, a common occurrence when you move an
Excel diagram into Word is that the units of the X-axis become aligned
incorrectly. You can deal with this in
the Word diagram edit mode by selecting each figure on the X-axis and moving it
back to the position you want it. To
move a number, simply point to it with the mouse, hold the left button down. The number will change to a dotted rectangle
which (with the left mouse button still held down) you can move up or down, to
the left or to the right, as you wish.
In the same way, you can alter the text in the
diagram once it is in your word
processing document. When the diagram
is suitable and you wish to exit from the edit mode, return to the normal
document via the file menu on the diagram edit program (not the main file menu
at the very top).
Note that in your text you can alter the size and
the position of your chart. If you
point the mouse indicator onto the diagram and click once, you will see that
the chart has a perimeter line with some squares half way along each side. If you position the mouse indicator on one
of those squares, the indicator will turn into a double arrow; press down the left
mouse button and drag the side of the chart in the direction to want (to
enlarge or reduce its size)
To centre the chart on the page, select the
diagram, and choose the centre command, either from the Format-Paragraph option
or from the centre button on the toolbar.
Self-Test On Excel’s
Correlation Function (Section M)
The correlation between the Grade XII Mark and the
Final Mark is 0.23; the correlation between the Diagnostic Reading Test and the
Final Mark is -0.28. Thus, the
Diagnostic Reading Test has a negative correlation; the Grade XII Mark a
positive correlation with the Final Mark, the negative correlation being
slightly stronger than the positive correlation. However, both figures are so low, that the correlation is
minimal. The skills measured by the
Grade XII marks account for about 5 percent of the skills measured by the Final
Mark.
Self-Test on Correlation
Charting and Calculations
The correlation between the first column (Public
Speaking) and Annual Earnings is 0.91, an extremely high reading. The scatterplot and regression line for
these two variables looks something like the following:
[Back
to johnstonia Home Page]
[1] This point was the subject of a very famous and controversial study of education and income carried out by Christopher Jencks. He analyzed the common claim that income at age 40 is directly correlated to the number of years of schooling (a very strong case for staying in school). Jencks argued that that correlation was, in fact, produced by a much more important factor, namely, the income of one’s parents. People whose parents had more money tended to stay in school longer and tended to have higher incomes at age 40. Jencks concluded that the common claim about staying in school increasing one’s earning power (i.e., causing an increase in income) was unsubstantiated. The real cause was parents’ income. Not surprisingly, the book (Inequality) was very hotly debated.