An Introduction to Data Analysis & Presentation
Prof. Timothy Shortell, Sociology, Brooklyn College
Comparing Means: Analysis of Variance (ANOVA)
The two sample t-test allows us to compare means for two subgroups. What do we do if we want to compare more than two groups? Let's say we are interested in looking at the mean ideology score of the working, middle and upper classes.
We could just do a t-test for each pair. This would be three t-tests. The problem with this strategy is the risk of a type I error. When we set alpha to 0.05, we want the risk of this kind of error to be no more than 5%. But if we do three t-tests, the total probability of a type I error could be as high as 15% -- 3 tests × 0.05 = 0.15.
As you can anticipate, this becomes more of a problem with more groups. If we wanted to compare four groups, we would need 6 t-tests -- as much as a 30% chance of a type I error. If we had five groups, we would need 10 tests -- now the likelihood of a type I error could be as high as 50%!
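The multiplication above (number of tests × 0.05) is a quick upper bound. For independent tests, the exact chance of at least one type I error is 1 − (1 − alpha)^k. A minimal sketch in plain Python, using the group counts discussed above:

```python
# Familywise type I error risk across k independent tests at alpha = 0.05.
# The exact chance of at least one false positive is 1 - (1 - alpha)^k,
# which sits just below the simple k * alpha upper bound used in the text.
alpha = 0.05

for k in (3, 6, 10):          # 3, 4, and 5 groups -> 3, 6, and 10 pairwise tests
    risk = 1 - (1 - alpha) ** k
    print(k, round(risk, 3))  # prints: 3 0.143 / 6 0.265 / 10 0.401
```

Either way, the risk grows well past the 5% we intended, which is the point: pairwise t-tests do not keep the error rate where we set it.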
We need a way to compare more than two groups that does not inflate the likelihood of a type I error. This is called analysis of variance, or ANOVA. Instead of calculating t-scores, we will calculate an F-score.
(We need a new kind of score because we are using a different sampling distribution. The t-score was based on the t-distributions, and the F-score is based on the F-distributions.)
Sum of Squares
The t-score is a ratio of variation between groups to that within groups. The F-score is the same kind of ratio. With the F-score, the number of groups is more than two, so the formula is a more generalized way of comparing between groups variation to within groups variation.
We can measure variation with the concept of sum of squares.
SSwithin = Σj Σi (Xij − X̄j)²

This is the sum of squares within groups, where Xij is the i-th score in group j and X̄j is the mean of group j. It measures the amount that scores in a group tend to differ from one another. If we are examining the ideology scores of working, middle and upper classes, this would represent the amount that working class respondents differ from the working class mean, plus the amount middle class respondents differ from the middle class mean, plus the amount upper class respondents differ from the upper class mean.
We can also calculate the sum of squares between the groups:

SSbetween = Σj nj (X̄j − X̄)²

where nj is the number of respondents in group j and X̄ is the grand mean.
This is the amount that the working class mean differs from the grand mean (the overall mean, taking all respondents together), plus the amount that the middle class mean differs from the grand mean, plus the amount the upper class mean differs from the grand mean.
The F-score, then, is:

F = (SSbetween / (k − 1)) / (SSwithin / (N − k))

where k is the number of groups and N is the total number of respondents. The numerator and denominator are the between-groups and within-groups mean squares, respectively.
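To make the mechanics concrete, here is a minimal sketch in plain Python that computes both sums of squares and the F-score. The three small groups are invented for illustration only; they are not drawn from the GSS:

```python
# Hypothetical toy data: small made-up score lists for three groups.
groups = {
    "working": [1, 2, 3],
    "middle":  [2, 3, 4],
    "upper":   [5, 6, 7],
}

all_scores = [x for g in groups.values() for x in g]
grand_mean = sum(all_scores) / len(all_scores)

# Within-groups sum of squares: each score's squared deviation
# from its own group's mean.
ss_within = 0.0
for scores in groups.values():
    group_mean = sum(scores) / len(scores)
    ss_within += sum((x - group_mean) ** 2 for x in scores)

# Between-groups sum of squares: each group mean's squared deviation
# from the grand mean, weighted by the group's size.
ss_between = sum(
    len(scores) * ((sum(scores) / len(scores)) - grand_mean) ** 2
    for scores in groups.values()
)

k = len(groups)       # number of groups
n = len(all_scores)   # total number of respondents
f_score = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f_score, 2))  # prints 13.0 for this toy data
```

Here the group means (2, 3, 6) are far apart relative to the spread inside each group, so the between-groups mean square dominates and F comes out large.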
Imagine, for a moment, what these two totals would be like if there were a lot of variation in ideology scores within class groups, but very little between classes. This would be the case if class had no relationship to ideology; in other words, members of the lower or working classes are no more liberal than members of the middle or upper classes, and so forth.
In this case, the F-score would be small -- close to zero.
Now, imagine that there is a lot of between groups variation, and little within groups variation. This would be the case if the group means for the classes were very different but members within each class were very similar. In other words, this would be the case if class accounted for almost all of the variation in ideology.
In this case, the F-score would be large.
Let's see how this works out in an example.
First, we set our hypotheses. The null hypothesis states that the mean ideology scores of the lower, working, middle, and upper classes are all equal to one another. The research hypothesis is that at least one of these group means is different from the others.
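In the standard notation, with μ for each class's mean ideology score, the hypotheses can be written:

```latex
H_0:\; \mu_{\text{lower}} = \mu_{\text{working}} = \mu_{\text{middle}} = \mu_{\text{upper}}
\qquad
H_1:\; \mu_j \neq \mu_{j'} \text{ for at least one pair of classes } (j, j')
```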
From the 1996 GSS, we generate the following results:
The F-score is 2.73, with a probability of 0.043. Just as with the t-test, we compare the probability with the standard criterion, 0.05. Since the probability of our F-score is less than the criterion, we reject the null hypothesis. We conclude that the mean ideology scores of the classes are not all equal to one another. We can see from the table of group means that there appear to be some differences.
At this point, we only know that there is a significant difference somewhere among the class groups. They might all be different from one another, or there may be only one significant difference among the six pairwise comparisons. In order to make sociological sense of the results, we need to know which groups are different from which others.
In order to identify which comparisons are statistically significant -- remember, the research hypothesis states only that at least one pair of means is different -- we must do what are called post hoc tests.
There are many different kinds of post hoc tests. They all do the same thing: compare sample means in such a way as to not increase the likelihood of a type I error (as doing t-tests on all the pairs of means would).
We will look at a post hoc test called Tukey's HSD. We won't worry about how the test is calculated. Instead, we will work on interpreting the results.
Here is SPSS output displaying the HSD test:
We want to know, at a 95% confidence level, which group means are different. The post hoc test results are read just like t-test results: we compare the significance figure for each comparison with the 0.05 criterion. Any comparison that shows a probability less than or equal to 0.05 is considered a statistically significant difference.
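The reading rule amounts to filtering the pairwise comparisons by their significance figures. The p-values below are hypothetical stand-ins, invented to mirror the pattern in the actual SPSS table (they are not the real output):

```python
# Hypothetical stand-in p-values for the six pairwise class comparisons.
# The real figures come from the SPSS HSD table; these are made up
# so that only one comparison falls at or below 0.05.
hsd_p_values = {
    ("lower", "working"):  0.212,
    ("lower", "middle"):   0.154,
    ("lower", "upper"):    0.031,
    ("working", "middle"): 0.890,
    ("working", "upper"):  0.097,
    ("middle", "upper"):   0.118,
}

alpha = 0.05
significant = [pair for pair, p in hsd_p_values.items() if p <= alpha]
print(significant)  # only the lower/upper comparison survives the filter
```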
We interpret the pattern of differences. In this case, only one comparison is reliable at the 95% confidence level. Respondents from the lower class have a significantly lower ideology score than respondents from the upper class. Referring to our scale, we can translate this into a sociological statement: Respondents from the lower class are significantly more liberal than respondents from the upper class. (Since all the means are on the conservative side of the scale, it would probably be better to say that respondents from the upper class are significantly more conservative than respondents from the lower class.)
In our data, the difference between the lower class and the upper class is large enough to be reliable, but all other differences are not. The means may appear to be different, but we cannot be sufficiently confident that the apparent differences reflect the true state of the social world.
The post hoc test only indicates that a particular comparison is statistically significant -- that it is reliable. We need to assess whether it is sociologically meaningful. This is a judgment based on knowledge of the variables and of the literature. In some cases, numerically small differences are meaningful; in others, they are not. In this example, we have a seven point scale, so there is only so much room for variation. A difference of about two-thirds of a point is meaningful. (We would also want to know how this compares with other studies of the same concepts.)
Exercises
1. Formulate a hypothesis.
2. Interpret the following results:
3. Assess the sociological significance, if any, of the results. How would you explain the results?
All materials on this site are copyright © 2001, by Professor Timothy Shortell, except those retained by their original owner. No infringement is intended or implied. All rights reserved. Please let me know if you link to this site or use these materials.