An Introduction to Data Analysis & Presentation
Prof. Timothy Shortell, Department of Sociology, Brooklyn College

The Crosstabulation
Often, we want to know if one variable is related to another--that is, if attributes of one variable are associated with attributes of the other. So far, we have limited ourselves to one variable at a time. The crosstabulation (also called a contingency table) is a frequency table that shows contingent frequencies: the distribution of one variable within each category of the other.

For example, we might want to know if there is a relationship between gender and vote in the 1980 presidential election, in a sample of U.S. elites. The crosstab illustrates this.

The crosstab is a simple but very useful tool for examining relationships among categorical variables. Let's consider another example. Is there a relationship between ideology and frequency of attendance at religious services in our sample of U.S. elites?

Inference with Categorical Data
If we want to use the contingency table to make a claim about the relationship between variables in the population, then we need to use a significance test with the crosstabulation. We follow the usual steps of hypothesis testing. First we state the hypotheses.

Let's return to an example we've discussed before. If our research question is "Does approval of the President depend on whether or not one lives in an urban place?" then we can formulate the hypotheses:
H0: There is no relationship between urban residence and approval of the President
H1: There is a relationship between urban residence and approval of the President

Using public opinion data (in this case, the ABC2010 dataset), we can calculate the following crosstabulation:

 
             | USR2
         Q1r |         0 |         1 | Row Total | 
-------------|-----------|-----------|-----------| 
           0 |       399 |       105 |       504 | 
             |     0.519 |     0.447 |           | 
-------------|-----------|-----------|-----------| 
           1 |       370 |       130 |       500 | 
             |     0.481 |     0.553 |           | 
-------------|-----------|-----------|-----------| 
Column Total |       769 |       235 |      1004 | 
             |     0.766 |     0.234 |           | 
-------------|-----------|-----------|-----------| 
In this table, 0 indicates "not urban" for USR2 and "disapprove" for Q1r.
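
As a rough check on how to read the table, the sketch below rebuilds it in base R from the printed cell counts and recomputes the column proportions. It assumes only the counts shown above; the object names tab and col_props are made up for this illustration and are not part of the ABC2010 dataset.

```r
# Cell counts copied from the crosstab above.
# Rows are Q1r (0 = disapprove, 1 = approve); columns are USR2 (0 = not urban, 1 = urban).
tab <- matrix(c(399, 370, 105, 130), nrow = 2,
              dimnames = list(Q1r = c("0", "1"), USR2 = c("0", "1")))

# Column proportions: within each USR2 group, the share in each Q1r category.
col_props <- prop.table(tab, margin = 2)
print(round(col_props, 3))

# Row and column totals, for comparison with the "Row Total" and "Column Total" entries.
print(addmargins(tab))
```

The second number in each cell of the printed crosstab is a column proportion, so each column of col_props should sum to 1.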

The significance test is:

Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  3.737326     d.f. =  1     p =  0.05320955 

Pearson's Chi-squared test with Yates' continuity correction 
------------------------------------------------------------
Chi^2 =  3.454688     d.f. =  1     p =  0.06307263 

 
Fisher's Exact Test for Count Data
------------------------------------------------------------
Sample estimate odds ratio:  1.334739 

Alternative hypothesis: true odds ratio is not equal to 1
p =  0.06226612 
95% confidence interval:  0.9851545 1.811241 

Alternative hypothesis: true odds ratio is less than 1
p =  0.977726 
95% confidence interval:  0 1.72726 

Alternative hypothesis: true odds ratio is greater than 1
p =  0.03147629 
95% confidence interval:  1.032496 Inf 
We'll use Pearson's chi-squared test. In the case of a 2x2 table, we use the version with the Yates continuity correction. Here that gives p = 0.063, which is above the conventional .05 level, so we would not reject H0. If the result is statistically significant, we can use the odds and odds ratio to discuss the strength of the relationship between urban residence and approval of the President.
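
The odds and odds ratio can also be worked out by hand from the cell counts. The following is a minimal sketch in base R, assuming only the counts printed in the crosstab above; the names tab, odds_not_urban, odds_urban, and odds_ratio are chosen for this illustration.

```r
# Cell counts copied from the crosstab above (rows = Q1r, columns = USR2).
tab <- matrix(c(399, 370, 105, 130), nrow = 2,
              dimnames = list(Q1r = c("0", "1"), USR2 = c("0", "1")))

# Odds of approving (Q1r = 1) rather than disapproving (Q1r = 0), within each group.
odds_not_urban <- tab["1", "0"] / tab["0", "0"]   # 370 / 399
odds_urban     <- tab["1", "1"] / tab["0", "1"]   # 130 / 105

# Sample odds ratio: how many times larger the odds of approval are for urban residents.
odds_ratio <- odds_urban / odds_not_urban

print(round(c(not_urban = odds_not_urban, urban = odds_urban, ratio = odds_ratio), 3))
```

The ratio computed this way (about 1.335) differs slightly from the 1.334739 in the Fisher's exact test output, because fisher.test() reports a conditional maximum likelihood estimate rather than the simple sample odds ratio.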

The R command to produce this table (assuming you have loaded the gmodels package, which provides CrossTable(), and have loaded and attached the ABC2010 dataset) is:
CrossTable(Q1r, USR2, prop.r=F, prop.t=F, prop.chisq=F, chisq=T, fisher=T)
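
If CrossTable() is not available, the three tests in the output above can be reproduced in base R from the printed cell counts alone. This is a sketch under that assumption, not the command used to produce the output; the object names are illustrative.

```r
# Cell counts copied from the crosstab above (rows = Q1r, columns = USR2).
tab <- matrix(c(399, 370, 105, 130), nrow = 2,
              dimnames = list(Q1r = c("0", "1"), USR2 = c("0", "1")))

# Pearson's chi-squared test without the continuity correction.
pearson <- chisq.test(tab, correct = FALSE)

# Pearson's chi-squared test with Yates' continuity correction (R's default for 2x2 tables).
yates <- chisq.test(tab, correct = TRUE)

# Fisher's exact test, two-sided; also returns an odds ratio estimate and confidence interval.
exact <- fisher.test(tab)

print(pearson)
print(yates)
print(exact)
```

The statistics and p-values should agree with the CrossTable() output above; only the layout of the printed results differs.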

Let's look at another example, this time from the CBS2011 data. The R code is:
CBS2011 <- read.csv("http://www.courseserve.info/files/CBS2011r.csv")
attach(CBS2011)
CrossTable(Q1[URBN==1 | URBN==3], URBN[URBN==1 | URBN==3], prop.r=F, prop.t=F, prop.chisq=F, chisq=T, fisher=T)

which produces

 
                          | URBN[URBN == 1 | URBN == 3] 
Q1[URBN == 1 | URBN == 3] |         1 |         3 | Row Total | 
--------------------------|-----------|-----------|-----------|
                        1 |        43 |       140 |       183 | 
                          |     0.717 |     0.496 |           | 
--------------------------|-----------|-----------|-----------|
                        2 |        17 |       142 |       159 | 
                          |     0.283 |     0.504 |           | 
--------------------------|-----------|-----------|-----------|
             Column Total |        60 |       282 |       342 | 
                          |     0.175 |     0.825 |           | 
--------------------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  9.644134     d.f. =  1     p =  0.001899572 

Pearson's Chi-squared test with Yates' continuity correction 
------------------------------------------------------------
Chi^2 =  8.779236     d.f. =  1     p =  0.003046787 

 
Fisher's Exact Test for Count Data
------------------------------------------------------------
Sample estimate odds ratio:  2.558731 

Alternative hypothesis: true odds ratio is not equal to 1
p =  0.002556915 
95% confidence interval:  1.353408 5.025652 

Alternative hypothesis: true odds ratio is less than 1
p =  0.9995208 
95% confidence interval:  0 4.523515 

Alternative hypothesis: true odds ratio is greater than 1
p =  0.001330111 
95% confidence interval:  1.484226 Inf
Rather than recode the URBN variable, which has more than 2 categories, I told R to select only cases where the value of URBN was 1 (large central city) or 3 (suburb). Q1 is again a measure of approval of the President, where 1 is "approve". To make sure there are the same number of cases for both variables, you need to use the selection code ([URBN==1 | URBN==3]) when identifying both variables in the CrossTable() function.
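
To see why the same selection has to be applied to both variables, here is a small sketch using made-up data. The data frame toy and its eight rows are hypothetical and are not taken from the CBS2011 file; only the selection logic mirrors the command above.

```r
# Hypothetical toy data: 8 invented respondents, NOT the CBS2011 data.
toy <- data.frame(
  Q1   = c(1, 2, 1, 1, 2, 2, 1, 2),
  URBN = c(1, 3, 2, 3, 5, 1, 3, 2)
)

# Logical vector marking the cases to keep: URBN is 1 or 3.
keep <- toy$URBN == 1 | toy$URBN == 3

# Apply the same selection to both variables so they stay aligned case by case.
q1_sel   <- toy$Q1[keep]
urbn_sel <- toy$URBN[keep]

# Both vectors now have the same length, so they can be crosstabulated.
print(length(q1_sel) == length(urbn_sel))
tab <- table(Q1 = q1_sel, URBN = urbn_sel)
print(tab)
```

If only URBN were subsetted, the two vectors would have different lengths (8 and 5 here) and table() would stop with an error, since the rows could no longer be matched up.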

If you want to test a relationship for variables with more than two categories, you can use the chi-squared test without the odds. (The odds ratio is reported by the CrossTable function when you use the fisher=T option.) For example, the command CrossTable(Q1[URBN!=4], URBN[URBN!=4], prop.r=F, prop.t=F, prop.chisq=F, chisq=T) produces:

 
              | URBN[URBN != 4] 
Q1[URBN != 4] |         1 |         2 |         3 |         5 | Row Total | 
--------------|-----------|-----------|-----------|-----------|-----------|
            1 |        43 |        71 |       140 |        73 |       327 | 
              |     0.717 |     0.497 |     0.496 |     0.403 |           | 
--------------|-----------|-----------|-----------|-----------|-----------|
            2 |        17 |        72 |       142 |       108 |       339 | 
              |     0.283 |     0.503 |     0.504 |     0.597 |           | 
--------------|-----------|-----------|-----------|-----------|-----------|
 Column Total |        60 |       143 |       282 |       181 |       666 | 
              |     0.090 |     0.215 |     0.423 |     0.272 |           | 
--------------|-----------|-----------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  17.84538     d.f. =  3     p =  0.0004733554 
In this case, I told R to exclude cases where URBN is 4 (other) because there were no cases.
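
The chi-squared result for this larger table can be checked in base R from the printed counts. The sketch below assumes only those counts; the names tab4, test4, and df_by_hand are illustrative.

```r
# Cell counts copied from the 2x4 crosstab above (rows = Q1, columns = URBN).
tab4 <- matrix(c(43, 17, 71, 72, 140, 142, 73, 108), nrow = 2,
               dimnames = list(Q1 = c("1", "2"), URBN = c("1", "2", "3", "5")))

# Pearson's chi-squared test; no continuity correction is applied beyond 2x2 tables.
test4 <- chisq.test(tab4)
print(test4)

# Expected counts under H0 (no relationship), for comparison with the observed counts.
print(round(test4$expected, 1))

# Degrees of freedom by hand: (rows - 1) * (columns - 1).
df_by_hand <- (nrow(tab4) - 1) * (ncol(tab4) - 1)
print(df_by_hand)
```

Comparing the observed and expected counts column by column shows which categories of URBN contribute most to the chi-squared statistic.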

All materials on this site are copyright © 2001, by Professor Timothy Shortell, except those retained by their original owner. No infringement is intended or implied. All rights reserved. Please let me know if you link to this site or use these materials.