An Introduction to Data Analysis & Presentation
Prof. Timothy Shortell, Department of Sociology, Brooklyn College

The Crosstabulation

Often, we want to know if one variable is related to another, that is, if attributes of one variable are associated with attributes of the other. So far, we have limited ourselves to one variable at a time. As it turns out, we can build a frequency table that shows contingent frequencies: the frequencies of one variable within each category of another. This table is called a crosstabulation (or crosstab).

For example, we might want to know if there is a relationship between gender and vote in the 1980 presidential election, in a sample of U.S. elites. The crosstab illustrates this.

The crosstab is a simple but very useful tool for examining relationships among categorical variables. Let's consider another example. Is there a relationship between ideology and frequency of attendance at religious services in our sample of U.S. elites?

Inference with Categorical Data

If we want to use the contingency table to make a claim about the relationship between variables in the population, then we need to use a significance test with the crosstabulation. We follow the usual steps of hypothesis testing.

First we state the hypotheses. Let's consider an example we've discussed before. If our research question is "does approval for the President depend on whether or not one lives in urban places?"
then we can formulate the hypotheses:

H0: There is no relationship between urban residence and approval of the President
H1: There is a relationship between urban residence and approval of the President

Using public opinion data (in this case, the ABC2010 dataset), we can calculate the following crosstabulation:

```
             | USR2
         Q1r |         0 |         1 | Row Total |
-------------|-----------|-----------|-----------|
           0 |       399 |       105 |       504 |
             |     0.519 |     0.447 |           |
-------------|-----------|-----------|-----------|
           1 |       370 |       130 |       500 |
             |     0.481 |     0.553 |           |
-------------|-----------|-----------|-----------|
Column Total |       769 |       235 |      1004 |
             |     0.766 |     0.234 |           |
-------------|-----------|-----------|-----------|
```

In this table, 0 indicates "not urban" for USR2 and "disapprove" for Q1r. The second number in each cell is the column proportion. The significance tests are:

```
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 =  3.737326     d.f. =  1     p =  0.05320955

Pearson's Chi-squared test with Yates' continuity correction
------------------------------------------------------------
Chi^2 =  3.454688     d.f. =  1     p =  0.06307263

Fisher's Exact Test for Count Data
------------------------------------------------------------
Sample estimate odds ratio:  1.334739

Alternative hypothesis: true odds ratio is not equal to 1
p =  0.06226612
95% confidence interval:  0.9851545 1.811241

Alternative hypothesis: true odds ratio is less than 1
p =  0.977726
95% confidence interval:  0 1.72726

Alternative hypothesis: true odds ratio is greater than 1
p =  0.03147629
95% confidence interval:  1.032496 Inf
```

We'll use Pearson's chi-squared test. For a 2x2 table such as this one, we use the version with the Yates continuity correction. Here that gives p = 0.063, which is above the conventional 0.05 level, so we would not reject H0. When a result is statistically significant, we can use the odds and odds ratio to discuss the strength of the relationship between urban residence and approval of the President.
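As a rough check on the output above, the Pearson statistic and the odds can be recomputed from the four cell counts using only base R. This is a minimal sketch, not the course's own code: the object names (`obs`, `expected`, `odds_ratio`, and so on) are my own, and the odds ratio it computes is the simple cross-product ratio from the sample, which differs slightly from the conditional estimate that Fisher's test reports (1.334739).

```r
# Observed counts from the crosstab above.
# Rows are Q1r (0 = disapprove, 1 = approve); columns are USR2 (0 = not urban, 1 = urban).
# matrix() fills column by column, so the first two values are the USR2 = 0 column.
obs <- matrix(c(399, 370, 105, 130), nrow = 2,
              dimnames = list(Q1r = c("0", "1"), USR2 = c("0", "1")))

# Expected counts under H0: (row total * column total) / N for each cell.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson's chi-squared statistic, degrees of freedom, and p-value, by hand.
chi2    <- sum((obs - expected)^2 / expected)
df      <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)

# The same test with Yates' continuity correction, using base R's chisq.test().
yates <- chisq.test(obs, correct = TRUE)

# Odds of approval (approve / disapprove) within each residence category.
odds_not_urban <- obs["1", "0"] / obs["0", "0"]  # 370 / 399
odds_urban     <- obs["1", "1"] / obs["0", "1"]  # 130 / 105

# Sample (cross-product) odds ratio: urban odds relative to non-urban odds.
odds_ratio <- odds_urban / odds_not_urban

print(c(chi2 = chi2, df = df, p = p_value))
print(yates)
print(c(odds_not_urban = odds_not_urban, odds_urban = odds_urban,
        odds_ratio = odds_ratio))
```

The hand-computed `chi2` should agree with the uncorrected `Chi^2` line in the output, and `yates$statistic` with the corrected one.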
The R command to produce the Q1r by USR2 crosstabulation and its tests (assuming you have loaded the gmodels package, loaded the ABC2010 dataset, and attached it) is:

`CrossTable(Q1r, USR2, prop.r=F, prop.t=F, prop.chisq=F, chisq=T, fisher=T)`

The first variable named becomes the rows and the second becomes the columns. Column proportions are displayed by default, so `prop.c` is left alone.

Let's look at another example, this time from the CBS2011 data. The R code is

```
library(gmodels)  # provides CrossTable()
CBS2011 <- read.csv("http://www.courseserve.info/files/CBS2011r.csv")
attach(CBS2011)   # lets us refer to Q1 and URBN by name
# Keep only cases where URBN is 1 or 3, for both variables
CrossTable(Q1[URBN==1 | URBN==3], URBN[URBN==1 | URBN==3],
           prop.r=F, prop.t=F, prop.chisq=F, chisq=T, fisher=T)
```

which produces

```
                          | URBN[URBN == 1 | URBN == 3]
Q1[URBN == 1 | URBN == 3] |         1 |         3 | Row Total |
--------------------------|-----------|-----------|-----------|
                        1 |        43 |       140 |       183 |
                          |     0.717 |     0.496 |           |
--------------------------|-----------|-----------|-----------|
                        2 |        17 |       142 |       159 |
                          |     0.283 |     0.504 |           |
--------------------------|-----------|-----------|-----------|
             Column Total |        60 |       282 |       342 |
                          |     0.175 |     0.825 |           |
--------------------------|-----------|-----------|-----------|

Statistics for All Table Factors

Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 =  9.644134     d.f. =  1     p =  0.001899572

Pearson's Chi-squared test with Yates' continuity correction
------------------------------------------------------------
Chi^2 =  8.779236     d.f. =  1     p =  0.003046787

Fisher's Exact Test for Count Data
------------------------------------------------------------
Sample estimate odds ratio:  2.558731

Alternative hypothesis: true odds ratio is not equal to 1
p =  0.002556915
95% confidence interval:  1.353408 5.025652

Alternative hypothesis: true odds ratio is less than 1
p =  0.9995208
95% confidence interval:  0 4.523515

Alternative hypothesis: true odds ratio is greater than 1
p =  0.001330111
95% confidence interval:  1.484226 Inf
```

Rather than recode the URBN variable, which has more than 2 categories, I told R to select only cases where the value of URBN was 1 (large central city) or 3 (suburb). Q1 is again a measure of approval of the President, where 1 is "approve".
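The case selection used in that command can be tried out on a few made-up values. This is only an illustrative sketch: the vectors `urbn` and `q1` below are hypothetical toy data standing in for URBN and Q1, not values from the CBS2011 file, and base R's `table()` is used in place of `CrossTable()` so the example runs without any extra package.

```r
# Hypothetical toy vectors standing in for URBN and Q1 (not the CBS2011 data).
urbn <- c(1, 2, 3, 3, 5, 1, 3, 2)
q1   <- c(1, 2, 2, 1, 1, 2, 1, 1)

# Logical vector: TRUE for cases in a large central city (1) or a suburb (3).
keep <- urbn == 1 | urbn == 3

# Apply the same selection to both variables so they keep the same length
# and still line up case by case.
q1_sel   <- q1[keep]
urbn_sel <- urbn[keep]

# Counts only; CrossTable() would add the proportions and the tests.
tab <- table(Q1 = q1_sel, URBN = urbn_sel)
print(tab)
```

Selecting on only one of the two vectors would leave them with different lengths, and the crosstab could not be built.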
To make sure there are the same number of cases for both variables, you need to use the selection code (`[URBN==1 | URBN==3]`) when identifying both variables in the `CrossTable()` function.

If you want to test a relationship for variables with more than two categories, you can use the chi-squared test without the odds. (Odds are calculated in the `CrossTable()` function with the `fisher=T` option.) For example, the command

`CrossTable(Q1[URBN!=4], URBN[URBN!=4], prop.r=F, prop.t=F, prop.chisq=F, chisq=T)`

produces:

```
              | URBN[URBN != 4]
Q1[URBN != 4] |         1 |         2 |         3 |         5 | Row Total |
--------------|-----------|-----------|-----------|-----------|-----------|
            1 |        43 |        71 |       140 |        73 |       327 |
              |     0.717 |     0.497 |     0.496 |     0.403 |           |
--------------|-----------|-----------|-----------|-----------|-----------|
            2 |        17 |        72 |       142 |       108 |       339 |
              |     0.283 |     0.503 |     0.504 |     0.597 |           |
--------------|-----------|-----------|-----------|-----------|-----------|
 Column Total |        60 |       143 |       282 |       181 |       666 |
              |     0.090 |     0.215 |     0.423 |     0.272 |           |
--------------|-----------|-----------|-----------|-----------|-----------|

Statistics for All Table Factors

Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 =  17.84538     d.f. =  3     p =  0.0004733554
```

In this case, I told R to exclude cases where URBN is 4 (other) because there were no cases.

All materials on this site are copyright © 2001, by Professor Timothy Shortell, except those retained by their original owner. No infringement is intended or implied. All rights reserved. Please let me know if you link to this site or use these materials.