An Introduction to Data Analysis & Presentation

Prof. Timothy Shortell, Sociology, Brooklyn College


The correlation coefficient is a useful way to describe the strength and direction of the relationship between the independent and dependent variables. The square of the correlation tells us the extent to which variation in the independent variable accounts for, or explains, variation in the dependent variable.

We can build on this knowledge by extending the notion of correlation into linear regression.

The Regression Line
Consider the scatterplot of number of artists and amount of grant funding for 125 U.S. cities:

The correlation coefficient is 0.7153.

We can plot a line on this scatterplot that represents the line of best fit for the data -- in other words, a line that summarizes the relationship between number of artists and amount of grant funding.

This is the regression line. We say that we have regressed grant funding on number of artists.

The regression line is a mathematical model of the relationship between the independent and dependent variables. It is defined as:

Y = α + βX + ε

The model represents the true relationship between these variables. We can only estimate it with our sample data. We write the estimated equation as:

Ŷ = a + bX

Each point on the line represents a predicted value of the dependent variable for some value of the independent variable.

In this example, the equation for the regression line is:

Ŷ = −15.54 + 170.04X

where grant funding is measured in thousands of dollars and number of artists in thousands.

This is the line that best fits the data. This means that the sum of the squared errors is at a minimum. An error, in this case, is the difference between the predicted value of the dependent variable and the actual value of the dependent variable for a given value of the independent variable. In other words:

error = Y − Ŷ

and the regression line minimizes Σ(Y − Ŷ)².

Let's see how all this fits together:
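A minimal sketch of how the line of best fit is computed. The data below are hypothetical (the 125-city dataset is not reproduced here); the formulas b = cov(x, y)/var(x) and a = ȳ − b·x̄ are the standard least-squares estimates:

```python
import numpy as np

# Hypothetical data for illustration -- the actual 125-city dataset
# is not reproduced here. Artists are in thousands; funding is in
# thousands of dollars.
artists = np.array([1.2, 3.5, 0.8, 5.1, 2.4, 4.0, 1.9, 3.1])
funding = np.array([180.0, 610.0, 95.0, 850.0, 410.0, 700.0, 290.0, 560.0])

# Least-squares estimates: b = cov(x, y) / var(x), a = mean(y) - b * mean(x)
b = np.cov(artists, funding, ddof=1)[0, 1] / np.var(artists, ddof=1)
a = funding.mean() - b * artists.mean()

# Each point on the line is a predicted value; an error (residual) is
# the gap between the actual and predicted values.
predicted = a + b * artists
errors = funding - predicted
sse = np.sum(errors ** 2)  # the quantity the line of best fit minimizes

print(f"a = {a:.2f}, b = {b:.2f}, SSE = {sse:.2f}")
```

Any other line through these points would produce a larger sum of squared errors than the (a, b) pair computed this way.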

Interpreting the Regression Statistics
Once we have specified our model, we test how well it fits the sample data. We use R-square to assess the goodness-of-fit.

The square of the correlation coefficient was defined as the amount of variation in the dependent variable accounted for by variation in the independent variable. This is also true of R-square, which is the square of the multiple correlation coefficient. (When there is only one independent variable, as in the case of bivariate regression, r-square and R-square are equal.)
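This equality can be checked numerically. The sketch below, again with hypothetical data, computes r-square directly from the correlation and R-square from the fitted regression as 1 − SSE/SST:

```python
import numpy as np

# Hypothetical data; the point is the identity, not the particular values.
artists = np.array([1.2, 3.5, 0.8, 5.1, 2.4, 4.0, 1.9, 3.1])
funding = np.array([180.0, 610.0, 95.0, 850.0, 410.0, 700.0, 290.0, 560.0])

# r-square: the squared Pearson correlation coefficient
r = np.corrcoef(artists, funding)[0, 1]

# R-square from the fitted regression line: 1 - SSE / SST
slope, intercept = np.polyfit(artists, funding, 1)
residuals = funding - (intercept + slope * artists)
r_square = 1 - np.sum(residuals ** 2) / np.sum((funding - funding.mean()) ** 2)

print(r ** 2, r_square)  # equal in the bivariate case
```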

In our example, R-square equals 0.5117. More than half of the variation in amount of grant funding is accounted for by variation in number of artists.

This is just what the correlation statistics tell us. What more can we learn from regression?

With regression, we get an estimate of the precise relationship between the independent variable and the dependent variable. The unstandardized regression coefficient, or B, indicates the amount of change in the dependent variable associated with a one unit change in the independent variable. In our example, B equals 170.04. This means that the amount of grant funding received by these cities increased by $170,040 for every additional 1,000 artists living in the city.

The constant term, or y-intercept, is interpreted as the predicted value of the dependent variable when the independent variable is zero. In our example, a equals −15.54. This means that a hypothetical city with no artists (oh my!) would receive −$15,540 in arts funding. Of course, it is not likely that a community would have zero artists, and it is impossible for a city to receive a negative amount of funding. The constant term is sometimes useful, though, as an indication of a minimum case.
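Putting the two coefficients together gives the prediction equation. Taking the intercept as −15.54 (consistent with the −$15,540 prediction for a city with zero artists) and the slope as 170.04, a short sketch:

```python
# Coefficients as reported in the example; the intercept is taken as
# negative, consistent with the -$15,540 prediction for a city with
# zero artists. Units: thousands of artists, thousands of dollars.
a = -15.54
b = 170.04

def predicted_funding(artists_thousands):
    """Predicted grant funding, in thousands of dollars."""
    return a + b * artists_thousands

print(predicted_funding(2.0))  # a city with 2,000 artists: about $324,540
print(predicted_funding(0.0))  # the "minimum case": -15.54, i.e. -$15,540
```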

The Significance Test
The regression model is estimated with the sample data. We want the model to describe the population. Once again, we need a significance test to aid in the generalization of our results.

The null hypothesis for a regression model is that the independent variable is unrelated to -- explains none of the variance in -- the dependent variable.

We use an F-test on the value of R-square to test the null hypothesis. If the significance of the F-score is less than or equal to 0.05, reject the null hypothesis. In our example, the F-score is 128.90, and its significance is less than 0.0001. We can reject the null hypothesis.
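The F-score can be recovered from R-square and the sample size. For a regression with k independent variables and n cases, F = (R²/k) / ((1 − R²)/(n − k − 1)). With R² = 0.5117, n = 125 cities, and k = 1:

```python
# F-test on R-square:
#   F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
r_square = 0.5117   # goodness of fit from the artists/funding example
n = 125             # number of U.S. cities
k = 1               # one independent variable (bivariate regression)

f_score = (r_square / k) / ((1 - r_square) / (n - k - 1))
print(round(f_score, 2))  # close to the 128.90 reported in the text
```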

Interpreting the Regression Results
The number of artists in the community is significantly related to the amount of grant funding. More than 50% of the variation in grant funding is explained by variation in number of artists. Amount of funding increases by about $170,000 for each additional 1,000 artists in the community.

An Example in SPSS
In this example, our dependent variable is life expectancy (in years). This is the variable we are trying to explain. Why do some nations have higher life expectancy and others have lower? It is important to remember the units of analysis. Our cases in this data are nations, not individuals. The causal factors that might explain why some individuals live longer are not the same as those that might explain life expectancy in nations. The independent variables are properties of nations, not individuals.

First, we check the F-test. Is the model statistically significant? The probability of the model under the null hypothesis is less than 0.05, so we reject the null hypothesis. The regression coefficient shows that for every $1 increase in income per capita, we would expect an increase of 0.001 years in life expectancy. Because an increment of $1 in income per capita is not very meaningful sociologically, we can express the relationship in terms of a larger increment, such as $1,000. In this case, we would multiply the coefficient by the same amount (1,000): 0.001 x 1,000 = 1. So for every $1,000 increase in income per capita we would expect an increase of 1 year in life expectancy.
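The rescaling is just multiplication of the coefficient by the chosen increment:

```python
# Re-expressing the regression coefficient per $1,000 instead of per $1:
b_per_dollar = 0.001   # years of life expectancy per $1 of income per capita
increment = 1000       # the sociologically meaningful unit chosen above
b_per_increment = b_per_dollar * increment

print(b_per_increment)  # about 1 year of life expectancy per $1,000
```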

The goodness of fit indicates that variation in income per capita accounts for 56% of the variation in life expectancy. This is a pretty good model considering that we have only one independent variable.

Regression Analysis in Practice
The real power of regression analysis is illustrated when the research question involves more than one independent variable. Real social research almost always includes multiple causal factors, so the simple regression model we've just seen would not work well. As we will see in the next lecture, multiple regression allows us to estimate the independent or unique effect of each of the independent variables in the model.

All materials on this site are copyright © 2001, by Professor Timothy Shortell, except those retained by their original owner. No infringement is intended or implied. All rights reserved. Please let me know if you link to this site or use these materials.