An Introduction to Data Analysis & Presentation
Prof. Timothy Shortell, Sociology, Brooklyn College

Regression

The correlation coefficient is a useful way to describe the strength and direction of the relationship between the independent and dependent variables. The square of the correlation tells us the extent to which variation in the independent variable accounts for, or explains, variation in the dependent variable. We can build on this knowledge by extending the notion of correlation into linear regression.
The correlation coefficient is 0.7153.
We can plot a line on this scatterplot that represents the line of best fit for the data -- in other words, a line that summarizes the relationship between number of artists and amount of grant funding. This is the regression line. We say that we have regressed grant funding on number of artists.
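The regression line is a mathematical model of the relationship between the independent and dependent variables. It is defined as:

$$Y = \alpha + \beta X + \varepsilon$$

The model represents the true relationship between these variables in the population. We can only estimate it with our sample data. We write the estimated equation as:

$$\hat{Y} = a + BX$$

Here a is the constant term (the y-intercept) and B is the unstandardized regression coefficient (the slope). Each point on the line represents a predicted value of the dependent variable for some value of the independent variable.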
In this example, the equation for the regression line is:

$$\hat{Y} = -15.54 + 170.04X$$

This is the line that best fits the data. This means that the sum of the squared errors is at a minimum. An error, in this case, is the difference between the actual value of the dependent variable and the predicted value of the dependent variable for a given value of the independent variable. In other words:

$$e = Y - \hat{Y}$$
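The least-squares idea can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce the results above: the data points are made up, and the function names `fit_line` and `sum_squared_errors` are hypothetical helpers introduced only for this example.

```python
def fit_line(xs, ys):
    """Return (a, b): the intercept and slope that minimize squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariation of x and y divided by variation in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # The fitted line always passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


def sum_squared_errors(xs, ys, a, b):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
errors = [y - (a + b * x) for x, y in zip(xs, ys)]
sse_best = sum_squared_errors(xs, ys, a, b)

# Any other line gives a larger sum of squared errors.
sse_shifted = sum_squared_errors(xs, ys, a + 0.5, b)
sse_tilted = sum_squared_errors(xs, ys, a, b + 0.1)

print("intercept:", round(a, 4), "slope:", round(b, 4))
print("SSE at best fit:", round(sse_best, 4))
print("SSE for shifted line:", round(sse_shifted, 4))
print("SSE for tilted line:", round(sse_tilted, 4))
```

Moving the intercept or the slope away from the fitted values raises the sum of squared errors, which is what "best fit" means here.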
Let's see how all this fits together:
The square of the correlation coefficient was defined as the amount of variation in the dependent variable accounted for by variation in the independent variable. This is also true of R-square, which is the square of the multiple correlation coefficient. (When there is only one independent variable, as in the case of bivariate regression, r-square and R-square are equal.) In our example, R-square equals 0.5117. More than half of the variation in amount of grant funding is accounted for by variation in number of artists. This is just what the correlation statistics tell us.

What more can we learn from regression? With regression, we get an estimate of the precise relationship between the independent variable and the dependent variable. The unstandardized regression coefficient, or B, indicates the amount of change in the dependent variable associated with a one-unit change in the independent variable. In our example, B equals 170.04. With funding measured in thousands of dollars and artists in thousands, this means that predicted grant funding increases by $170,040 for every additional 1,000 artists living in the city.

The constant term, or y-intercept, is interpreted as the predicted value of the dependent variable when the independent variable is zero. In our example, a equals -15.54. This means that the hypothetical city with no artists (oh my!) would receive -$15,540 in arts funding. Of course, it is not likely that a community would have zero artists, and it is impossible for a city to receive a negative amount of funding. The constant term is sometimes useful, though, as an indication of a minimum case.
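The interpretation of the coefficients can be checked with a short Python sketch. It uses the values quoted in the text and rests on two assumptions: the intercept is taken as -15.54, in line with the -$15,540 prediction for a city with no artists, and both variables are assumed to be measured in thousands (thousands of dollars, thousands of artists). The function name `predict_funding` is hypothetical.

```python
# Values quoted in the text. Units are an assumption inferred from the
# interpretation given there: thousands of dollars and thousands of artists.
INTERCEPT = -15.54   # a: predicted funding when there are no artists
SLOPE = 170.04       # B: change in funding per additional 1,000 artists
R = 0.7153           # correlation coefficient


def predict_funding(artists_thousands):
    """Predicted grant funding (in $1,000s) from the fitted line."""
    return INTERCEPT + SLOPE * artists_thousands


# With one independent variable, R-square is the square of r.
r_squared = R ** 2

funding_zero = predict_funding(0)    # the constant term
funding_two = predict_funding(2)     # a city with 2,000 artists
change_per_unit = predict_funding(3) - predict_funding(2)  # equals B

print("R-square:", round(r_squared, 4))
print("Predicted funding, no artists ($1,000s):", round(funding_zero, 2))
print("Predicted funding, 2,000 artists ($1,000s):", round(funding_two, 2))
print("Change per additional 1,000 artists ($1,000s):", round(change_per_unit, 2))
```

The difference between the predictions for any two cities that are 1,000 artists apart is always B, which is why B is read as the change associated with a one-unit change in the independent variable.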
The null hypothesis for a regression model is that the independent variable is unrelated to -- explains none of the variance in -- the dependent variable. We use an F-test on the value of R-square to test the null hypothesis. If the significance of the F-score is less than or equal to 0.05, we reject the null hypothesis. In our example, the F-score is 128.90, and its significance rounds to 0.0000, far below 0.05. We can reject the null hypothesis.
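The link between R-square and the F-score can be sketched in Python. The formula below is the standard F-ratio for a regression with k independent variables and n cases. The sample size is not stated in the text, so `N_CITIES = 125` is a hypothetical value, chosen only because it roughly reproduces the reported F-score. The function name `f_statistic` is likewise introduced for illustration.

```python
def f_statistic(r_squared, n, k=1):
    """F-ratio for testing R-square with n cases and k independent variables."""
    df_model = k            # degrees of freedom for the model
    df_error = n - k - 1    # degrees of freedom for the errors
    explained = r_squared / df_model
    unexplained = (1 - r_squared) / df_error
    return explained / unexplained


R_SQUARED = 0.5117   # from the text
N_CITIES = 125       # ASSUMPTION: sample size is not given in the text

f_value = f_statistic(R_SQUARED, N_CITIES)
print("F-score:", round(f_value, 2))
```

The F-score grows when R-square is larger or when there are more cases, so a modest R-square can still be statistically significant in a large sample.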
First, we check the F-test. Is the model statistically significant? The probability of obtaining this result under the null hypothesis is less than 0.05, so we reject the null hypothesis. The regression coefficient shows that for every $1 increase in income per capita, we would expect an increase of 0.001 years in life expectancy. Because an increment of $1 in income per capita is not very meaningful sociologically, we can express the relationship in terms of a larger increment, such as $1,000. In this case, we would multiply the coefficient by the same amount (1,000): 0.001 x 1,000 = 1. So for every $1,000 increase in income per capita we would expect an increase of 1 year in life expectancy. The goodness of fit indicates that variation in income per capita accounts for 56% of the variation in life expectancy. This is a pretty good model considering that we have only one independent variable.
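The rescaling step is simple arithmetic, shown here as a short Python sketch. The coefficient 0.001 comes from the text. The function name `rescale_coefficient` and the $5,000 comparison are hypothetical additions for illustration.

```python
def rescale_coefficient(coef_per_unit, increment):
    """Expected change in the dependent variable for a larger increment."""
    return coef_per_unit * increment


B_PER_DOLLAR = 0.001   # years of life expectancy per $1 of income per capita

# Express the same relationship per $1,000 of income per capita.
years_per_1000 = rescale_coefficient(B_PER_DOLLAR, 1000)

# Hypothetical comparison: two countries whose incomes differ by $5,000.
years_per_5000 = rescale_coefficient(B_PER_DOLLAR, 5000)

print("Years per $1,000:", round(years_per_1000, 3))
print("Years per $5,000:", round(years_per_5000, 3))
```

Rescaling changes only the units in which the coefficient is reported. The fitted line, the F-test, and the goodness of fit are unaffected.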
All materials on this site are copyright © 2001, by Professor Timothy Shortell, except those retained by their original owner. No infringement is intended or implied. All rights reserved. Please let me know if you link to this site or use these materials.