An Introduction to Data Analysis & Presentation
Prof. Timothy Shortell, Sociology, Brooklyn College

Multiple Linear Regression (OLS)

We know that the social world is complex. Our models of the phenomena we study should also be complex. Whenever we look at the effect of a single independent variable on the dependent variable, we run the risk of over-simplification. We saw the way in which a significant bivariate correlation could be due to the effect of a third variable; we sensed the difficulty in interpreting any bivariate relationship for just this reason. With multiple linear regression -- multiple regression, for short -- we have a way to specify a model that includes several independent variables.
The real explanatory power comes from the way in which the regression coefficients are interpreted. Multiple regression allows us to estimate the effect of an independent variable on the dependent variable while holding the other independent variables constant. Because there are multiple independent variables in the model, we can compare their effects. In this way, we can estimate which independent variable has the most explanatory power.
If an important variable is omitted, however, its effects can't be controlled for, and the result is a mis-specified model. Interpretation of the results is then misleading: some predictors might get credit for having a strong relationship with the dependent variable when, in fact, the omitted variable is the true causal factor.

At the same time, the researcher cannot overload the model with too many variables. The number of variables in the model affects the likelihood of rejecting the null hypothesis: more variables in the model means a larger F-score (and therefore a larger proportion of explained variance) is needed to achieve statistical significance. Too many unrelated variables in the model might lead the researcher to a type II error.

Another problem occurs when two or more of the predictors are themselves highly related. This is a condition called collinearity. When two predictors are very highly related -- as evidenced by a bivariate correlation above 0.9, or so -- it is mathematically impossible to precisely estimate the independent effects of each. The regression coefficients of each are therefore less reliable.

The researcher has the responsibility of specifying the appropriate model. This entails including all of the relevant predictors, but not too many, and no predictors that are too highly related. Independent variables should be included in the model based on a theoretical argument, rather than empirical evidence alone.
Let's specify a model to predict the amount of arts funding received by U.S. communities. We believe that the amount of arts funding will be a function of community size, number of artists, mean age in the community, income per capita, and percent minority population.
The probability associated with the F-score indicates how likely the sample results would be under the null hypothesis. If that probability is less than or equal to 0.05, reject the null hypothesis. R-square is interpreted as the proportion of variance explained by the model -- that is, the combined effects of all the predictors.
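The F-score and R-square are tied together arithmetically: with k predictors and n cases, F = (R^2/k) / ((1 - R^2)/(n - k - 1)). A small sketch of that relationship (the sample size n = 100 is an assumption for illustration, not a figure from the text):

```python
# How the F-score follows from R-square and the degrees of freedom.
# n = 100 is an assumed sample size; k matches a five-predictor model.
r_sq = 0.825  # proportion of explained variance
k = 5         # number of predictors
n = 100       # assumed number of cases

f_score = (r_sq / k) / ((1 - r_sq) / (n - k - 1))
print(f"F = {f_score:.2f}")

# With the same R-square but more predictors, F shrinks -- which is why
# overloading a model with variables makes significance harder to reach.
```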
Using data from the UN on nations, we calculate the model. We reject the null hypothesis. The predictors, taken together, explain 82.5% of the variation in life expectancy.

With more than one predictor, we need to check which predictors are significantly related to the dependent variable and which are not. The significant F-score indicates only that there is some sort of relationship between the predictors and the dependent variable; we need to determine what that relationship is. (Remember the case with ANOVA.) The effect of each predictor is subjected to a t-test to determine whether its coefficient is significantly different from zero. If the regression coefficient, B, is zero, that predictor is not directly related to the dependent variable. Compare the probability of each t-score to 0.05, and reject the null hypothesis (that B equals zero) when the probability is less than or equal to alpha. In our example, only one predictor has a significant unique effect: the percentage of the population living in cities.

We can interpret the value of the unstandardized coefficient, B, as the predicted change in the dependent variable for a one unit change in the independent variable. In this case, a one percentage point increase in the population living in cities predicts a change in life expectancy of 0.12 years. In many instances, a one unit change is not sociologically meaningful, so a larger unit change can be used to convey the effect. A ten percentage point increase in the population living in cities would thus lead to a 1.17 year increase in life expectancy. (Remember to multiply the coefficient -- unrounded, here about 0.117 -- by the same value as the increment.) The unstandardized coefficients (B) are always in the units of the dependent variable.
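A fit like this can be sketched with ordinary least squares via the normal equations. The data below are invented, not the UN data; they are constructed so that the coefficient on the first predictor is exactly 0.12, matching the interpretation above:

```python
# Minimal OLS by the normal equations, (X'X) b = X'y, solved with
# Gaussian elimination. Invented data: y = 50 + 0.12*x1 + 0.5*x2 exactly.

def ols(X, y):
    """X is a list of rows, each beginning with a 1.0 intercept term."""
    n, p = len(X), len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    for c in range(p):                      # forward elimination
        for r in range(c + 1, p):
            f = xtx[r][c] / xtx[c][c]
            for k in range(c, p):
                xtx[r][k] -= f * xtx[c][k]
            xty[r] -= f * xty[c]
    b = [0.0] * p
    for r in reversed(range(p)):            # back substitution
        b[r] = (xty[r] - sum(xtx[r][k] * b[k]
                             for k in range(r + 1, p))) / xtx[r][r]
    return b

x1 = [10, 20, 30, 40, 50]        # e.g. percent urban (invented)
x2 = [5, 9, 4, 7, 11]            # a second, unrelated predictor (invented)
y = [50 + 0.12 * a + 0.5 * b for a, b in zip(x1, x2)]

coef = ols([[1.0, a, b] for a, b in zip(x1, x2)], y)
print(f"B for x1 = {coef[1]:.2f}")                    # 0.12
print(f"ten-unit change predicts {coef[1] * 10:.1f} more years")
```

Because the invented data are exactly linear, the fit recovers the coefficients exactly; with real data, B would also carry a standard error and a t-test.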
This does not, however, tell us which predictor matters most: the unstandardized coefficients depend on the units of measurement, so they cannot be compared directly. We need a way to compare the contributions of the predictors that is independent of how they are measured, in order to assess which are most important. For this we use the standardized coefficients, beta, which express each effect in standard deviation units. In our example, only one predictor is independently related to life expectancy, so we don't need to examine the standardized coefficients.
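When standardized coefficients are needed, each B can be rescaled by the ratio of the predictor's standard deviation to the dependent variable's: beta = B * (sd_x / sd_y). A sketch with invented data (only the 0.12 coefficient comes from the text; the series below are hypothetical):

```python
# Converting an unstandardized B into a standardized beta, so that
# predictors measured in different units can be compared.
from statistics import stdev

b_urban = 0.12                        # years per percentage point (from text)
urban = [31, 45, 58, 66, 72, 80]      # invented percent-urban values
life_exp = [55, 61, 64, 70, 73, 78]   # invented life expectancies

beta = b_urban * stdev(urban) / stdev(life_exp)
print(f"beta = {beta:.2f}")   # effect in standard-deviation units
```

With real data the betas for all the predictors would be computed this way and compared directly, since they share a common scale.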
At the same time, you must avoid interpreting the independent effects of predictors that are not significant, as indicated by their t-tests.

All materials on this site are copyright © 2001, by Professor Timothy Shortell, except those retained by their original owner. No infringement is intended or implied. All rights reserved. Please let me know if you link to this site or use these materials.