Week 3
This week’s assignment is to test a multiple regression model.
1) Discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (Beta coefficients and p-values) in your summary.
2) Report whether your results supported your hypothesis for the association between your primary explanatory variable and the response variable.
3) Discuss whether there was evidence of confounding for the association between your primary explanatory and response variable (Hint: adding additional explanatory variables to your model one at a time will make it easier to identify which of the variables are confounding variables).
4) Generate the following regression diagnostic plots:
a) q-q plot
b) standardized residuals for all observations
c) leverage plot
d) Write a few sentences describing what these plots tell you about your regression model in terms of the distribution of the residuals, model fit, influential observations, and outliers.
My Multiple Regression Model
The response variable for my multiple regression model is Crime Rate. The explanatory variables are Poverty, Unemployment, White, Black, and Hispanic. I will build the model in steps, starting with a simple linear regression of Crime Rate on Poverty and then adding the remaining explanatory variables one at a time.
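# imports used by the code in this post
# (sub1 and df are assumed to be already loaded)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
# create subset dataframe; select columns with a list and copy
# so that the column assignments below do not modify a view of sub1
sub12 = sub1.loc[:, ['Poverty', 'Metropolitan', 'Crime Rate']].copy()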
sub12['Unemployment'] = df['Unemployment']
sub12['White'] = df['White']
sub12['Black'] = df['Black']
sub12['Hispanic'] = df['Hispanic']
sub12['Native'] = df['Native']
sub12['Asian'] = df['Asian']
sub12.rename(columns={'Crime Rate':'Crime_Rate'}, inplace=True)
sub12.head()
Model 1: Poverty vs. Crime Rate
# run first model
mod1 = smf.ols(formula='Crime_Rate ~ Poverty', data=sub12).fit()
# view model summary
print(mod1.summary())
# plot correlation between Poverty and Crime Rate
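# linear fit (gray) and quadratic fit (blue); recent seaborn versions require x and y as keywords
sns.regplot(x='Poverty', y='Crime_Rate', data=sub12, color='gray')
sns.regplot(x='Poverty', y='Crime_Rate', data=sub12, order=2, color='steelblue')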
sns.despine()
plt.title('Poverty vs. Crime Rate', loc='left', fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('poverty_cr_mod1.png')
# run model 1a
mod1a = smf.ols(formula='Crime_Rate ~ Poverty + I(Poverty ** 2)', data=sub12).fit()
# view model summary
print(mod1a.summary())
Model 1 Summary
These models test the linear (Model 1) and quadratic (Model 1a) relationships between Poverty and Crime Rate. Poverty has a statistically significant relationship with Crime Rate. Adding the quadratic Poverty term to the model slightly improved the R-squared value.
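R-squared never decreases when a term is added to a model, so a slight increase on its own says little. One way to check whether the extra term earns its place is to compare adjusted R-squared as well. The following is a minimal sketch on synthetic stand-in data, not the dataset used in this post; the helper `fit_r2` and the simulated `poverty` and `crime` arrays are hypothetical and exist only for illustration.

```python
import numpy as np

def fit_r2(X, y):
    """Fit OLS with an intercept; return (R-squared, adjusted R-squared)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])           # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    resid = y - A @ beta
    ss_res = float(resid @ resid)                  # residual sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    k = A.shape[1] - 1                             # number of predictors
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

# Synthetic stand-in data (assumption: NOT the data analysed in this post)
rng = np.random.default_rng(0)
poverty = rng.uniform(0, 40, size=500)
crime = 50 + 8 * poverty - 0.1 * poverty**2 + rng.normal(0, 40, size=500)

r2_lin, adj_lin = fit_r2(poverty.reshape(-1, 1), crime)
r2_quad, adj_quad = fit_r2(np.column_stack([poverty, poverty**2]), crime)

print(f"linear:    R2={r2_lin:.3f}  adj R2={adj_lin:.3f}")
print(f"quadratic: R2={r2_quad:.3f}  adj R2={adj_quad:.3f}")
```

If adjusted R-squared also rises, the quadratic term is explaining more than its added complexity costs. With the fitted statsmodels results, the same numbers are available as the `rsquared` and `rsquared_adj` attributes.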
Model 2: Poverty and Unemployment vs. Crime Rate
# run second model
mod2 = smf.ols(formula='Crime_Rate ~ Poverty + Unemployment', data=sub12).fit()
# view model summary
print(mod2.summary())
Model 2 Summary
Poverty remains statistically significant, but its p-value increased from less than 0.001 to 0.043 after adding Unemployment to the model, which may indicate that Unemployment partly confounds the association between Poverty and Crime Rate. Unemployment is statistically significant with a p-value less than 0.001. The R-squared value increased slightly after adding Unemployment.
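A change in p-value is only indirect evidence of confounding. A more direct check is to compare the coefficient of the primary explanatory variable before and after adding the candidate confounder. The sketch below does this on synthetic stand-in data in which unemployment is constructed to drive both poverty and crime. The data, the helper `ols_coefs`, and the 10% threshold (a common rule of thumb, not something taken from the assignment) are all assumptions for illustration.

```python
import numpy as np

def ols_coefs(X, y):
    """Return OLS coefficients (intercept first) via least squares."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Synthetic stand-in data (assumption: NOT the data analysed in this post).
# Unemployment influences both poverty and crime, so it acts as a confounder.
rng = np.random.default_rng(1)
n = 1000
unemployment = rng.normal(8, 2, size=n)
poverty = 2.0 * unemployment + rng.normal(0, 2, size=n)
crime = 100 + 1.0 * poverty + 10.0 * unemployment + rng.normal(0, 10, size=n)

# Poverty coefficient without and with the candidate confounder
b_crude = ols_coefs(poverty.reshape(-1, 1), crime)[1]
b_adj = ols_coefs(np.column_stack([poverty, unemployment]), crime)[1]

pct_change = 100 * (b_crude - b_adj) / b_crude
print(f"crude beta={b_crude:.2f}, adjusted beta={b_adj:.2f}, "
      f"change={pct_change:.0f}%")
if abs(pct_change) > 10:  # rule-of-thumb threshold, an assumption
    print("Coefficient shifted noticeably: possible confounding")
```

With the statsmodels results in this post, the equivalent comparison would use `mod1.params['Poverty']` and `mod2.params['Poverty']`.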
Model 3: Poverty, Unemployment, and White vs. Crime Rate
# run third model
mod3 = smf.ols(formula='Crime_Rate ~ Poverty + Unemployment + White', data=sub12).fit()
# view model summary
print(mod3.summary())
Model 3 Summary
After adding White to the model, Poverty is no longer statistically significant (p-value of 0.089). White has a negative relationship with Crime Rate but is also not statistically significant (p-value of 0.222). The R-squared value did not increase at all.
Model 4: Poverty, Unemployment, White, and Black vs. Crime Rate
mod4 = smf.ols(formula='Crime_Rate ~ Poverty + Unemployment + White + Black', data=sub12).fit()
print(mod4.summary())
Model 4 Summary
White became statistically significant after adding Black to the model. Black is also statistically significant. After adding Black to the model, the R-squared value increased from 0.051 to 0.072.
Model 5: Poverty, Unemployment, White, Black, and Hispanic vs. Crime Rate
mod5 = smf.ols(formula='Crime_Rate ~ Poverty + Unemployment + White + Black + Hispanic', data=sub12).fit()
print(mod5.summary())
Model 5 Summary
All of the ethnicity variables are statistically significant. Poverty is still not statistically significant. Adding Hispanic increased the R-squared value slightly, from 0.072 to 0.075.
Final Model Q-Q Plot
# import statsmodels.api, which provides qqplot
import statsmodels.api as sm
fig1 = sm.qqplot(mod5.resid, line='r')
sns.despine()
plt.title('Model Residuals', loc='left', fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('final_mod_residuals.png')
print(fig1)
Final Model Standardized Residuals
stdres = pd.DataFrame(mod5.resid_pearson)
fig2 = plt.plot(stdres, 'o', ls='None')
plt.axhline(y=0, color='red')
plt.axhline(y=2, color='black')
plt.axhline(y=-2, color='black')
plt.xlabel('Observation Number')
plt.ylabel('Standardized Residual')
sns.despine()
plt.title('Standardized Residuals', loc='left', fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('std_resid.png')
print(fig2)
Leverage Plot
fig3, ax = plt.subplots(figsize=(12,10))
fig3 = sm.graphics.influence_plot(mod5, size=8, ax=ax)
plt.text(0,7, 'Extreme outliers w/\nlow leverage', color='red', fontsize=12)
plt.text(0.05,2, 'High leverage non-outliers', color='red', fontsize=12)
sns.despine()
plt.tight_layout()
plt.savefig('mod_leverage_plot.png')
print(fig3)
Model Summary
The residuals deviate from the reference line of the Q-Q plot at the lower and upper quantiles, suggesting that they do not follow a normal distribution. The standardized residuals plot shows that most residuals fall within 2 standard deviations of zero. However, several extreme outliers exist on the positive side. This points to a weakness in the model, consistent with its low R-squared of 0.075, and suggests that key explanatory variables may be missing. The influence/leverage plot shows several extreme outliers with low leverage, along with some high-leverage observations that are not outliers. Low-leverage outliers have little influence on the estimated regression coefficients.
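The quantities behind these plots can also be computed directly, which makes it possible to count flagged observations rather than judge them by eye. The sketch below computes leverage, internally studentized residuals, and Cook's distance on synthetic stand-in data with one planted low-leverage outlier and one planted high-leverage point. The helper `influence_measures`, the data, and the cutoffs (|residual| > 2 and leverage > 2p/n, both common rules of thumb) are assumptions for illustration, not the method used for the plots above.

```python
import numpy as np

def influence_measures(X, y):
    """Return (leverage, studentized residuals, Cook's distance) for OLS."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])       # design matrix with intercept
    p = A.shape[1]                             # parameters incl. intercept
    Q, _ = np.linalg.qr(A)                     # reduced QR of the design matrix
    leverage = (Q ** 2).sum(axis=1)            # diagonal of the hat matrix
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = float(resid @ resid) / (n - p)    # residual variance estimate
    student = resid / np.sqrt(sigma2 * (1.0 - leverage))
    cooks = student**2 * leverage / (p * (1.0 - leverage))
    return leverage, student, cooks

# Synthetic stand-in data (assumption: NOT the data analysed in this post)
rng = np.random.default_rng(2)
n = 200
X = rng.normal(0, 1, size=(n, 2))
X[0] = [0.0, 0.0]                              # typical x values: low leverage
X[1] = [8.0, 8.0]                              # extreme x values: high leverage
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(0, 1, size=n)
y[0] += 15                                     # planted outlier in the response

leverage, student, cooks = influence_measures(X, y)
p = X.shape[1] + 1
flag_outlier = np.abs(student) > 2             # rule-of-thumb cutoff
flag_leverage = leverage > 2 * p / n           # rule-of-thumb cutoff

print("outliers flagged:     ", int(flag_outlier.sum()))
print("high-leverage flagged:", int(flag_leverage.sum()))
print("largest Cook's distance at index", int(np.argmax(cooks)))
```

An observation matters most when it is flagged on both counts, because a large residual combined with high leverage is what moves the fitted coefficients.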
This was an exercise in multiple regression. We’ve seen how to fit a multiple regression model, how explanatory variables may act as confounders of other explanatory variables, and how to assess the model using regression diagnostic plots.