Data Analysis & Interpretation 3.3: Testing a Multiple Regression Model

Week 3

This week’s assignment is to test a multiple regression model.

1) Discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (Beta coefficients and p-values) in your summary.

2) Report whether your results supported your hypothesis for the association between your primary explanatory variable and the response variable.

3) Discuss whether there was evidence of confounding for the association between your primary explanatory and response variable (Hint: adding additional explanatory variables to your model one at a time will make it easier to identify which of the variables are confounding variables).

4) Generate the following regression diagnostic plots:

a) q-q plot

b) standardized residuals for all observations

c) leverage plot

d) Write a few sentences describing what these plots tell you about your regression model in terms of the distribution of the residuals, model fit, influential observations, and outliers.

My Multiple Regression Model

The response variable for my multiple regression model is Crime Rate. The explanatory variables are Poverty, Unemployment, White, Black, and Hispanic, with Poverty as the primary explanatory variable. I will build the model in steps, first fitting a simple linear regression of Crime Rate on Poverty and then adding the other explanatory variables one at a time.

 

In [123]:
# create subset dataframe (copy so the column assignments below do not modify sub1)
sub12 = sub1.loc[:, ['Poverty', 'Metropolitan', 'Crime Rate']].copy()
sub12['Unemployment'] = df['Unemployment']
sub12['White'] = df['White']
sub12['Black'] = df['Black']
sub12['Hispanic'] = df['Hispanic']
sub12['Native'] = df['Native']
sub12['Asian'] = df['Asian']
sub12.rename(columns={'Crime Rate':'Crime_Rate'}, inplace=True)
sub12.head()
Out[123]:
   Poverty  Metropolitan  Crime_Rate  Unemployment  White  Black  Hispanic  Native  Asian
0     13.4             1      391.04           7.5   83.1    9.5       4.5     0.6    0.7
1     16.8             1      212.35           8.3   74.5   21.4       2.2     0.4    0.1
2     24.6             0      683.65          18.0   22.2   70.7       4.4     1.2    0.2
3     16.7             0      177.29           9.4   79.9   14.4       3.2     0.7    0.0
4     17.0             0      993.20           8.3   92.5    2.9       2.3     0.2    0.4

Model 1: Poverty vs. Crime Rate

In [124]:
# run first model
mod1 = smf.ols(formula='Crime_Rate ~ Poverty', data=sub12).fit()

# view model summary
print(mod1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Crime_Rate   R-squared:                       0.030
Model:                            OLS   Adj. R-squared:                  0.029
Method:                 Least Squares   F-statistic:                     63.45
Date:                Sun, 30 Sep 2018   Prob (F-statistic):           2.69e-15
Time:                        16:48:37   Log-Likelihood:                -16338.
No. Observations:                2083   AIC:                         3.268e+04
Df Residuals:                    2081   BIC:                         3.269e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    478.7761     38.657     12.385      0.000     402.966     554.587
Poverty       17.5892      2.208      7.965      0.000      13.259      21.920
==============================================================================
Omnibus:                      768.696   Durbin-Watson:                   1.431
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3693.747
Skew:                           1.701   Prob(JB):                         0.00
Kurtosis:                       8.567   Cond. No.                         50.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [125]:
# plot Poverty vs. Crime Rate with a linear fit (gray) and a quadratic fit (steel blue)
sns.regplot(x='Poverty', y='Crime_Rate', data=sub12, color='gray')
sns.regplot(x='Poverty', y='Crime_Rate', data=sub12, order=2, color='steelblue')
sns.despine()
plt.title('Poverty vs. Crime Rate', loc='left', fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('poverty_cr_mod1.png')

[Figure: poverty_cr_mod1.png, scatter plot of Poverty vs. Crime Rate with linear and quadratic fits]

In [126]:

# run model 1a
mod1a = smf.ols(formula='Crime_Rate ~ Poverty + I(Poverty ** 2)', data=sub12).fit()

# view model summary
print(mod1a.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Crime_Rate   R-squared:                       0.044
Model:                            OLS   Adj. R-squared:                  0.043
Method:                 Least Squares   F-statistic:                     47.40
Date:                Sun, 30 Sep 2018   Prob (F-statistic):           7.42e-21
Time:                        16:48:38   Log-Likelihood:                -16323.
No. Observations:                2083   AIC:                         3.265e+04
Df Residuals:                    2080   BIC:                         3.267e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          95.1528     79.412      1.198      0.231     -60.583     250.888
Poverty            64.3354      8.750      7.352      0.000      47.175      81.496
I(Poverty ** 2)    -1.2498      0.226     -5.518      0.000      -1.694      -0.806
==============================================================================
Omnibus:                      778.862   Durbin-Watson:                   1.438
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3799.106
Skew:                           1.721   Prob(JB):                         0.00
Kurtosis:                       8.651   Cond. No.                     2.30e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.3e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Model 1 Summary

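This model tests the linear and quadratic relationships between Poverty and Crime Rate. The linear model (mod1) shows a statistically significant positive association (Beta = 17.59, p < 0.001), although Poverty alone explains only about 3% of the variance in Crime Rate (R-squared = 0.030). The quadratic term added in mod1a is also statistically significant (Beta = -1.25, p < 0.001) and raises the R-squared value slightly, to 0.044. Its negative sign indicates that the positive association weakens as Poverty increases.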
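To make the shape of the quadratic fit more concrete, here is a minimal sketch that plugs the mod1a coefficients from the summary table into the fitted equation and locates the point where the curve stops rising. The function name `predicted_crime_rate` is hypothetical, and the coefficients are the rounded values printed above, so the result is approximate.

```python
# Coefficients copied from the mod1a summary table above (rounded values).
intercept = 95.1528
b_poverty = 64.3354       # linear term
b_poverty_sq = -1.2498    # quadratic term, I(Poverty ** 2)


def predicted_crime_rate(poverty):
    """Fitted Crime_Rate from the quadratic model at a given Poverty value."""
    return intercept + b_poverty * poverty + b_poverty_sq * poverty ** 2


# A quadratic a + b*x + c*x**2 with c < 0 reaches its maximum at x = -b / (2*c).
turning_point = -b_poverty / (2 * b_poverty_sq)

print(f"Turning point: Poverty = {turning_point:.2f}")
print(f"Fitted Crime_Rate at the turning point: {predicted_crime_rate(turning_point):.1f}")
```

Whether many observations actually lie beyond the turning point would need to be checked against the data, since the curve is only trustworthy within the observed range of Poverty.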

Model 2: Poverty and Unemployment vs. Crime Rate

In [127]:
# run second model
mod2 = smf.ols(formula='Crime_Rate ~ Poverty + Unemployment', data=sub12).fit()

# view model summary
print(mod2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Crime_Rate   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.050
Method:                 Least Squares   F-statistic:                     55.44
Date:                Sun, 30 Sep 2018   Prob (F-statistic):           3.48e-24
Time:                        16:48:38   Log-Likelihood:                -16315.
No. Observations:                2083   AIC:                         3.264e+04
Df Residuals:                    2080   BIC:                         3.265e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      407.1873     39.673     10.264      0.000     329.384     484.991
Poverty          5.6802      2.802      2.027      0.043       0.185      11.175
Unemployment    35.1265      5.176      6.787      0.000      24.976      45.277
==============================================================================
Omnibus:                      731.666   Durbin-Watson:                   1.452
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3355.959
Skew:                           1.625   Prob(JB):                         0.00
Kurtosis:                       8.301   Cond. No.                         57.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Model 2 Summary

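Poverty remains statistically significant after adding Unemployment to the model, but its coefficient drops from 17.59 to 5.68 and its p-value rises from less than 0.001 to 0.043. Unemployment is statistically significant (Beta = 35.13, p < 0.001). A drop of this size in the Poverty coefficient suggests that Unemployment confounds part of the association between Poverty and Crime Rate. The R-squared value increases slightly, from 0.030 to 0.051.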
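One way to quantify the shift in the Poverty coefficient is the percent change in the estimate between the two models. The sketch below uses the coefficients printed in the mod1 and mod2 tables. The helper `pct_change_in_estimate` is hypothetical, and the 10% cutoff is a commonly cited rule of thumb that I am assuming here; it does not come from the course material.

```python
def pct_change_in_estimate(crude, adjusted):
    """Percent change in a coefficient after adding a covariate to the model."""
    return 100.0 * (adjusted - crude) / crude


crude_beta = 17.5892      # Poverty coefficient in mod1 (Poverty only)
adjusted_beta = 5.6802    # Poverty coefficient in mod2 (Poverty + Unemployment)

change = pct_change_in_estimate(crude_beta, adjusted_beta)
threshold = 10.0          # assumed rule-of-thumb cutoff, not from the source

print(f"Poverty coefficient changed by {change:.1f}%")
if abs(change) > threshold:
    print("Change exceeds the cutoff: possible confounding by the added variable")
else:
    print("Change is within the cutoff: little evidence of confounding")
```

The same helper can be reused at each later step by swapping in the Poverty coefficients from consecutive summary tables.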

Model 3: Poverty, Unemployment, and White vs. Crime Rate

In [128]:
# run third model
mod3 = smf.ols(formula='Crime_Rate ~ Poverty + Unemployment + White', data=sub12).fit()

# view model summary
print(mod3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Crime_Rate   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.050
Method:                 Least Squares   F-statistic:                     37.47
Date:                Sun, 30 Sep 2018   Prob (F-statistic):           1.41e-23
Time:                        16:48:38   Log-Likelihood:                -16315.
No. Observations:                2083   AIC:                         3.264e+04
Df Residuals:                    2079   BIC:                         3.266e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      505.1653     89.535      5.642      0.000     329.578     680.752
Poverty          4.8883      2.876      1.700      0.089      -0.752      10.528
Unemployment    34.1167      5.241      6.510      0.000      23.839      44.394
White           -0.9756      0.799     -1.221      0.222      -2.543       0.592
==============================================================================
Omnibus:                      722.120   Durbin-Watson:                   1.465
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3253.651
Skew:                           1.608   Prob(JB):                         0.00
Kurtosis:                       8.210   Cond. No.                         557.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Model 3 Summary

After adding White to the model, Poverty is no longer statistically significant at the 0.05 level (Beta = 4.89, p = 0.089). White has a negative association with Crime Rate but is not statistically significant (Beta = -0.98, p = 0.222). Unemployment remains statistically significant (Beta = 34.12, p < 0.001). The R-squared value is unchanged at 0.051.

Model 4: Poverty, Unemployment, White, and Black vs. Crime Rate

In [129]:
mod4 = smf.ols(formula='Crime_Rate ~ Poverty + Unemployment + White + Black', data=sub12).fit()
print(mod4.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Crime_Rate   R-squared:                       0.072
Model:                            OLS   Adj. R-squared:                  0.071
Method:                 Least Squares   F-statistic:                     40.58
Date:                Sun, 30 Sep 2018   Prob (F-statistic):           8.74e-33
Time:                        16:48:38   Log-Likelihood:                -16291.
No. Observations:                2083   AIC:                         3.259e+04
Df Residuals:                    2078   BIC:                         3.262e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      263.2672     95.264      2.764      0.006      76.444     450.090
Poverty          3.6071      2.850      1.265      0.206      -1.983       9.197
Unemployment    28.2254      5.253      5.373      0.000      17.923      38.528
White            2.0303      0.903      2.248      0.025       0.259       3.801
Black            9.5103      1.381      6.886      0.000       6.802      12.219
==============================================================================
Omnibus:                      746.863   Durbin-Watson:                   1.537
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3709.946
Skew:                           1.633   Prob(JB):                         0.00
Kurtosis:                       8.664   Cond. No.                         600.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Model 4 Summary

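White becomes statistically significant after adding Black to the model, and its coefficient changes sign from negative to positive (Beta = 2.03, p = 0.025). Black is also statistically significant (Beta = 9.51, p < 0.001). Poverty remains non-significant (Beta = 3.61, p = 0.206). After adding Black to the model, the R-squared value increased from 0.051 to 0.072.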
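The White coefficient moves from -0.98 in Model 3 to 2.03 in Model 4. One plausible contributing factor, which is my assumption rather than something tested here, is that the race variables are shares of the same population and therefore strongly related to one another. The sketch below only checks this on the five rows shown by `sub12.head()` earlier, so it is an illustration, not evidence about the full dataset.

```python
# Race percentages for the five rows printed by sub12.head() above,
# in the order (White, Black, Hispanic, Native, Asian).
head_rows = [
    (83.1, 9.5, 4.5, 0.6, 0.7),
    (74.5, 21.4, 2.2, 0.4, 0.1),
    (22.2, 70.7, 4.4, 1.2, 0.2),
    (79.9, 14.4, 3.2, 0.7, 0.0),
    (92.5, 2.9, 2.3, 0.2, 0.4),
]

# If each row sums to nearly 100, a higher share for one group
# necessarily means a lower share for the others.
row_totals = [round(sum(row), 1) for row in head_rows]

for index, total in enumerate(row_totals):
    print(f"row {index}: race percentages sum to {total}")
```

When predictors are tied together like this, individual coefficients can shift or change sign as related variables enter the model, so each one is best read as conditional on the others.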

Model 5: Poverty, Unemployment, White, Black, and Hispanic vs. Crime Rate

In [130]:
mod5 = smf.ols(formula='Crime_Rate ~ Poverty + Unemployment + White + Black + Hispanic', data=sub12).fit()
print(mod5.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Crime_Rate   R-squared:                       0.075
Model:                            OLS   Adj. R-squared:                  0.073
Method:                 Least Squares   F-statistic:                     33.92
Date:                Sun, 30 Sep 2018   Prob (F-statistic):           2.12e-33
Time:                        16:48:38   Log-Likelihood:                -16288.
No. Observations:                2083   AIC:                         3.259e+04
Df Residuals:                    2077   BIC:                         3.262e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     -186.5603    196.903     -0.947      0.344    -572.708     199.588
Poverty          4.1228      2.853      1.445      0.149      -1.473       9.718
Unemployment    29.4766      5.268      5.595      0.000      19.146      39.808
White            6.4852      1.931      3.359      0.001       2.699      10.272
Black           13.6156      2.092      6.508      0.000       9.513      17.719
Hispanic         5.5918      2.143      2.609      0.009       1.389       9.795
==============================================================================
Omnibus:                      748.611   Durbin-Watson:                   1.539
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3692.334
Skew:                           1.640   Prob(JB):                         0.00
Kurtosis:                       8.638   Cond. No.                     1.25e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.25e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Model 5 Summary

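All three race/ethnicity variables are statistically significant: White (Beta = 6.49, p = 0.001), Black (Beta = 13.62, p < 0.001), and Hispanic (Beta = 5.59, p = 0.009). Unemployment also remains statistically significant (Beta = 29.48, p < 0.001). Poverty is still not statistically significant (Beta = 4.12, p = 0.149), so the final model does not support an association between Poverty and Crime Rate once the other variables are controlled for. The R-squared value increased slightly, from 0.072 to 0.075, by adding Hispanic, which means the final model explains only about 7.5% of the variance in Crime Rate. The condition number warning (1.25e+03) points to possible multicollinearity among the explanatory variables.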
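The R-squared changes reported at each step can be turned into a partial F statistic for the added variable. This is a rough sketch using only the rounded R-squared and Df Residuals values from the summary tables, so the results are approximate. The helper name `partial_f` is hypothetical.

```python
def partial_f(r2_reduced, r2_full, n_added, df_resid_full):
    """F statistic for the terms added when moving from the reduced to the full model."""
    numerator = (r2_full - r2_reduced) / n_added
    denominator = (1.0 - r2_full) / df_resid_full
    return numerator / denominator


# (step, R-squared before, R-squared after, Df Residuals of the larger model),
# all copied from the summary tables above.
steps = [
    ("mod2 -> mod3 (add White)", 0.051, 0.051, 2079),
    ("mod3 -> mod4 (add Black)", 0.051, 0.072, 2078),
    ("mod4 -> mod5 (add Hispanic)", 0.072, 0.075, 2077),
]

for label, r2_before, r2_after, df_resid in steps:
    f_stat = partial_f(r2_before, r2_after, 1, df_resid)
    print(f"{label}: partial F = {f_stat:.2f}")
```

With a single added variable, the partial F should be close to the square of that variable's t statistic in the larger model, which gives a quick consistency check against the tables. With the fitted models in memory, statsmodels results objects also offer `compare_f_test`, for example `mod5.compare_f_test(mod4)`, which works from the unrounded fits.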

Final Model QQPlot

In [131]:
# statsmodels.api exposes the q-q plot function as sm.qqplot
import statsmodels.api as sm

fig1 = sm.qqplot(mod5.resid, line='r')
sns.despine()
plt.title('Model Residuals', loc='left', fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('final_mod_residuals.png')
print(fig1)

[Figure: final_mod_residuals.png, q-q plot of the final model residuals]
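The q-q plot can be cross-checked against the Skew, Kurtosis, and Jarque-Bera values in the mod5 summary table. As a rough sketch, the Jarque-Bera statistic can be recomputed from the reported sample size, skewness, and kurtosis. The helper name `jarque_bera` is hypothetical, and because the inputs are rounded the result only approximately matches the reported 3692.334.

```python
def jarque_bera(n, skew, kurtosis):
    """Jarque-Bera statistic from sample size, skewness, and kurtosis (normal = 3)."""
    return n / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


# Values copied from the mod5 summary table above.
jb = jarque_bera(n=2083, skew=1.640, kurtosis=8.638)
reported_jb = 3692.334

print(f"JB recomputed from rounded inputs: {jb:.1f}")
print(f"JB reported by statsmodels:        {reported_jb:.1f}")
```

For normally distributed residuals the skewness would be near 0 and the kurtosis near 3, giving a statistic near 0. The positive skew and heavy tails reported for mod5 are what drive the value so high.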

Final Model Standardized Residuals

In [132]:
stdres = pd.DataFrame(mod5.resid_pearson)
fig2 = plt.plot(stdres, 'o', ls='None')
plt.axhline(y=0, color='red')
plt.axhline(y=2, color='black')
plt.axhline(y=-2, color='black')
plt.xlabel('Observation Number')
plt.ylabel('Standardized Residual')
sns.despine()
plt.title('Standardized Residuals', loc='left', fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('std_resid.png')
print(fig2)

[Figure: std_resid.png, standardized residuals by observation number with reference lines at 0, 2, and -2]

Leverage Plot

In [133]:
fig3, ax = plt.subplots(figsize=(12,10))
fig3 = sm.graphics.influence_plot(mod5, size=8, ax=ax)
plt.text(0,7, 'Extreme outliers w/\nlow leverage', color='red', fontsize=12)
plt.text(0.05,2, 'High leverage non-outliers', color='red', fontsize=12)
sns.despine()
plt.tight_layout()
plt.savefig('mod_leverage_plot.png')
print(fig3)

[Figure: mod_leverage_plot.png, influence plot of studentized residuals against leverage for the final model]

Model Summary

In the q-q plot, the model residuals deviate from the reference line at the lower and upper quantiles, suggesting that the residuals do not follow a normal distribution. From the standardized residuals plot we can see that a large portion of the residuals fall within 2 standard deviations of the mean. However, we also see several extreme outliers on the positive end of the distribution. This points to a weakness in the model fit and suggests that key explanatory variables may have been left out. Looking at the influence/leverage plot, we see several extreme outliers that have low leverage. Because their leverage is low, these outliers do not have much influence on the estimated regression coefficients.
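The distinction between an outlier and a high-leverage point can be illustrated with a toy example. The sketch below is not the analysis above: it uses made-up data and a single predictor so that leverage can be computed by hand in plain Python. The function `simple_ols_diagnostics` is hypothetical, and the leverage cutoff of 2 * (number of coefficients) / n is an assumed rule of thumb. The residuals here are internally studentized, which differs slightly from `resid_pearson` because each residual is also divided by sqrt(1 - leverage).

```python
import math


def simple_ols_diagnostics(x, y):
    """Return (leverage, studentized residuals) for the fit y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    # Leverage grows with the distance of x from its mean.
    leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    std_resid = [e / (s * math.sqrt(1.0 - h)) for e, h in zip(resid, leverage)]
    return leverage, std_resid


# Made-up data: y = 2x everywhere except index 4 (unusual y),
# while index 9 has an unusual x but sits on the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
y = [2, 4, 6, 8, 25, 12, 14, 16, 18, 60]

leverage, std_resid = simple_ols_diagnostics(x, y)
cutoff = 2 * 2 / len(x)  # assumed rule of thumb: 2 * (coefficients) / n

for i, (h, r) in enumerate(zip(leverage, std_resid)):
    flags = []
    if abs(r) > 2:
        flags.append("outlier")
    if h > cutoff:
        flags.append("high leverage")
    print(f"obs {i}: leverage={h:.3f}, std resid={r:+.2f} {', '.join(flags)}")
```

In this toy data the point with the unusual y value is flagged as an outlier but has low leverage, while the point with the unusual x value has high leverage but a small residual, mirroring the two groups annotated in the influence plot.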

This was an exercise in fitting a multiple regression model. We’ve seen how to build the model step by step, how additional explanatory variables can reveal confounding of the association between the primary explanatory variable and the response variable, and how to assess the model using regression diagnostic plots.

 
