Data Analysis & Interpretation 3.4: Testing a Logistic Regression Model

Week 4

This week’s assignment is to test a logistic regression model.

Summarize in a few sentences 1) what you found, making sure you discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (odds ratios, p-values, and 95% confidence intervals for the odds ratios) in your summary. 2) Report whether or not your results supported your hypothesis for the association between your primary explanatory variable and your response variable. 3) Discuss whether or not there was evidence of confounding for the association between your primary explanatory and the response variable.

My Logistic Regression Model

The response variable, which is a binary categorical variable, is Metropolitan. Currently, 1=Metropolitan and 0=Non-Metropolitan. However, my hypothesis, positively stated, is that higher percentages of poverty are associated with non-metropolitan counties. In other words, a non-metropolitan county is more likely to have higher poverty. So I will recode Metropolitan so that 1=Non-metropolitan and 0=Metropolitan. My primary explanatory variable is Poverty, followed by Crime Rate and Region (0=South, 1=Northeast, 2=Midwest, 3=West).

Data Preparation

In [134]:
# create subset
sub13 = sub1.loc[:,('Metropolitan','Poverty','Crime Rate','Region')]

# remove space from Crime Rate
sub13.rename(columns={'Crime Rate':'Crime_Rate'}, inplace=True)

sub13.head()
Out[134]:
Metropolitan Poverty Crime_Rate Region
0 1 13.4 391.04 South
1 1 16.8 212.35 South
2 0 24.6 683.65 South
3 0 16.7 177.29 South
4 0 17.0 993.20 South
In [135]:
# Recode Metropolitan
def recode_metro(row):
    if row['Metropolitan'] == 1:
        return 0
    else:
        return 1

sub13['Metropolitan'] = sub13.apply(lambda row: recode_metro(row), axis=1)

sub13.head()
Out[135]:
Metropolitan Poverty Crime_Rate Region
0 0 13.4 391.04 South
1 0 16.8 212.35 South
2 1 24.6 683.65 South
3 1 16.7 177.29 South
4 1 17.0 993.20 South
In [136]:
# add column that recodes Region
def code_region(row):
    if row['Region'] == 'South':
        return 0
    if row['Region'] == 'Northeast':
        return 1
    if row['Region'] == 'Midwest':
        return 2
    else:
        return 3
    
sub13['Region_Code'] = sub13.apply(lambda row: code_region(row), axis=1)

# check
print(sub13[sub13['Region']=='South'].head())
print(sub13[sub13['Region']=='Northeast'].head())
print(sub13[sub13['Region']=='Midwest'].head())
print(sub13[sub13['Region']=='West'].head())
   Metropolitan  Poverty  Crime_Rate Region  Region_Code
0             0     13.4      391.04  South            0
1             0     16.8      212.35  South            0
2             1     24.6      683.65  South            0
3             1     16.7      177.29  South            0
4             1     17.0      993.20  South            0
     Metropolitan  Poverty  Crime_Rate     Region  Region_Code
650             0     15.7      141.54  Northeast            1
651             1     18.5       81.42  Northeast            1
652             0     11.6      200.62  Northeast            1
653             1     14.8      180.91  Northeast            1
654             1     13.5      259.80  Northeast            1
     Metropolitan  Poverty  Crime_Rate   Region  Region_Code
340             1     10.9      291.58  Midwest            2
341             1     12.6      261.29  Midwest            2
342             0     22.8      293.54  Midwest            2
343             1     13.8      404.95  Midwest            2
344             0     17.1       21.56  Midwest            2
    Metropolitan  Poverty  Crime_Rate Region  Region_Code
21             1     36.6      201.04   West            3
22             0     17.9      699.60   West            3
23             0     22.7      427.21   West            3
24             1     22.7     1277.15   West            3
25             1     22.6      465.15   West            3

First Model

In [137]:
# run first model summary
lreg1 = smf.logit(formula='Metropolitan ~ Poverty', data=sub13).fit()
print(lreg1.summary2())
Optimization terminated successfully.
         Current function value: 0.636443
         Iterations 5
                         Results: Logit
================================================================
Model:              Logit            No. Iterations:   5.0000   
Dependent Variable: Metropolitan     Pseudo R-squared: 0.033    
Date:               2018-10-01 22:23 AIC:              2655.4204
No. Observations:   2083             BIC:              2666.7036
Df Model:           1                Log-Likelihood:   -1325.7  
Df Residuals:       2081             LL-Null:          -1370.6  
Converged:          1.0000           Scale:            1.0000   
-----------------------------------------------------------------
              Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
-----------------------------------------------------------------
Intercept    -0.6639    0.1381  -4.8077  0.0000  -0.9346  -0.3933
Poverty       0.0752    0.0083   9.0218  0.0000   0.0589   0.0916
================================================================

In [138]:
# calculate odds ratio of first model
lreg1_OR = np.exp(lreg1.params)
print(lreg1_OR)
Intercept    0.514819
Poverty      1.078153
dtype: float64
In [139]:
# calculate upper and lower confidence intervals of first model
params1 = lreg1.params
conf1 = lreg1.conf_int()
conf1['OR'] = params1
conf1.columns = ['Lower CI','Upper CI','OR']
print(np.exp(conf1))
           Lower CI  Upper CI        OR
Intercept  0.392740  0.674847  0.514819
Poverty    1.060671  1.095924  1.078153

Summary of First Model

The regression of Poverty is statistically significant. Counties in my sample are 1.08 times more likely to be non-metropolitan as Poverty increases.

Second Model

In [140]:
# run second model summary
lreg2 = smf.logit(formula='Metropolitan ~ Poverty + Crime_Rate', data=sub13).fit()
print(lreg2.summary2())
Optimization terminated successfully.
         Current function value: 0.636281
         Iterations 5
                         Results: Logit
================================================================
Model:              Logit            No. Iterations:   5.0000   
Dependent Variable: Metropolitan     Pseudo R-squared: 0.033    
Date:               2018-10-01 22:23 AIC:              2656.7447
No. Observations:   2083             BIC:              2673.6694
Df Model:           2                Log-Likelihood:   -1325.4  
Df Residuals:       2080             LL-Null:          -1370.6  
Converged:          1.0000           Scale:            1.0000   
-----------------------------------------------------------------
              Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
-----------------------------------------------------------------
Intercept    -0.6382    0.1417  -4.5030  0.0000  -0.9160  -0.3604
Poverty       0.0766    0.0085   8.9822  0.0000   0.0599   0.0933
Crime_Rate   -0.0001    0.0001  -0.8237  0.4101  -0.0002   0.0001
================================================================

In [141]:
# calculate odds ratio of second model
lreg2_OR = np.exp(lreg2.params)
print(lreg2_OR)
Intercept     0.528252
Poverty       1.079641
Crime_Rate    0.999938
dtype: float64
In [142]:
# calculate upper and lower confidence intervals of second model
params2 = lreg2.params
conf2 = lreg2.conf_int()
conf2['OR'] = params2
conf2.columns = ['Lower CI','Upper CI','OR']
print(np.exp(conf2))
            Lower CI  Upper CI        OR
Intercept   0.400135  0.697391  0.528252
Poverty     1.061739  1.097845  1.079641
Crime_Rate  0.999789  1.000086  0.999938

Summary of Second Model

Crime Rate is not statistically significant in the regression model.

Third Model

In [143]:
# run third model summary; South = treatment reference
lreg3 = smf.logit(formula='Metropolitan ~ Poverty + Crime_Rate + C(Region_Code)', data=sub13).fit()
print(lreg3.summary2())
Optimization terminated successfully.
         Current function value: 0.610769
         Iterations 5
                          Results: Logit
===================================================================
Model:               Logit             No. Iterations:    5.0000   
Dependent Variable:  Metropolitan      Pseudo R-squared:  0.072    
Date:                2018-10-01 22:23  AIC:               2556.4632
No. Observations:    2083              BIC:               2590.3126
Df Model:            5                 Log-Likelihood:    -1272.2  
Df Residuals:        2077              LL-Null:           -1370.6  
Converged:           1.0000            Scale:             1.0000   
-------------------------------------------------------------------
                     Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-------------------------------------------------------------------
Intercept           -1.5863   0.1991 -7.9663 0.0000 -1.9766 -1.1960
C(Region_Code)[T.1] -0.1641   0.2088 -0.7859 0.4319 -0.5733  0.2452
C(Region_Code)[T.2]  1.1367   0.1254  9.0640 0.0000  0.8909  1.3825
C(Region_Code)[T.3]  0.4128   0.1435  2.8773 0.0040  0.1316  0.6940
Poverty              0.1054   0.0097 10.8517 0.0000  0.0864  0.1244
Crime_Rate           0.0000   0.0001  0.4400 0.6599 -0.0001  0.0002
===================================================================

In [144]:
# calculate odds ratio of third model
lreg3_OR = np.exp(lreg3.params)
print(lreg3_OR)
Intercept              0.204685
C(Region_Code)[T.1]    0.848659
C(Region_Code)[T.2]    3.116600
C(Region_Code)[T.3]    1.511065
Poverty                1.111149
Crime_Rate             1.000036
dtype: float64
In [145]:
# calculate upper and lower confidence intervals of third model
params3 = lreg3.params
conf3 = lreg3.conf_int()
conf3['OR'] = params3
conf3.columns = ['Lower CI','Upper CI','OR']
print(np.exp(conf3))
                     Lower CI  Upper CI        OR
Intercept            0.138545  0.302398  0.204685
C(Region_Code)[T.1]  0.563635  1.277815  0.848659
C(Region_Code)[T.2]  2.437416  3.985037  3.116600
C(Region_Code)[T.3]  1.140670  2.001733  1.511065
Poverty              1.090197  1.132503  1.111149
Crime_Rate           0.999875  1.000197  1.000036

Summary of Third Model

Northeast (T.1) is not statistically significant; however, Midwest (T.2) and West (T.3) are. Since the Midwest and West are statistically significant in comparison to South, we can conclude that these regions have significantly more non-metropolitan counties than the South. Counties in the Midwest are 3.12 times more likely to be non-metropolitan after controlling for the other variables. Counties in the West are 1.5 times more likely to be non-metropolitan after controlling for the other variables. The confidence intervals of Midwest and West do not overlap, so we conclude that Midwest counties have a stronger association with non-metropolitan counties. Counties are 1.11 times more likely to be non-metropolitan as Poverty increases after controlling for the other variables.

Fourth Model

In [146]:
# run fourth model summary; Northeast = treatment reference
lreg4 = smf.logit(formula='Metropolitan ~ Poverty + Crime_Rate + C(Region_Code, Treatment(reference=1))', data=sub13).fit()
print(lreg4.summary2())
Optimization terminated successfully.
         Current function value: 0.610769
         Iterations 5
                                      Results: Logit
===========================================================================================
Model:                       Logit                     No. Iterations:            5.0000   
Dependent Variable:          Metropolitan              Pseudo R-squared:          0.072    
Date:                        2018-10-01 22:23          AIC:                       2556.4632
No. Observations:            2083                      BIC:                       2590.3126
Df Model:                    5                         Log-Likelihood:            -1272.2  
Df Residuals:                2077                      LL-Null:                   -1370.6  
Converged:                   1.0000                    Scale:                     1.0000   
-------------------------------------------------------------------------------------------
                                             Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-------------------------------------------------------------------------------------------
Intercept                                   -1.7504   0.2229 -7.8543 0.0000 -2.1872 -1.3136
C(Region_Code, Treatment(reference=1))[T.0]  0.1641   0.2088  0.7859 0.4319 -0.2452  0.5733
C(Region_Code, Treatment(reference=1))[T.2]  1.3008   0.2040  6.3755 0.0000  0.9009  1.7007
C(Region_Code, Treatment(reference=1))[T.3]  0.5769   0.2261  2.5519 0.0107  0.1338  1.0200
Poverty                                      0.1054   0.0097 10.8517 0.0000  0.0864  0.1244
Crime_Rate                                   0.0000   0.0001  0.4400 0.6599 -0.0001  0.0002
===========================================================================================

In [147]:
# calculate odds ratio of fourth model
lreg4_OR = np.exp(lreg4.params)
print(lreg4_OR)
Intercept                                      0.173707
C(Region_Code, Treatment(reference=1))[T.0]    1.178330
C(Region_Code, Treatment(reference=1))[T.2]    3.672383
C(Region_Code, Treatment(reference=1))[T.3]    1.780533
Poverty                                        1.111149
Crime_Rate                                     1.000036
dtype: float64
In [148]:
# calculate upper and lower confidence intervals of fourth model
params4 = lreg4.params
conf4 = lreg4.conf_int()
conf4['OR'] = params4
conf4.columns = ['Lower CI','Upper CI','OR']
print(np.exp(conf4))
                                             Lower CI  Upper CI        OR
Intercept                                    0.112233  0.268853  0.173707
C(Region_Code, Treatment(reference=1))[T.0]  0.782586  1.774198  1.178330
C(Region_Code, Treatment(reference=1))[T.2]  2.461907  5.478031  3.672383
C(Region_Code, Treatment(reference=1))[T.3]  1.143196  2.773189  1.780533
Poverty                                      1.090197  1.132503  1.111149
Crime_Rate                                   0.999875  1.000197  1.000036

Summary of Fourth Model

The Midwest and West have significantly more non-metropolitan counties than the Northeast. The South is statistically insignificant in comparison to the Northeast. There is a slight overlap between confidence intervals of the Midwest and West regions. If this overlap did not exist, then we could more confidently assert that Midwest counties have a stronger association with non-metropolitan counties.

Fifth Model

In [149]:
# run fifth model summary; Midwest = treatment reference
lreg5 = smf.logit(formula='Metropolitan ~ Poverty + Crime_Rate + C(Region_Code, Treatment(reference=2))', data=sub13).fit()
print(lreg5.summary2())
Optimization terminated successfully.
         Current function value: 0.610769
         Iterations 5
                                      Results: Logit
===========================================================================================
Model:                       Logit                     No. Iterations:            5.0000   
Dependent Variable:          Metropolitan              Pseudo R-squared:          0.072    
Date:                        2018-10-01 22:23          AIC:                       2556.4632
No. Observations:            2083                      BIC:                       2590.3126
Df Model:                    5                         Log-Likelihood:            -1272.2  
Df Residuals:                2077                      LL-Null:                   -1370.6  
Converged:                   1.0000                    Scale:                     1.0000   
-------------------------------------------------------------------------------------------
                                             Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-------------------------------------------------------------------------------------------
Intercept                                   -0.4495   0.1549 -2.9025 0.0037 -0.7531 -0.1460
C(Region_Code, Treatment(reference=2))[T.0] -1.1367   0.1254 -9.0640 0.0000 -1.3825 -0.8909
C(Region_Code, Treatment(reference=2))[T.1] -1.3008   0.2040 -6.3755 0.0000 -1.7007 -0.9009
C(Region_Code, Treatment(reference=2))[T.3] -0.7239   0.1536 -4.7142 0.0000 -1.0249 -0.4230
Poverty                                      0.1054   0.0097 10.8517 0.0000  0.0864  0.1244
Crime_Rate                                   0.0000   0.0001  0.4400 0.6599 -0.0001  0.0002
===========================================================================================

In [150]:
# calculate odds ratio of fifth model
lreg5_OR = np.exp(lreg5.params)
print(lreg5_OR)
Intercept                                      0.637920
C(Region_Code, Treatment(reference=2))[T.0]    0.320862
C(Region_Code, Treatment(reference=2))[T.1]    0.272303
C(Region_Code, Treatment(reference=2))[T.3]    0.484844
Poverty                                        1.111149
Crime_Rate                                     1.000036
dtype: float64
In [151]:
# calculate upper and lower confidence intervals of fifth model
params5 = lreg5.params
conf5 = lreg5.conf_int()
conf5['OR'] = params5
conf5.columns = ['Lower CI','Upper CI','OR']
print(np.exp(conf5))
                                             Lower CI  Upper CI        OR
Intercept                                    0.470905  0.864170  0.637920
C(Region_Code, Treatment(reference=2))[T.0]  0.250939  0.410270  0.320862
C(Region_Code, Treatment(reference=2))[T.1]  0.182547  0.406189  0.272303
C(Region_Code, Treatment(reference=2))[T.3]  0.358831  0.655110  0.484844
Poverty                                      1.090197  1.132503  1.111149
Crime_Rate                                   0.999875  1.000197  1.000036

Summary of Fifth Model

The South, Northeast, and West are statistically significant in comparison to the Midwest. However, these regions have odds ratios less than 1, meaning that as the explanatory variable increases, the response variable being non-metropolitan is less likely. Again, the confidence intervals for these regions overlap one another.

Sixth Model

In [152]:
# run sixth model summary; West = treatment reference
lreg6 = smf.logit(formula='Metropolitan ~ Poverty + Crime_Rate + C(Region_Code, Treatment(reference=3))', data=sub13).fit()
print(lreg6.summary2())
Optimization terminated successfully.
         Current function value: 0.610769
         Iterations 5
                                      Results: Logit
===========================================================================================
Model:                       Logit                     No. Iterations:            5.0000   
Dependent Variable:          Metropolitan              Pseudo R-squared:          0.072    
Date:                        2018-10-01 22:23          AIC:                       2556.4632
No. Observations:            2083                      BIC:                       2590.3126
Df Model:                    5                         Log-Likelihood:            -1272.2  
Df Residuals:                2077                      LL-Null:                   -1370.6  
Converged:                   1.0000                    Scale:                     1.0000   
-------------------------------------------------------------------------------------------
                                             Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-------------------------------------------------------------------------------------------
Intercept                                   -1.1735   0.2012 -5.8336 0.0000 -1.5677 -0.7792
C(Region_Code, Treatment(reference=3))[T.0] -0.4128   0.1435 -2.8773 0.0040 -0.6940 -0.1316
C(Region_Code, Treatment(reference=3))[T.1] -0.5769   0.2261 -2.5519 0.0107 -1.0200 -0.1338
C(Region_Code, Treatment(reference=3))[T.2]  0.7239   0.1536  4.7142 0.0000  0.4230  1.0249
Poverty                                      0.1054   0.0097 10.8517 0.0000  0.0864  0.1244
Crime_Rate                                   0.0000   0.0001  0.4400 0.6599 -0.0001  0.0002
===========================================================================================

In [153]:
# calculate odds ratio
lreg6_OR = np.exp(lreg6.params)
print(lreg6_OR)
Intercept                                      0.309292
C(Region_Code, Treatment(reference=3))[T.0]    0.661785
C(Region_Code, Treatment(reference=3))[T.1]    0.561630
C(Region_Code, Treatment(reference=3))[T.2]    2.062519
Poverty                                        1.111149
Crime_Rate                                     1.000036
dtype: float64
In [154]:
# calculate upper and lower confidence intervals
params6 = lreg6.params
conf6 = lreg6.conf_int()
conf6['OR'] = params6
conf6.columns = ['Lower CI','Upper CI','OR']
print(np.exp(conf6))
                                             Lower CI  Upper CI        OR
Intercept                                    0.208517  0.458769  0.309292
C(Region_Code, Treatment(reference=3))[T.0]  0.499567  0.876678  0.661785
C(Region_Code, Treatment(reference=3))[T.1]  0.360596  0.874741  0.561630
C(Region_Code, Treatment(reference=3))[T.2]  1.526461  2.786829  2.062519
Poverty                                      1.090197  1.132503  1.111149
Crime_Rate                                   0.999875  1.000197  1.000036

Summary of Sixth Model

While all regions are statistically significant in comparison to the West, only the Midwest has a positive relationship with non-metropolitan and the confidence intervals of the Midwest do not overlap with the South or Northeast. We can conclude that Midwest counties are 2.06 times more likely to be non-metropolitan after controlling for the other variables.

 

 

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s