Data Analysis & Interpretation 2.1: Analysis of Variance (ANOVA)

Course 2: Data Analysis Tools

This course builds on the previous one, exploring advanced statistical methods in the area of hypothesis testing: ANOVA, Chi-Square, and Pearson correlation. Keep in mind, that I did perform hypothesis testing in the previous course even though it was not required. However, I performed z-tests. So part of this course will consist of me performing different kinds of hypothesis tests on the same variables as before.

To view all assignment posts for this specialization program, click here.

The null and alternate hypotheses are the same for all the tests. They are as follows:

H0 (null hypothesis): m1 = m2 (no significant difference)

H1 (altnerate hypothesis): m1 != m2 (significant difference)

Week 1

Run an ANOVA (Analysis of Variance).

 

In [78]:
# import stats class to be used for ANOVA
import statsmodels.formula.api as smf

Metro vs. Non-Metro Poverty ANOVA

Is there significant difference in the average of poverty between metro and non-metro counties?

 

In [79]:
# run ANOVA; ols stands for orindary least squares
sub2 = df[['Metropolitan','Poverty']]

poverty_anova = smf.ols(formula='Poverty ~ C(Metropolitan)', data=sub2).fit()

# print results
print(poverty_anova.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Poverty   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.040
Method:                 Least Squares   F-statistic:                     88.53
Date:                Wed, 19 Sep 2018   Prob (F-statistic):           1.28e-20
Time:                        15:56:48   Log-Likelihood:                -6686.9
No. Observations:                2083   AIC:                         1.338e+04
Df Residuals:                    2081   BIC:                         1.339e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               17.3445      0.165    104.864      0.000      17.020      17.669
C(Metropolitan)[T.1]    -2.5646      0.273     -9.409      0.000      -3.099      -2.030
==============================================================================
Omnibus:                      184.380   Durbin-Watson:                   1.373
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              257.169
Skew:                           0.708   Prob(JB):                     1.43e-56
Kurtosis:                       3.978   Cond. No.                         2.42
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [80]:
# print average poverty by metropolitan
sub2.groupby('Metropolitan').mean()
Out[80]:
Poverty
Metropolitan
0 17.344529
1 14.779922

Metro vs. Non-Metro Poverty ANOVA Results: Reject the Null Hypothesis

Since the f-statistic is fairly large, the variation among sample means is significant. The p-value is in scientific notation. The -20 means we need to move the decimal 20 places to the left. So, the p-value is 0.0000000000000000000128. The p-value is well below 0.05, so we can reject the null hypothesis. In other words, average poverty of non-metro counties is significantly higher than average poverty of metro counties. It is virtually impossible that this is due to sampling error.

 

Poverty Group Violent Crime Rate ANOVA

Is there significant difference in the average of violent crime rate between counties with low poverty (<= 16%) and those with high poverty (> 16%)?

 

In [81]:
# run ANOVA
sub3 = sub1[['Violent Crime Rate','Poverty Group']]
sub3.columns = ['Violent_Crime_Rate','Poverty_Group'] # need to remove spaces from column headers

vcr_anova = smf.ols(formula='Violent_Crime_Rate ~ C(Poverty_Group)', data=sub3).fit()

# print results
print(vcr_anova.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:     Violent_Crime_Rate   R-squared:                       0.045
Model:                            OLS   Adj. R-squared:                  0.045
Method:                 Least Squares   F-statistic:                     97.98
Date:                Wed, 19 Sep 2018   Prob (F-statistic):           1.32e-22
Time:                        15:56:48   Log-Likelihood:                -12666.
No. Observations:                2083   AIC:                         2.534e+04
Df Residuals:                    2081   BIC:                         2.535e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept                    75.6006      3.228     23.417      0.000      69.269      81.932
C(Poverty_Group)[T.> 16%]    45.9387      4.641      9.899      0.000      36.837      55.040
==============================================================================
Omnibus:                     1208.216   Durbin-Watson:                   1.431
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            12515.974
Skew:                           2.578   Prob(JB):                         0.00
Kurtosis:                      13.845   Cond. No.                         2.58
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [82]:
# print average vcr by poverty group
sub3.groupby('Poverty_Group').mean()
Out[82]:
Violent_Crime_Rate
Poverty_Group
<= 16% 75.600558
> 16% 121.539296

Poverty Group Violent Crime Rate ANOVA Results: Reject the Null Hypothesis

The p-value is well below 0.05, so we can reject the null hypothesis. Taking the results of the ANOVA test into consideration with the means for each poverty group, we may conclude as follows: Counties with poverty above the median (> 16%) have significantly higher violent crime rates than counties below the median.

 

US Regions Crime Rate ANOVA

I’m going to create a new column called “Regions”. I am using the US Census Bureau-designated regions. I then want to perform an ANOVA on the crime rate (violent and property crime rates added together) by region. I will have to perform a post hoc paired comparison, as there will be more than two groups.

 

In [83]:
# regions function
def get_regions(row):
    if row['State'] in ['Connecticut','Maine','Massachusetts','New Hampshire','Rhode Island','Vermont',
                        'New Jersey','New York','Pennsylvania']:
        return 'Northeast'
    elif row['State'] in ['Illinois','Indiania','Michigan','Ohio','Wisconsin','Iowa','Kansas','Minnesota',
                         'Missouri','Nebraska','North Dakota','South Dakota']:
        return 'Midwest'
    elif row['State'] in ['Delaware','Florida','Georgia','Maryland','North Carolina','South Carolina',
                          'Virginia','West Virginia','Alabama','Kentucky','Mississippi','Tennessee',
                          'Arkansas','Louisiana','Oklahoma','Texas']:
        return 'South'
    else:
        return 'West'
    
# add regions to sub1
sub1['Region'] = sub1.apply(lambda row: get_regions(row), axis=1)

# check
sub1[sub1['State']=='California'].head(1)
State County Poverty ChildPoverty Violent Crime Rate Property Crime Rate Employee Rate Total Employees Poverty Group Violent Crime Group Property Crime Group Region
99 California Alameda 12.5 15.2 32.18 131.04 98.42 1560 <= 16% lower half lower half West
In [84]:
sub1[sub1['State']=='Florida'].head(1)
Out[84]:
State County Poverty ChildPoverty Violent Crime Rate Property Crime Rate Employee Rate Total Employees Poverty Group Violent Crime Group Property Crime Group Region
156 Florida Alachua 24.3 23.5 213.6 733.23 136.89 348 > 16% upper half upper half South
In [85]:
sub1[sub1['State']=='Michigan'].head(1)
Out[85]:
State County Poverty ChildPoverty Violent Crime Rate Property Crime Rate Employee Rate Total Employees Poverty Group Violent Crime Group Property Crime Group Region
686 Michigan Alcona 15.2 21.4 123.22 1184.83 218.01 23 <= 16% upper half upper half Midwest
In [86]:
sub1[sub1['State']=='Maine'].head(1)
Out[86]:
State County Poverty ChildPoverty Violent Crime Rate Property Crime Rate Employee Rate Total Employees Poverty Group Violent Crime Group Property Crime Group Region
650 Maine Androscoggin 15.7 23.5 14.9 126.64 26.07 28 <= 16% lower half lower half Northeast
In [87]:
# add crime rate column to sub1
sub1['Crime Rate'] = sub1['Violent Crime Rate'] + sub1['Property Crime Rate']

# view
sub1.head()
State County Poverty ChildPoverty Violent Crime Rate Property Crime Rate Employee Rate Total Employees Poverty Group Violent Crime Group Property Crime Group Region Crime Rate
0 Alabama Baldwin 13.4 19.2 58.94 332.10 147.60 288 <= 16% lower half lower half South 391.04
1 Alabama Bibb 16.8 27.9 30.97 181.38 53.09 12 > 16% lower half lower half South 212.35
2 Alabama Bullock 24.6 38.4 196.67 486.98 121.75 13 > 16% upper half lower half South 683.65
3 Alabama Clay 16.7 22.5 66.48 110.81 184.68 25 > 16% upper half lower half South 177.29
4 Alabama Cleburne 17.0 26.3 39.99 953.21 179.98 27 > 16% lower half upper half South 993.20
In [88]:
# run ANOVA
sub4 = sub1[['Region','Crime Rate']]
sub4.columns = ['Region','Crime_Rate'] # need to remove spaces from column headers

regions_vcr_anova = smf.ols(formula='Crime_Rate ~ C(Region)', data=sub4).fit()

# print results
print(regions_vcr_anova.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Crime_Rate   R-squared:                       0.124
Model:                            OLS   Adj. R-squared:                  0.123
Method:                 Least Squares   F-statistic:                     98.23
Date:                Thu, 20 Sep 2018   Prob (F-statistic):           1.84e-59
Time:                        19:54:34   Log-Likelihood:                -16231.
No. Observations:                2083   AIC:                         3.247e+04
Df Residuals:                    2079   BIC:                         3.249e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                591.8152     22.155     26.712      0.000     548.367     635.264
C(Region)[T.Northeast]  -405.5965     56.197     -7.217      0.000    -515.805    -295.387
C(Region)[T.South]       368.8758     29.187     12.638      0.000     311.637     426.115
C(Region)[T.West]        220.6612     40.470      5.453      0.000     141.296     300.026
==============================================================================
Omnibus:                      800.790   Durbin-Watson:                   1.564
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4366.823
Skew:                           1.733   Prob(JB):                         0.00
Kurtosis:                       9.189   Cond. No.                         5.29
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [89]:
sub4.groupby('Region').mean()
Out[89]:
Crime_Rate
Region
Midwest 591.815221
Northeast 186.218760
South 960.691039
West 812.476400
In [90]:
# import multicomp for post hoc paired comparison
import statsmodels.stats.multicomp as multi

# run post hoc paired comparison
sub4_mc = multi.MultiComparison(sub4['Crime_Rate'], sub4['Region']).tukeyhsd()
sub4_mc.summary()
Out[90]:
Multiple Comparison of Means – Tukey HSD,FWER=0.05
group1 group2 meandiff lower upper reject
Midwest Northeast -405.5965 -550.0863 -261.1066 True
Midwest South 368.8758 293.8321 443.9196 True
Midwest West 220.6612 116.6094 324.713 True
Northeast South 774.4723 632.9827 915.9619 True
Northeast West 626.2576 467.4668 785.0485 True
South West -148.2146 -248.0583 -48.3709 True

US Regions Crime Rate ANOVA Results: Reject the Null Hypothesis

The ANOVA test lets us know that at least two of the regions have significantly different mean crime rates from each other. However, in order to know which regions are significantly different from one another, we need to perform a post hoc paired comparison. The above table, which is the post hoc paired comparison table, let’s us know that all regions have significantly different mean crime rates from one another. The True in the reject column lets us know we can reject the null hypothesis (i.e. the pair means are significantly different from one another). If it said false, then the mean crime rates between that pair would not be significantly different and we would not be able to reject the null hypothesis for that pair.

The South has the highest mean crime rate at 960.69, West is second at 812.5, Midwest is third at 591.8, and Northeast is last at 186.2. These rates are per 100,000 people.

 

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s