## Course 2: Data Analysis Tools

This course builds on the previous one, exploring statistical methods for hypothesis testing: ANOVA, chi-square, and Pearson correlation. I already performed hypothesis tests in the previous course, even though they were not required, but those were z-tests. Part of this course will therefore consist of running different kinds of hypothesis tests on the same variables as before.


The null and alternative hypotheses are the same for all the tests, where m1 and m2 are the group means being compared. They are as follows:

H0 (null hypothesis): m1 = m2 (no significant difference)

H1 (alternative hypothesis): m1 != m2 (significant difference)

## Week 1

Run an ANOVA (Analysis of Variance).

```
# import the statsmodels formula API, used to fit the ANOVA models
import statsmodels.formula.api as smf
```

## Metro vs. Non-Metro Poverty ANOVA

Is there a significant difference in average poverty between metro and non-metro counties?

```
# run ANOVA; ols stands for ordinary least squares
sub2 = df[['Metropolitan','Poverty']]
poverty_anova = smf.ols(formula='Poverty ~ C(Metropolitan)', data=sub2).fit()
# print results
print(poverty_anova.summary())
```

```
# print average poverty by metropolitan
sub2.groupby('Metropolitan').mean()
```

## Metro vs. Non-Metro Poverty ANOVA Results: Reject the Null Hypothesis

Since the F-statistic is fairly large, the variation between the group means is large relative to the variation within the groups. The p-value is reported in scientific notation: the exponent of -20 means the decimal point moves 20 places to the left, so the p-value is 0.0000000000000000000128. That is well below 0.05, so we can reject the null hypothesis. Combined with the group means above, this tells us that average poverty in non-metro counties is significantly higher than in metro counties. It is extremely unlikely that a difference this large is due to sampling error alone.
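To make the scientific-notation step concrete, here is a minimal sketch in plain Python. It assumes the reported p-value was `1.28e-20`, which is what the expanded figure in the text implies; the actual value comes from the model summary.

```python
# Assumed p-value, taken from the expanded figure quoted in the text
p_value = 1.28e-20

# Scientific notation, as it would appear in a summary table
scientific = format(p_value, ".2e")

# Fixed-point notation with 22 decimal places: 19 zeros, then 128
expanded = format(p_value, ".22f")

print(scientific)
print(expanded)

# The decision rule used throughout this post
alpha = 0.05
reject_null = p_value < alpha
print(reject_null)
```

In practice there is no need to expand the number by hand; comparing the value directly against 0.05 gives the same decision.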

## Poverty Group Violent Crime Rate ANOVA

Is there a significant difference in the average violent crime rate between counties with low poverty (<= 16%) and those with high poverty (> 16%)?

```
# run ANOVA
sub3 = sub1[['Violent Crime Rate','Poverty Group']]
sub3.columns = ['Violent_Crime_Rate','Poverty_Group'] # need to remove spaces from column headers
vcr_anova = smf.ols(formula='Violent_Crime_Rate ~ C(Poverty_Group)', data=sub3).fit()
# print results
print(vcr_anova.summary())
```

```
# print average vcr by poverty group
sub3.groupby('Poverty_Group').mean()
```

## Poverty Group Violent Crime Rate ANOVA Results: Reject the Null Hypothesis

The p-value is well below 0.05, so we can reject the null hypothesis. Taking the results of the ANOVA test into consideration with the means for each poverty group, we may conclude as follows: Counties with poverty above the median (> 16%) have significantly higher violent crime rates than counties below the median.
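The ANOVA only says that the group means differ; the direction comes from the means themselves. The sketch below computes a one-way F-statistic by hand on made-up numbers to show both pieces side by side. The data and the `one_way_f` helper are hypothetical and are not taken from the county dataset or from statsmodels.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F-statistic for a dict of {label: values}."""
    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2 for vals in groups.values()
    )
    # Variation of the observations around their own group mean
    ss_within = sum(
        (v - mean(vals)) ** 2 for vals in groups.values() for v in vals
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Made-up violent crime rates for two poverty groups (illustration only)
toy = {"low": [200.0, 250.0, 300.0], "high": [400.0, 450.0, 500.0]}

group_means = {label: mean(vals) for label, vals in toy.items()}
f_stat = one_way_f(toy)

print(group_means)
print(f_stat)
```

A large F means the between-group variation dominates the within-group variation; the group means then show which group is higher.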

## US Regions Crime Rate ANOVA

I’m going to create a new column called “Region”, using the US Census Bureau-designated regions. I then want to perform an ANOVA on the crime rate (violent and property crime rates added together) by region. Because there will be more than two groups, I will also have to perform a post hoc paired comparison.

```
# regions function
def get_regions(row):
    # map a row's state to its US Census Bureau region; anything unlisted falls through to 'West'
    if row['State'] in ['Connecticut','Maine','Massachusetts','New Hampshire','Rhode Island','Vermont',
                        'New Jersey','New York','Pennsylvania']:
        return 'Northeast'
    elif row['State'] in ['Illinois','Indiana','Michigan','Ohio','Wisconsin','Iowa','Kansas','Minnesota',
                          'Missouri','Nebraska','North Dakota','South Dakota']:
        return 'Midwest'
    elif row['State'] in ['Delaware','Florida','Georgia','Maryland','North Carolina','South Carolina',
                          'Virginia','West Virginia','Alabama','Kentucky','Mississippi','Tennessee',
                          'Arkansas','Louisiana','Oklahoma','Texas']:
        return 'South'
    else:
        return 'West'
# add regions to sub1
sub1['Region'] = sub1.apply(lambda row: get_regions(row), axis=1)
# check
sub1[sub1['State']=='California'].head(1)
```

```
sub1[sub1['State']=='Florida'].head(1)
```

```
sub1[sub1['State']=='Michigan'].head(1)
```

```
sub1[sub1['State']=='Maine'].head(1)
```
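The spot checks above can also be written as assertions against a plain lookup table. The following is a minimal sketch of that idea without pandas; the names `REGION_STATES`, `STATE_TO_REGION`, and `region_for` are hypothetical and do not appear in the original notebook. As in the function above, any state not listed falls through to 'West'.

```python
# States listed per region, mirroring the lists used in get_regions
REGION_STATES = {
    "Northeast": ["Connecticut", "Maine", "Massachusetts", "New Hampshire",
                  "Rhode Island", "Vermont", "New Jersey", "New York",
                  "Pennsylvania"],
    "Midwest": ["Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin", "Iowa",
                "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota",
                "South Dakota"],
    "South": ["Delaware", "Florida", "Georgia", "Maryland", "North Carolina",
              "South Carolina", "Virginia", "West Virginia", "Alabama",
              "Kentucky", "Mississippi", "Tennessee", "Arkansas", "Louisiana",
              "Oklahoma", "Texas"],
}

# Invert to a flat state -> region lookup
STATE_TO_REGION = {
    state: region
    for region, states in REGION_STATES.items()
    for state in states
}

def region_for(state):
    """Return the region for a state name, defaulting to 'West'."""
    return STATE_TO_REGION.get(state, "West")

# One spot check per region, matching the states checked above
for state in ["California", "Florida", "Michigan", "Maine"]:
    print(state, region_for(state))
```

A flat lookup like this makes a misspelled state name easier to catch, because a misspelling silently lands in the default 'West' bucket and a check such as `region_for('Indiana')` exposes it. With pandas, the same dictionary could plausibly be applied with `sub1['State'].map(STATE_TO_REGION).fillna('West')`.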

```
# add crime rate column to sub1
sub1['Crime Rate'] = sub1['Violent Crime Rate'] + sub1['Property Crime Rate']
# view
sub1.head()
```

```
# run ANOVA
sub4 = sub1[['Region','Crime Rate']]
sub4.columns = ['Region','Crime_Rate'] # need to remove spaces from column headers
regions_vcr_anova = smf.ols(formula='Crime_Rate ~ C(Region)', data=sub4).fit()
# print results
print(regions_vcr_anova.summary())
```

```
sub4.groupby('Region').mean()
```

```
# import multicomp for post hoc paired comparison
import statsmodels.stats.multicomp as multi
# run post hoc paired comparison
sub4_mc = multi.MultiComparison(sub4['Crime_Rate'], sub4['Region']).tukeyhsd()
sub4_mc.summary()
```

## US Regions Crime Rate ANOVA Results: Reject the Null Hypothesis

The ANOVA test tells us that at least two of the regions have significantly different mean crime rates. To find out which regions differ from one another, we need a post hoc paired comparison. The table above, produced by the Tukey HSD post hoc comparison, shows that every pair of regions has significantly different mean crime rates. A True in the reject column means we can reject the null hypothesis for that pair (i.e., the two means are significantly different). If it said False, the mean crime rates for that pair would not be significantly different, and we could not reject the null hypothesis for that pair.
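As a rough illustration of why a dedicated post hoc procedure is used instead of running a separate test for each pair, the sketch below counts the pairwise comparisons among four regions and estimates the chance of at least one false positive if each were tested at 0.05. It assumes the tests are independent, which is only an approximation here.

```python
from itertools import combinations

regions = ["Midwest", "Northeast", "South", "West"]
alpha = 0.05  # per-comparison significance level

# Every unordered pair of regions that a post hoc comparison has to cover
pairs = list(combinations(regions, 2))

# Probability of at least one false positive across all pairs,
# assuming independent tests (an approximation)
familywise_error = 1 - (1 - alpha) ** len(pairs)

print(len(pairs))
print(round(familywise_error, 4))
```

With six pairs, the uncorrected family-wise error rate is roughly 26%, well above the nominal 5%. Tukey's HSD adjusts for this so that the overall error rate across all pairs stays near the chosen level.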

The South has the highest mean crime rate at 960.69, West is second at 812.5, Midwest is third at 591.8, and Northeast is last at 186.2. These rates are per 100,000 people.