## Week 3

This week we are required to perform additional data management (or data munging): coding out missing data, coding in valid data, recoding variables, creating secondary variables, and binning or grouping variables. I already did some of this for the week 2 assignment, even though it was not required then. For instance, I created the Violent Crime Rate and Property Crime Rate variables. I also removed all rows containing nulls, as there were relatively few of them. That said, there are additional data munging procedures I would like to perform on this dataset.
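As a rough illustration of the kinds of steps listed above, here is a minimal sketch on a made-up table. The column names `Population` and `Violent Crimes` and the per-100,000 scaling are assumptions for illustration only, and not necessarily how the week 2 variables were built.

```python
import numpy as np
import pandas as pd

# toy data; the columns and values are hypothetical
toy = pd.DataFrame({
    'County': ['A', 'B', 'C', 'D'],
    'Population': [50000, 120000, np.nan, 8000],
    'Violent Crimes': [100, 360, 40, 8],
})

# code out missing data: drop any row that contains a null
toy = toy.dropna().copy()

# create a secondary variable: crimes per 100,000 residents (assumed scaling)
toy['Violent Crime Rate'] = toy['Violent Crimes'] / toy['Population'] * 100000

print(toy)
```

Dropping rows is only reasonable when few rows are affected, which is the situation described above.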

```
# let's take another look at the data
df.head()
```

## Subset by Metropolitan

I want to create two subsets of the current dataset: one consisting of only metropolitan counties and the other consisting of only non-metropolitan counties. I then want to explore frequencies of poverty and crime rates within these subsets.

```
# subset the data by Metropolitan
metro = df[df['Metropolitan']==1]
non_metro = df[df['Metropolitan']==0]
```

## Poverty Frequencies

```
# create poverty bins and view frequencies by metro and non_metro
poverty_bins = [0,10,20,30,40,50]
metro_poverty_freq = pd.cut(metro['Poverty'], poverty_bins).value_counts(sort=False)
nonmetro_poverty_freq = pd.cut(non_metro['Poverty'], poverty_bins).value_counts(sort=False)
metro_poverty_prop = pd.cut(metro['Poverty'], poverty_bins).value_counts(sort=False, normalize=True)
nonmetro_poverty_prop = pd.cut(non_metro['Poverty'], poverty_bins).value_counts(sort=False, normalize=True)
# metro count by poverty percentage bin
print('Metro count by poverty percentage bin')
metro_poverty_freq
```

```
# metro proportion by poverty percentage bin
print('Metro proportion by poverty percentage bin')
metro_poverty_prop
```

```
# average metro poverty percentage
metro_poverty_avg = round(metro['Poverty'].mean(), 2)
metro_poverty_avg
```

```
# non-metro count by poverty percentage bin
print('Non-metro count by poverty percentage bin')
nonmetro_poverty_freq
```

```
# non-metro proportion by poverty percentage bin
print('Non-metro proportion by poverty percentage bin')
nonmetro_poverty_prop
```

```
# average non-metro poverty percentage
nonmetro_poverty_avg = round(non_metro['Poverty'].mean(), 2)
nonmetro_poverty_avg
```

```
# plot poverty distribution by metropolitan
# take a sample from df so the plot isn't crowded
sns.stripplot(x='Metropolitan', y='Poverty', data=df[:700], jitter=True, size=3)
sns.despine()
plt.legend(labels=['Non-Metro','Metro'], loc='upper center', frameon=False)
plt.tick_params(axis='x', bottom=False, labelbottom=False)
plt.title('Pct. Poverty, Metro vs. Non-Metro', fontweight='bold', loc='left', y=1.02)
plt.axhline(nonmetro_poverty_avg)
plt.axhline(metro_poverty_avg, color='orange')
plt.text(0.35, 18, 'avg. lines')
plt.savefig('sample_poverty_spread_metro_nonmetro.png')
plt.show()
```

## Poverty of Metro and Non-Metro Counties

In both metro and non-metro counties, the largest share of counties falls in the 10-20% poverty bin. Metro counties have higher proportions than non-metro counties in the 0-10% and 10-20% bins, while non-metro counties have higher proportions in the higher poverty ranges. The dot plot (a.k.a. strip plot) above helps us visualize and compare the poverty distributions of metropolitan and non-metropolitan counties.
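One detail worth keeping in mind when reading the bins: by default `pd.cut` builds intervals that are open on the left and closed on the right, so a value of exactly 10 lands in (0, 10] and a value of exactly 0 lands in no bin at all. A small sketch with made-up poverty values (not taken from the dataset) shows this:

```python
import pandas as pd

# hypothetical poverty percentages, chosen to sit on or near bin edges
toy_poverty = pd.Series([0.0, 5.0, 10.0, 10.1, 20.0, 35.0])
poverty_bins = [0, 10, 20, 30, 40, 50]

# default behaviour: intervals are (left, right], so 0.0 is left unbinned
binned = pd.cut(toy_poverty, poverty_bins)
counts = binned.value_counts(sort=False)
print(counts)
print('unbinned values:', binned.isna().sum())

# include_lowest=True makes the first interval also capture the lower edge
binned_incl = pd.cut(toy_poverty, poverty_bins, include_lowest=True)
print('unbinned values with include_lowest:', binned_incl.isna().sum())
```

Whether this matters for the real data depends on whether any county reports a value exactly on the lowest edge.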

Looking at the average poverty percentage, we see 14.78% for metro counties and 17.34% for non-metro counties. After observing these figures I did some research and found my analysis confirmed by the image below, which is taken from the Center on Budget and Policy Priorities. Note the percentages in the 2015 column. The minor variation between my figures and theirs is likely because both are based on samples rather than on the population (all the data) itself.

Interestingly, the same article linked to above notes that poverty is actually lower in non-metro areas under the Census Bureau’s Supplemental Poverty Measure (SPM), while also noting that some analysts do not agree with the SPM approach:

> The above poverty data, whether from the CPS or the ACS, reflect the official poverty measure, which doesn’t account for most government benefits or adjust for cost-of-living differences by geographic area. Under the official measure, poverty is much higher in non-metro areas than in metro areas. But under the Census Bureau’s Supplemental Poverty Measure (SPM), which accounts for most government benefits and adjusts for local cost-of-living differences, the poverty rate is actually lower in non-metro areas (13.2 percent) than in metro areas (14.5 percent). Although some analysts do not fully agree with the SPM’s approach to geographic adjustment, many analysts agree that either those adjustments or another, similar approach makes sense.

To find out more on the Supplemental Poverty Measure, click here.

## Hypothesis Test of Average Poverty Between Metro and Non-Metro Counties

The question now arises, “Is the difference between the average poverty percentages of metro and non-metro counties statistically significant?” In other words, can we confidently say that non-metro counties have greater poverty on average, or could the difference simply be due to sampling variation? Below is a z-test to establish the significance of these results. I will be using a significance level (alpha) of 0.05. My null and alternative hypotheses are as follows:

H0 (null hypothesis): m1 = m2 (i.e. no significant difference between the means)

H1 (alternative hypothesis): m1 != m2 (i.e. significant difference between the means)
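For reference, a two-sample z statistic for hypotheses of this form can be computed by hand. The sketch below uses only the standard library and the unpooled standard error sqrt(s1²/n1 + s2²/n2); it is an illustrative formulation and may differ slightly from what a given library computes internally. The numbers are made up, and the lists are kept tiny for readability even though a z-test is normally reserved for large samples.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_ztest(x1, x2):
    """Return (z, two-sided p-value) for H0: mean(x1) == mean(x2)."""
    # standard error of the difference between the two sample means
    se = sqrt(variance(x1) / len(x1) + variance(x2) / len(x2))
    z = (mean(x1) - mean(x2)) / se
    # two-sided p-value from the standard normal distribution
    p = 2 * NormalDist().cdf(-abs(z))
    return z, p


# hypothetical poverty percentages for two small groups
group_a = [12.1, 14.3, 13.8, 15.0, 11.9, 14.6]
group_b = [16.2, 18.1, 17.4, 19.0, 16.8, 17.7]

z, p = two_sample_ztest(group_a, group_b)
print('z =', z)
print('p =', p)
```

A negative z here simply means the first group's mean is lower than the second's.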

I want the two samples to be the same or roughly the same size. Since the non-metro data is almost twice as large as the metro data, I will take a subset of the non-metro data with the same number of rows as the metro data.

```
# view sample sizes
print('Metro sample size')
print(len(metro))
print('\n' + 'Non-metro sample size')
print(len(non_metro))
```

```
# reduce non-metro to equal metro
non_metro_767 = non_metro[:767]
len(non_metro_767)
```
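Note that a slice such as `non_metro[:767]` keeps the first 767 rows in their existing order, which is only representative if the row order is unrelated to poverty. A possible alternative, sketched here on a made-up frame, is to draw a random sample with `DataFrame.sample`. The toy frames and the `random_state` value are assumptions chosen for illustration and reproducibility.

```python
import pandas as pd

# hypothetical stand-ins for the metro and non-metro subsets
toy_metro = pd.DataFrame({'Poverty': [10.0, 12.5, 14.0]})
toy_non_metro = pd.DataFrame({'Poverty': [15.0, 18.5, 17.0, 21.0, 16.5, 19.0]})

# draw as many non-metro rows as there are metro rows, without replacement
toy_non_metro_sample = toy_non_metro.sample(n=len(toy_metro), random_state=42)

print(len(toy_non_metro_sample))
```

Using `len(toy_metro)` instead of a hard-coded number keeps the two sizes in step if the data changes.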

```
# import ztest from statsmodels.stats.weightstats
from statsmodels.stats.weightstats import ztest
# assign variables
metro_poverty = metro['Poverty']
nonmetro767_poverty = non_metro_767['Poverty']
# run test
poverty_ztest = ztest(x1=metro_poverty, x2=nonmetro767_poverty)
# print results
print('Test Statistic')
print(poverty_ztest[0])
print('\n' + 'P-Value')
print('{:.25f}'.format(float(poverty_ztest[1])))
```

## Poverty Z-Test Results: Reject the Null Hypothesis

The p-value is well below the significance level of 0.05, so we reject the null hypothesis (no significant difference between the means) in favor of the alternative hypothesis (significant difference between the means). The critical value approach, which uses the test statistic, leads to the same conclusion: we reject the null hypothesis because -9.4 is less than -1.96. To understand the critical value approach and how we arrive at -1.96/+1.96, check out p. 139 of this book (pdf will download after you click on the link).
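The ±1.96 critical values come from the standard normal distribution: for a two-sided test at alpha = 0.05, 2.5% of the probability sits in each tail. A short standard-library sketch, using the test statistic of roughly -9.4 quoted above:

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()  # mean 0, standard deviation 1

# two-sided test: split alpha evenly across both tails
critical = std_normal.inv_cdf(1 - alpha / 2)
print(round(critical, 2))  # 1.96

z = -9.4  # approximate test statistic reported in the text
reject_null = abs(z) > critical
print(reject_null)  # True
```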

In other words, we can confidently say that the average percentage of poverty is greater in non-metro areas than in metro areas. Keep in mind, however, that this does not take into consideration the Supplemental Poverty Measure (SPM).

## Crime Rate Frequencies

```
# create Violent Crime Rate and Property Crime Rate bins
# and view frequencies by metro and non-metro
vcr_bins = [0,50,100,150,200,250,300,1000]
pcr_bins = [0,250,500,750,1000,1250,1500,6000]
# violent crime rate variables
metro_vcr_freq = pd.cut(metro['Violent Crime Rate'], vcr_bins).value_counts(sort=False)
metro_vcr_prop = pd.cut(metro['Violent Crime Rate'], vcr_bins).value_counts(sort=False, normalize=True)
nonmetro_vcr_freq = pd.cut(non_metro['Violent Crime Rate'], vcr_bins).value_counts(sort=False)
nonmetro_vcr_prop = pd.cut(non_metro['Violent Crime Rate'], vcr_bins).value_counts(sort=False, normalize=True)
# property crime rate variables
metro_pcr_freq = pd.cut(metro['Property Crime Rate'], pcr_bins).value_counts(sort=False)
metro_pcr_prop = pd.cut(metro['Property Crime Rate'], pcr_bins).value_counts(sort=False, normalize=True)
nonmetro_pcr_freq = pd.cut(non_metro['Property Crime Rate'], pcr_bins).value_counts(sort=False)
nonmetro_pcr_prop = pd.cut(non_metro['Property Crime Rate'], pcr_bins).value_counts(sort=False, normalize=True)
# metro count by violent crime rate bin
print('Metro count by violent crime rate bin')
metro_vcr_freq
```

```
# metro proportion by violent crime rate bin
metro_vcr_prop
```

```
# average metro violent crime rate
metro_vcr_avg = round(metro['Violent Crime Rate'].mean(), 2)
metro_vcr_avg
```

```
# non-metro count by violent crime rate bin
nonmetro_vcr_freq
```

```
# non-metro proportion by violent crime rate bin
nonmetro_vcr_prop
```

```
# average non-metro violent crime rate
nonmetro_vcr_avg = round(non_metro['Violent Crime Rate'].mean(), 2)
nonmetro_vcr_avg
```

```
# metro count by property crime rate bin
metro_pcr_freq
```

```
# metro proportions by property crime rate bin
metro_pcr_prop
```

```
# average metro property crime rate
metro_pcr_avg = round(metro['Property Crime Rate'].mean(), 2)
metro_pcr_avg
```

```
# non-metro count by property crime rate bin
nonmetro_pcr_freq
```

```
# non-metro proportions by property crime rate bin
nonmetro_pcr_prop
```

```
# average non-metro property crime rate
nonmetro_pcr_avg = round(non_metro['Property Crime Rate'].mean(), 2)
nonmetro_pcr_avg
```

```
# plot violent crime rate distribution by metropolitan
# take a sample from df so the plot isn't crowded
sns.stripplot(x='Metropolitan', y='Violent Crime Rate', data=df[:700], jitter=True, size=3)
sns.despine()
plt.legend(labels=['Non-Metro','Metro'], loc='upper center', frameon=False)
plt.tick_params(axis='x', bottom=False, labelbottom=False)
plt.title('Violent Crime Rate, Metro vs. Non-Metro', fontweight='bold', loc='left', y=1.02)
plt.axhline(nonmetro_vcr_avg)
plt.axhline(metro_vcr_avg, color='orange')
plt.text(0.35, 175, 'avg. lines')
plt.savefig('sample_vcr_spread_metro_nonmetro.png')
plt.show()
```

```
# plot property crime rate distribution by metropolitan
# take a sample from df so the plot isn't crowded
sns.stripplot(x='Metropolitan', y='Property Crime Rate', data=df[:700], jitter=True, size=3)
sns.despine()
plt.legend(labels=['Non-Metro','Metro'], loc='upper center', frameon=False)
plt.tick_params(axis='x', bottom=False, labelbottom=False)
plt.title('Property Crime Rate, Metro vs. Non-Metro', fontweight='bold', loc='left', y=1.02)
plt.axhline(nonmetro_pcr_avg)
plt.axhline(metro_pcr_avg, color='orange')
plt.text(0.35, 900, 'avg. lines')
plt.savefig('sample_pcr_spread_metro_nonmetro.png')
plt.show()
```

## Hypothesis Test of Average Violent Crime and Property Crime Rates Between Metro and Non-Metro

Just like the hypothesis test performed on average poverty between metro and non-metro counties, I want to perform a test of average violent crime rate between metro and non-metro counties and average property crime rate between metro and non-metro counties. A significance level of 0.05 will be utilized. My hypotheses are as follows:

H0 (null hypothesis): m1 = m2 (no significant difference between the means)

H1 (alternative hypothesis): m1 != m2 (significant difference between the means)

These hypothesis statements apply to both violent crime and property crime. First, we’ll run a hypothesis test on the average violent crime rate, then we’ll run one on the average property crime rate. I will again use the non_metro_767 subset for these tests.

```
# assign variables
metro_vcr = metro['Violent Crime Rate']
nonmetro767_vcr = non_metro_767['Violent Crime Rate']
# run test
vcr_ztest = ztest(x1=metro_vcr, x2=nonmetro767_vcr)
# print results
print('Test Statistic')
print(vcr_ztest[0])
print('\n' + 'P-Value')
print('{:.25f}'.format(float(vcr_ztest[1])))
```

```
# assign variables
metro_pcr = metro['Property Crime Rate']
nonmetro767_pcr = non_metro_767['Property Crime Rate']
# run test
pcr_ztest = ztest(x1=metro_pcr, x2=nonmetro767_pcr)
# print results
print('Test Statistic')
print(pcr_ztest[0])
print('\n' + 'P-Value')
print('{:.25f}'.format(float(pcr_ztest[1])))
```

## Crime Rate Z-Test Results: Fail to Reject the Null Hypothesis

For the z-test on the average violent crime rates of metro and non-metro counties, we fail to reject the null hypothesis, as the p-value is not less than the significance level of 0.05. Recall that the null hypothesis says that the means of the two groups are equal. Failing to reject it does not necessarily mean that they are equal, only that we do not have statistically significant evidence that they are *not equal*.

For the z-test on the average property crime rates between metro and non-metro counties, we also fail to reject the null hypothesis, as the p-value is not less than the significance level of 0.05.

## Average of Key Variables by State

```
# create subset of df
sub1 = df[['State','County','Poverty','ChildPoverty','Violent Crime Rate','Property Crime Rate','Employee Rate']]
# create state group
state_group = sub1.groupby(by='State')
# generate average of the numeric variables by state
# (numeric_only=True skips the text column 'County')
state_group.mean(numeric_only=True)
```