Data Analysis & Interpretation 1.3: Frequency Distributions and Z-Tests

Week 3

This week we are required to perform additional data management (or data munging), such as coding out missing data, coding in valid data, recoding variables, creating secondary variables and binning or grouping variables. I actually performed some of this for the week 2 assignment, even though it was not required during that week. For instance, I created the Violent Crime Rate and Property Crime Rate variables. I also removed all nulls from my dataset, as there were relatively few rows that contained nulls. That being said, there are additional data munging procedures I would like to perform on this dataset.

 

In [35]:
# let's take another look at the data
df.head()
Out[35]:
   CensusId    State    County  TotalPop    Men  Women  Hispanic  White  Black  Native
0      1003  Alabama   Baldwin    195121  95314  99807       4.5   83.1    9.5     0.6
1      1007  Alabama      Bibb     22604  12073  10531       2.2   74.5   21.4     0.4
2      1011  Alabama   Bullock     10678   5660   5018       4.4   22.2   70.7     1.2
3      1027  Alabama      Clay     13537   6671   6866       3.2   79.9   14.4     0.7
4      1029  Alabama  Cleburne     15002   7334   7668       2.3   92.5    2.9     0.2

   SelfEmployed  FamilyWork  Unemployment  Violent Crime  Property Crime  Metropolitan  Total Employees
0           5.8         0.4           7.5            115             648             1              288
1           6.7         0.4           8.3              7              41             1               12
2           5.4         0.0          18.0             21              52             0               13
3           7.8         0.0           9.4              9              15             0               25
4           8.4         0.0           8.3              6             143             0               27

   Violent Crime Rate  Property Crime Rate  Employee Rate
0               58.94               332.10         147.60
1               30.97               181.38          53.09
2              196.67               486.98         121.75
3               66.48               110.81         184.68
4               39.99               953.21         179.98

5 rows × 44 columns
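
For reference, the rate columns in the rows above are consistent with counts per 100,000 residents (for Baldwin, 115 / 195121 × 100,000 ≈ 58.94). Here is a minimal sketch of that calculation. The helper name rate_per_100k is hypothetical, and the per-100,000 scaling is inferred from the displayed values rather than restated from last week's code.

```python
# Hypothetical helper; the per-100,000 scaling is inferred from the displayed rows.
def rate_per_100k(count, population):
    """Return events per 100,000 residents, rounded to two decimals."""
    return round(count / population * 100_000, 2)

# Baldwin County, Alabama (row 0 above): Violent Crime, Property Crime, Total Employees
baldwin_violent = rate_per_100k(115, 195121)
baldwin_property = rate_per_100k(648, 195121)
baldwin_employees = rate_per_100k(288, 195121)
print(baldwin_violent, baldwin_property, baldwin_employees)  # 58.94 332.1 147.6
```

These match the Violent Crime Rate, Property Crime Rate and Employee Rate values shown for row 0.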

Subset by Metropolitan

I want to create two subsets of the current dataset: one consisting of only metropolitan counties and the other consisting of only non-metropolitan counties. I then want to explore frequencies of poverty and crime rates within these subsets.

 

In [36]:
# subset the data by Metropolitan
metro = df[df['Metropolitan']==1]
non_metro = df[df['Metropolitan']==0]

Poverty Frequencies

In [37]:
# create poverty bins and view frequencies by metro and non_metro
poverty_bins = [0,10,20,30,40,50]
metro_poverty_freq = pd.cut(metro['Poverty'], poverty_bins).value_counts(sort=False)
nonmetro_poverty_freq = pd.cut(non_metro['Poverty'], poverty_bins).value_counts(sort=False)
metro_poverty_prop = pd.cut(metro['Poverty'], poverty_bins).value_counts(sort=False, normalize=True)
nonmetro_poverty_prop = pd.cut(non_metro['Poverty'], poverty_bins).value_counts(sort=False, normalize=True)

# metro count by poverty percentage bin
print('Metro count by poverty percentage bin')
metro_poverty_freq
Metro count by poverty percentage bin
Out[37]:
(0, 10]     152
(10, 20]    502
(20, 30]    107
(30, 40]      5
(40, 50]      1
Name: Poverty, dtype: int64
In [38]:
# metro proportion by poverty percentage bin
print('Metro proportion by poverty percentage bin')
metro_poverty_prop
Metro proportion by poverty percentage bin
Out[38]:
(0, 10]     0.198175
(10, 20]    0.654498
(20, 30]    0.139505
(30, 40]    0.006519
(40, 50]    0.001304
Name: Poverty, dtype: float64
In [39]:
# average metro poverty percentage
metro_poverty_avg = round(metro['Poverty'].mean(), 2)
metro_poverty_avg
Out[39]:
14.78
In [40]:
# non-metro count by poverty percentage bin
print('Non-metro count by poverty percentage bin')
nonmetro_poverty_freq
Non-metro count by poverty percentage bin
Out[40]:
(0, 10]     136
(10, 20]    776
(20, 30]    359
(30, 40]     40
(40, 50]      5
Name: Poverty, dtype: int64
In [41]:
# non-metro proportion by poverty percentage bin
print('Non-metro proportion by poverty percentage bin')
nonmetro_poverty_prop
Non-metro proportion by poverty percentage bin
Out[41]:
(0, 10]     0.103343
(10, 20]    0.589666
(20, 30]    0.272796
(30, 40]    0.030395
(40, 50]    0.003799
Name: Poverty, dtype: float64
In [42]:
# average non-metro poverty percentage
nonmetro_poverty_avg = round(non_metro['Poverty'].mean(), 2)
nonmetro_poverty_avg
Out[42]:
17.34
In [43]:
# plot poverty distribution by metropolitan
# use only the first 700 rows of df (not a random sample) so the plot isn't crowded
sns.stripplot(x='Metropolitan', y='Poverty', data=df[:700], jitter=True, size=3)
sns.despine()
plt.legend(labels=['Non-Metro','Metro'], loc='upper center', frameon=False)
plt.tick_params(axis='x', bottom=False, labelbottom=False)
plt.title('Pct. Poverty, Metro vs. Non-Metro', fontweight='bold', loc='left', y=1.02)
plt.axhline(nonmetro_poverty_avg)
plt.axhline(metro_poverty_avg, color='orange')
plt.text(0.35, 18, 'avg. lines')
plt.savefig('sample_poverty_spread_metro_nonmetro.png')
plt.show()

sample_poverty_spread_metro_nonmetro

Poverty of Metro and Non-Metro Counties

In both metro and non-metro counties, the majority of counties have a poverty percentage between 10% and 20%. Metro counties are more concentrated in the two lowest bins (0-10% and 10-20%) than non-metro counties, while non-metro counties have higher proportions in every bin above 20%. The dotplot (a.k.a. stripplot) above helps us visualize and compare the poverty distribution among metropolitan and non-metropolitan counties.
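
The separate count and proportion series can also be lined up in a single table, which makes the bin-by-bin comparison easier to read. This is only a sketch: the toy frame below is made-up data standing in for df, and pd.crosstab is a swapped-in alternative to the value_counts calls used above.

```python
import pandas as pd

# Toy stand-in for df; these values are made up for illustration only
toy = pd.DataFrame({
    'Metropolitan': [1, 1, 1, 1, 0, 0, 0, 0],
    'Poverty': [8.0, 12.5, 15.0, 22.0, 9.0, 18.0, 25.0, 33.0],
})
poverty_bins = [0, 10, 20, 30, 40, 50]

# pd.cut builds right-closed intervals such as (0, 10],
# so a value of exactly 0 would fall outside every bin and become NaN
binned = pd.cut(toy['Poverty'], poverty_bins)

# normalize='index' makes each row (0 = non-metro, 1 = metro) sum to 1
prop_table = pd.crosstab(toy['Metropolitan'], binned, normalize='index')
print(prop_table.round(2))
```

Each row of the result is one group and each column is one poverty bin, so the proportions for metro and non-metro sit directly above one another.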

Looking at the average poverty percentage by metro and non-metro counties, we see that metro has 14.78% and non-metro has 17.34%. After observing these figures I looked for an outside source, and the image below, taken from the Center on Budget and Policy Priorities, shows the same pattern. Note the percentages in the 2015 column. The minor variation between my figures and theirs is likely because the two sets of figures are based on different samples, rather than on the population (all the data) itself.
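
One more possible source of the gap, and this is an assumption on my part rather than something the article states: metro['Poverty'].mean() gives every county equal weight, whereas a published poverty rate counts people. A minimal sketch with made-up numbers shows how far the two kinds of average can drift apart.

```python
import numpy as np

# Made-up counties: poverty percentage and total population (illustration only)
poverty_pct = np.array([10.0, 20.0, 30.0])
total_pop = np.array([900_000, 50_000, 50_000])

unweighted = poverty_pct.mean()                         # each county counts once
weighted = np.average(poverty_pct, weights=total_pop)   # each resident counts once
print(round(unweighted, 2), round(weighted, 2))  # 20.0 11.5
```

On the real data the weighted version would look something like np.average(metro['Poverty'], weights=metro['TotalPop']), though I have not run that here.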

2015_poverty

 

Interestingly, the same article notes that poverty is actually lower in non-metro areas under the Census Bureau’s Supplemental Poverty Measure (SPM), although it also points out that some analysts do not fully agree with the SPM approach:

The above poverty data, whether from the CPS or the ACS, reflect the official poverty measure, which doesn’t account for most government benefits or adjust for cost-of-living differences by geographic area. Under the official measure, poverty is much higher in non-metro areas than in metro areas. But under the Census Bureau’s Supplemental Poverty Measure (SPM), which accounts for most government benefits and adjusts for local cost-of-living differences, the poverty rate is actually lower in non-metro areas (13.2 percent) than in metro areas (14.5 percent). Although some analysts do not fully agree with the SPM’s approach to geographic adjustment, many analysts agree that either those adjustments or another, similar approach makes sense.

To find out more on the Supplemental Poverty Measure, click here.

Hypothesis Test of Average Poverty Between Metro and Non-Metro Counties

The question now arises, “Is the difference between the average poverty percentages of metro and non-metro counties statistically significant?” In other words, can we confidently say that non-metro counties have greater poverty on average, or could this simply be due to sampling? Following is a z-test to establish the significance of these results. I will be using a significance level (alpha) of 0.05. My null and alternative hypotheses are as follows:

H0 (null hypothesis): m1 = m2 (i.e. no significant difference between the means)

H1 (alternative hypothesis): m1 != m2 (i.e. significant difference between the means)

A two-sample z-test does not strictly require equal sample sizes, but I prefer to compare groups that are the same or roughly the same size. Since the non-metro data is almost twice as large as the metro data, I take a subset of the non-metro data that is equal in size to the metro data.
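
To make the test statistic less of a black box, here is a minimal sketch of a two-sample z-test computed by hand. The helper name two_sample_z and the sample values are made up for illustration. It uses the unpooled standard error, which gives the same statistic as a pooled-variance version whenever the two groups are the same size.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def two_sample_z(x1, x2):
    """Two-sided z-test for a difference in means (unpooled standard error)."""
    n1, n2 = len(x1), len(x2)
    # standard error of the difference, from the two sample variances (ddof=1)
    se = sqrt(variance(x1) / n1 + variance(x2) / n2)
    z = (mean(x1) - mean(x2)) / se
    # two-sided p-value from the standard normal distribution
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Made-up numbers purely for illustration (far too few for a real z-test)
a = [12.0, 14.5, 15.0, 13.2, 16.1, 14.8]
b = [17.0, 18.2, 16.5, 19.1, 17.7, 18.4]
z, p = two_sample_z(a, b)
print(z < 0, p < 0.05)  # True True
```

A negative statistic means the first group's mean is below the second group's mean, which is the direction to expect when metro is passed first and non-metro second.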

 

In [44]:
# view sample sizes
print('Metro sample size')
print(len(metro))
print('\n' + 'Non-metro sample size')
print(len(non_metro))
Metro sample size
767

Non-metro sample size
1316
In [45]:
# reduce non-metro to its first 767 rows so it equals metro in size (a slice, not a random draw)
non_metro_767 = non_metro[:767]
len(non_metro_767)
Out[45]:
767
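
Slicing with [:767] keeps the first 767 rows, so if the data is ordered by state (as the head of the frame suggests) the subset leans toward states early in the ordering. A random draw is one alternative. The sketch below uses a made-up toy frame, not the real non_metro data.

```python
import pandas as pd

# Toy stand-in for non_metro, ordered by state; values are made up
toy_non_metro = pd.DataFrame({
    'State': ['Alabama'] * 4 + ['Wisconsin'] * 4,
    'Poverty': [19.0, 21.0, 18.5, 20.0, 12.0, 13.5, 11.0, 12.5],
})

# slice: takes the first four rows, which are all from Alabama
first_rows = toy_non_metro[:4]

# random draw without replacement; random_state makes it repeatable
random_rows = toy_non_metro.sample(n=4, random_state=42)

print(first_rows['State'].unique())
print(len(random_rows))
```

On the real data the equivalent call would be something like non_metro.sample(n=767, random_state=42). The test results reported below come from the sliced subset, not from a random draw.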
In [46]:
# import ztest from statsmodels.stats.weightstats
from statsmodels.stats.weightstats import ztest

# assign variables
metro_poverty = metro['Poverty']
nonmetro767_poverty = non_metro_767['Poverty']

# run test
poverty_ztest = ztest(x1=metro_poverty, x2=nonmetro767_poverty)

# print results
print('Test Statistic')
print(poverty_ztest[0])
print('\n' + 'P-Value')
print('{:.25f}'.format(float(poverty_ztest[1])))
Test Statistic
-9.166777167988164

P-Value
0.0000000000000000000487372

Poverty Z-Test Results: Reject the Null Hypothesis

The p-value is well below the significance level of 0.05. This means we can reject the null hypothesis (no significant difference between the means) in favor of the alternative hypothesis (significant difference between the means). If we take the critical value approach, which uses the test statistic, we also reject the null hypothesis, as -9.17 is less than -1.96. To understand the critical value approach and how we arrive at -1.96/+1.96, check out p. 139 of this book (pdf will download after you click on the link).
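
The ±1.96 cutoff does not have to be looked up in a table. As a quick sketch, it can be recovered from the standard normal distribution with Python's standard library (the variable names here are my own):

```python
from statistics import NormalDist

alpha = 0.05
# two-sided test: put alpha / 2 in each tail of the standard normal
critical = NormalDist().inv_cdf(1 - alpha / 2)
print(round(critical, 2))  # 1.96

test_statistic = -9.166777167988164  # poverty z-test statistic printed above
print(abs(test_statistic) > critical)  # True
```

Comparing the absolute value of the statistic against the positive cutoff covers both tails at once.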

In other words, we can confidently say that the average percentage of poverty is greater in non-metro areas than in metro areas. Keep in mind, however, that this does not take into consideration the Supplemental Poverty Measure (SPM).

Crime Rate Frequencies

In [47]:
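# create Violent Crime Rate and Property Crime Rate bins
# and view frequencies by metro and non-metro
# note: pd.cut bins are right-closed, so a rate of exactly 0 falls outside (0, 50]
# and is left out of the counts and proportions below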
vcr_bins = [0,50,100,150,200,250,300,1000]
pcr_bins = [0,250,500,750,1000,1250,1500,6000]

# violent crime rate variables
metro_vcr_freq = pd.cut(metro['Violent Crime Rate'], vcr_bins).value_counts(sort=False)
metro_vcr_prop = pd.cut(metro['Violent Crime Rate'], vcr_bins).value_counts(sort=False, normalize=True)
nonmetro_vcr_freq = pd.cut(non_metro['Violent Crime Rate'], vcr_bins).value_counts(sort=False)
nonmetro_vcr_prop = pd.cut(non_metro['Violent Crime Rate'], vcr_bins).value_counts(sort=False, normalize=True)

# property crime rate variables
metro_pcr_freq = pd.cut(metro['Property Crime Rate'], pcr_bins).value_counts(sort=False)
metro_pcr_prop = pd.cut(metro['Property Crime Rate'], pcr_bins).value_counts(sort=False, normalize=True)
nonmetro_pcr_freq = pd.cut(non_metro['Property Crime Rate'], pcr_bins).value_counts(sort=False)
nonmetro_pcr_prop = pd.cut(non_metro['Property Crime Rate'], pcr_bins).value_counts(sort=False, normalize=True)

# metro count by violent crime rate bin
print('Metro count by violent crime rate bin')
metro_vcr_freq
Metro count by violent crime rate bin
Out[47]:
(0, 50]        292
(50, 100]      173
(100, 150]     104
(150, 200]      53
(200, 250]      27
(250, 300]      31
(300, 1000]     39
Name: Violent Crime Rate, dtype: int64
In [48]:
# metro proportion by violent crime rate bin
metro_vcr_prop
Out[48]:
(0, 50]        0.406120
(50, 100]      0.240612
(100, 150]     0.144645
(150, 200]     0.073713
(200, 250]     0.037552
(250, 300]     0.043115
(300, 1000]    0.054242
Name: Violent Crime Rate, dtype: float64
In [49]:
# average metro violent crime rate
metro_vcr_avg = round(metro['Violent Crime Rate'].mean(), 2)
metro_vcr_avg
Out[49]:
92.58
In [50]:
# non-metro count by violent crime rate bin
nonmetro_vcr_freq
Out[50]:
(0, 50]        428
(50, 100]      348
(100, 150]     171
(150, 200]     110
(200, 250]      61
(250, 300]      46
(300, 1000]     62
Name: Violent Crime Rate, dtype: int64
In [51]:
# non-metro proportion by violent crime rate bin
nonmetro_vcr_prop
Out[51]:
(0, 50]        0.349103
(50, 100]      0.283850
(100, 150]     0.139478
(150, 200]     0.089723
(200, 250]     0.049755
(250, 300]     0.037520
(300, 1000]    0.050571
Name: Violent Crime Rate, dtype: float64
In [52]:
# average non-metro violent crime rate
nonmetro_vcr_avg = round(non_metro['Violent Crime Rate'].mean(), 2)
nonmetro_vcr_avg
Out[52]:
100.89
In [53]:
# metro count by property crime rate bin
metro_pcr_freq
Out[53]:
(0, 250]        168
(250, 500]      171
(500, 750]      131
(750, 1000]      88
(1000, 1250]     54
(1250, 1500]     42
(1500, 6000]     75
Name: Property Crime Rate, dtype: int64
In [54]:
# metro proportions by property crime rate bin
metro_pcr_prop
Out[54]:
(0, 250]        0.230453
(250, 500]      0.234568
(500, 750]      0.179698
(750, 1000]     0.120713
(1000, 1250]    0.074074
(1250, 1500]    0.057613
(1500, 6000]    0.102881
Name: Property Crime Rate, dtype: float64
In [55]:
# average metro property crime rate
metro_pcr_avg = round(metro['Property Crime Rate'].mean(), 2)
metro_pcr_avg
Out[55]:
657.42
In [56]:
# non-metro count by property crime rate bin
nonmetro_pcr_freq
Out[56]:
(0, 250]        216
(250, 500]      326
(500, 750]      284
(750, 1000]     176
(1000, 1250]    119
(1250, 1500]     67
(1500, 6000]     93
Name: Property Crime Rate, dtype: int64
In [57]:
# non-metro proportions by property crime rate bin
nonmetro_pcr_prop
Out[57]:
(0, 250]        0.168618
(250, 500]      0.254489
(500, 750]      0.221702
(750, 1000]     0.137393
(1000, 1250]    0.092896
(1250, 1500]    0.052303
(1500, 6000]    0.072600
Name: Property Crime Rate, dtype: float64
In [58]:
# average non-metro property crime rate
nonmetro_pcr_avg = round(non_metro['Property Crime Rate'].mean(), 2)
nonmetro_pcr_avg
Out[58]:
676.4
In [59]:
# plot violent crime rate distribution by metropolitan
# use only the first 700 rows of df (not a random sample) so the plot isn't crowded
sns.stripplot(x='Metropolitan', y='Violent Crime Rate', data=df[:700], jitter=True, size=3)
sns.despine()
plt.legend(labels=['Non-Metro','Metro'], loc='upper center', frameon=False)
plt.tick_params(axis='x', bottom=False, labelbottom=False)
plt.title('Violent Crime Rate, Metro vs. Non-Metro', fontweight='bold', loc='left', y=1.02)
plt.axhline(nonmetro_vcr_avg)
plt.axhline(metro_vcr_avg, color='orange')
plt.text(0.35, 175, 'avg. lines')
plt.savefig('sample_vcr_spread_metro_nonmetro.png')
plt.show()

sample_vcr_spread_metro_nonmetro

In [60]:

# plot property crime rate distribution by metropolitan
# use only the first 700 rows of df (not a random sample) so the plot isn't crowded
sns.stripplot(x='Metropolitan', y='Property Crime Rate', data=df[:700], jitter=True, size=3)
sns.despine()
plt.legend(labels=['Non-Metro','Metro'], loc='upper center', frameon=False)
plt.tick_params(axis='x', bottom=False, labelbottom=False)
plt.title('Property Crime Rate, Metro vs. Non-Metro', fontweight='bold', loc='left', y=1.02)
plt.axhline(nonmetro_pcr_avg)
plt.axhline(metro_pcr_avg, color='orange')
plt.text(0.35, 900, 'avg. lines')
plt.savefig('sample_pcr_spread_metro_nonmetro.png')
plt.show()

sample_pcr_spread_metro_nonmetro

Hypothesis Test of Average Violent Crime and Property Crime Rates Between Metro and Non-Metro

Just like the hypothesis test performed on average poverty between metro and non-metro counties, I want to perform a test of average violent crime rate between metro and non-metro counties and average property crime rate between metro and non-metro counties. A significance level of 0.05 will be utilized. My hypotheses are as follows:

H0 (null hypothesis): m1 = m2 (no significant difference between the means)

H1 (alternative hypothesis): m1 != m2 (significant difference between the means)

These hypothesis statements apply to both violent crime and property crime. First, we’ll run a hypothesis test on the average violent crime rate, then we’ll run a hypothesis test on the average property crime rate. I will again use the non_metro_767 subset for these tests.

 

In [61]:
# assign variables
metro_vcr = metro['Violent Crime Rate']
nonmetro767_vcr = non_metro_767['Violent Crime Rate']

# run test
vcr_ztest = ztest(x1=metro_vcr, x2=nonmetro767_vcr)

# print results
print('Test Statistic')
print(vcr_ztest[0])
print('\n' + 'P-Value')
print('{:.25f}'.format(float(vcr_ztest[1])))
Test Statistic
-1.7954873412304946

P-Value
0.0725760878391805974718665
In [62]:
# assign variables
metro_pcr = metro['Property Crime Rate']
nonmetro767_pcr = non_metro_767['Property Crime Rate']

# run test
pcr_ztest = ztest(x1=metro_pcr, x2=nonmetro767_pcr)

# print results
print('Test Statistic')
print(pcr_ztest[0])
print('\n' + 'P-Value')
print('{:.25f}'.format(float(pcr_ztest[1])))
Test Statistic
0.2762800224241633

P-Value
0.7823329987879028557529182

Crime Rate Z-Test Results: Fail to Reject the Null Hypothesis

For the z-test on the average violent crime rates between metro and non-metro counties, we fail to reject the null hypothesis, as the p-value is not less than the significance level of 0.05. Recall that the null hypothesis says that the means of the two groups are equal. Failing to reject it does not necessarily mean that they are equal, only that we do not have statistically significant evidence to say that they are not.

For the z-test on the average property crime rates between metro and non-metro counties, we also fail to reject the null hypothesis, as the p-value is not less than the significance level of 0.05.
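
The decision rule used for all three tests is the same comparison of a p-value against alpha. As a small sketch, with the p-values copied from the outputs above and dictionary names of my own choosing:

```python
alpha = 0.05

# p-values copied from the three z-test outputs above
p_values = {
    'poverty': 4.87372e-20,
    'violent crime rate': 0.0725760878391806,
    'property crime rate': 0.7823329987879029,
}

# reject H0 only when the p-value is below alpha
decisions = {
    name: ('reject H0' if p < alpha else 'fail to reject H0')
    for name, p in p_values.items()
}
for name, decision in decisions.items():
    print(f'{name}: {decision}')
```

This prints "reject H0" for poverty and "fail to reject H0" for both crime rates, matching the conclusions above.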

Average of Key Variables by State

In [63]:
# create subset of df
sub1 = df[['State','County','Poverty','ChildPoverty','Violent Crime Rate','Property Crime Rate','Employee Rate']]

# create state group
state_group = sub1.groupby(by='State')

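# generate average of variables by state
# numeric_only=True skips the text column (County), which newer pandas versions refuse to average
state_group.mean(numeric_only=True)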
Out[63]:
State             Poverty  ChildPoverty  Violent Crime Rate  Property Crime Rate  Employee Rate
Alabama         19.276190     28.938095           93.122381           593.259048     126.026667
Arizona         22.053846     30.353846           89.080000           517.714615     141.968462
Arkansas        20.950769     30.466154          137.573846           795.147538     154.118769
California      16.731579     21.750877          163.923684           643.353684     195.576667
Florida         18.570833     26.645833          273.318958          1264.700208     268.823750
Georgia         21.707609     30.885870          117.764457          1074.836522     251.898804
Idaho           16.193182     20.859091           81.825682           553.476364     209.138409
Illinois        14.416216     19.432432           42.712703           317.049459     117.218378
Indiana         14.050000     20.125000           51.581250           546.482813     133.532813
Iowa            11.608571     15.101429           68.187000           404.934429     150.327143
Kansas          12.514286     16.712245           89.974490           650.948367     243.710612
Kentucky        21.886170     30.303191           20.574468           308.340638      59.079255
Louisiana       20.717857     29.432143          222.970714          1409.377857     456.802500
Maine           15.100000     20.881250           25.546250           348.525000      42.243750
Maryland        10.705000     15.390000           93.515500           704.463000     116.586000
Michigan        16.831646     24.562025           88.524430           529.988861     121.052405
Minnesota       11.712791     15.559302           60.933953           529.481395     179.754419
Mississippi     25.020000     35.866667          120.957333          1035.460667     186.210667
Missouri        18.014019     25.067290          123.214860           667.631402     130.196355
Montana         15.546809     21.548936          176.247021           770.188723     263.162553
Nebraska        11.702985     15.995522           49.390149           447.891791     178.029403
Nevada          14.353333     20.520000          260.056667          1133.116667     435.026667
New Hampshire    8.975000     11.925000            2.397500            36.310000      31.085000
New Jersey      10.533333     15.233333            0.903333             0.624286      72.410476
New Mexico      23.700000     30.360000           66.322000           356.298000     129.672000
New York        14.325000     20.647917           28.695208           316.258750      73.336875
North Carolina  19.167164     28.161194          106.083881          1108.745373     192.920149
North Dakota    11.461538     13.694231           59.330192           532.614423     210.322885
Ohio            15.787805     22.841463           42.216585           661.170976     101.434146
Oklahoma        17.372603     23.901370           66.295068           548.222603     143.983014
Oregon          17.948000     25.188000           68.830800           657.871600     158.754400
Pennsylvania    12.446667     17.450000            0.352667             0.039667      20.288000
South Carolina  20.702632     31.002632          305.334211          1687.568684     182.668158
South Dakota    14.397619     16.195238           52.795952           283.746190     147.221190
Tennessee       19.484444     27.572222          176.594111           916.266667     232.742667
Texas           17.040957     24.421809           99.775638           703.015319     282.807553
Utah            13.344000     16.064000           77.743200           679.028800     389.619600
Vermont         11.650000     16.260000            9.915000           118.146000      36.171000
Virginia        13.903529     19.580000           98.041529           859.470941     211.397882
Washington      16.270270     21.608108           75.273784           956.304595     128.849730
West Virginia   18.144828     24.934483          131.048966           364.448966      83.771379
Wisconsin       12.563380     17.356338           50.262113           524.330282     178.549859

 

 
