 # Data Analysis & Interpretation 1.4: More Data Visualizations and Z-Tests

## Week 4

This week's focus is plotting univariate and bivariate graphs.

STEP 1: Create graphs of your variables one at a time (univariate graphs).

Examine both their center and spread.

STEP 2: Create a graph showing the association between your explanatory and response variables (bivariate graph).

Your output should be interpretable (i.e. organized and labeled).

I have already performed several such visualizations during weeks 2 and 3. If you have not done so already, I encourage you to skim through the material for those weeks. Those graphs certainly fulfill the requirements of step 1. That being said, I will display summary stats of those variables for this week. Further, at least part of step 2 was fulfilled during my week 3 analysis; however, I will perform additional graphing of bivariate data for this week.

## Summary Stats of Key Variables

In :
```
# view summary stats of key variables
sub1.describe()
```
Out:
```
           Poverty  ChildPoverty  Violent Crime Rate  Property Crime Rate  Employee Rate
count  2083.000000   2083.000000         2083.000000          2083.000000    2083.000000
mean     16.400192     22.834181           97.831114           669.411407     182.448320
std       6.125008      9.642736          108.288420           553.196492     167.197381
min       1.400000      0.000000            0.000000             0.000000      10.940000
25%      12.000000     16.100000           27.240000           281.565000      97.145000
50%      15.900000     22.300000           64.850000           551.620000     152.970000
75%      20.000000     28.900000          130.490000           915.810000     223.210000
max      45.600000     63.800000         1007.450000          5784.060000    3221.650000
```
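
One quick way to read center and spread off the table above is to compare each mean with its median and to compute the interquartile range (IQR). The sketch below does this with values copied by hand from the `describe()` output. The `stats` dictionary and its layout are illustrative assumptions, not part of the original notebook.

```python
# Values copied from the describe() output above: (mean, 25%, 50%, 75%).
# The dictionary layout is an illustrative assumption, not the notebook's code.
stats = {
    'Poverty':             (16.400192, 12.0, 15.9, 20.0),
    'ChildPoverty':        (22.834181, 16.1, 22.3, 28.9),
    'Violent Crime Rate':  (97.831114, 27.24, 64.85, 130.49),
    'Property Crime Rate': (669.411407, 281.565, 551.62, 915.81),
    'Employee Rate':       (182.448320, 97.145, 152.97, 223.21),
}

summary = {}
for name, (mean, q1, median, q3) in stats.items():
    iqr = q3 - q1        # spread of the middle 50% of counties
    gap = mean - median  # a positive gap hints at a right-skewed distribution
    summary[name] = (round(iqr, 3), round(gap, 3))
    print('{:<20} IQR = {:>8.3f}   mean - median = {:>7.3f}'.format(name, iqr, gap))
```

Every variable has a mean above its median, and the gap is largest relative to the IQR for the crime and employee rates, which is consistent with the long right tails implied by their maximum values.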

## Correlations Between Key Variables

In :
```
# plot correlations of key variables
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(sub1.drop('State', axis=1), plot_kws={'color':'steelblue','lw':0,'alpha':0.5},
             diag_kws={'color':'darkgray','edgecolor':'white','lw':0.25})
plt.savefig('pairplot_key_variables.png')
```

## Stats by Poverty Group

In :
```
# add poverty groups (<=16% or >16%) to sub1
def poverty_group(row):
    if row['Poverty'] <= 16:
        return '<= 16%'
    else:
        return '> 16%'

sub1['Poverty Group'] = sub1.apply(poverty_group, axis=1)

# check
sub1[['County','Poverty','Poverty Group']].head()
```
Out:
```
     County  Poverty Poverty Group
0   Baldwin     13.4        <= 16%
1      Bibb     16.8         > 16%
2   Bullock     24.6         > 16%
3      Clay     16.7         > 16%
4  Cleburne     17.0         > 16%
```

In :

```
# how many counties are represented in each poverty group?
poverty_group_count = sub1['Poverty Group'].value_counts(sort=False)
poverty_group_count
```
Out:
```
> 16%     1008
<= 16%    1075
Name: Poverty Group, dtype: int64
```

[Inputs 68-69 have been removed as irrelevant.]

In :
```
# plot avg. violent crime rate by poverty group
# note: seaborn 0.9+ renamed factorplot to catplot
sns.factorplot(x='Poverty Group', y='Violent Crime Rate', data=sub1, kind='bar', ci=False, color='steelblue')
plt.title('Avg. Violent Crime Rate by Grouping', loc='left', fontweight='bold', y=1.02)
plt.savefig('avg_vcr_grouping.png')
```

## Hypothesis Test of Average Violent Crime Rate by Poverty Group

I want to determine whether or not the difference in averages between these two poverty groups is statistically significant. I will use a significance level (alpha) of 0.05. My hypothesis statements, where m1 and m2 are the mean violent crime rates of the <= 16% and > 16% poverty groups, are as follows:

H0 (null hypothesis): m1 = m2

H1 (alternative hypothesis): m1 != m2

In :
```
from statsmodels.stats.weightstats import ztest

# assign variables
group1_vcr = sub1[sub1['Poverty Group']=='<= 16%']['Violent Crime Rate']
group2_vcr = sub1[sub1['Poverty Group']=='> 16%']['Violent Crime Rate']

# run test; ztest returns a (test statistic, p-value) tuple
vcr_group_ztest = ztest(x1=group1_vcr, x2=group2_vcr)

# print results
print('Test Statistic')
print(vcr_group_ztest[0])
print('\n' + 'P-Value')
print('{:.25f}'.format(float(vcr_group_ztest[1])))
```
Out:
```
Test Statistic
-9.89859073599695

P-Value
0.0000000000000000000000422
```

## Violent Crime Rate by Poverty Group Z-Test Results: Reject the Null Hypothesis

Since the p-value is well below 0.05, we reject the null hypothesis. The difference in means between poverty groups for violent crime rate is statistically significant. Recall that the previous hypothesis test, run on the average violent crime rate of metro versus non-metro counties, did not find a statistically significant difference. This suggests that violent crime rate is more strongly associated with poverty than with metro/non-metro status.
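
For readers who want to see what a two-sample z-test computes, here is a minimal standard-library sketch using a pooled variance estimate and a two-sided normal p-value. The function name `two_sample_ztest` and the toy samples are hypothetical. This is the textbook formula, which I assume approximates the library call above; it is not a reproduction of its internals.

```python
import math
import statistics

def two_sample_ztest(x1, x2):
    """Textbook two-sample z-test with pooled variance (illustrative sketch)."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = statistics.mean(x1), statistics.mean(x2)
    # sample variances (n - 1 in the denominator)
    var1, var2 = statistics.variance(x1), statistics.variance(x2)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (mean1 - mean2) / std_err
    # two-sided p-value under the standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# toy data, not taken from the county dataset
low = [1, 2, 3, 4, 5]
high = [3, 4, 5, 6, 7]
z, p = two_sample_ztest(low, high)
print(z, p)
```

A negative statistic, as in the analysis above, simply means the first group's mean is the smaller of the two.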

In :
```
# plot avg. property crime rate by poverty group
sns.factorplot(x='Poverty Group', y='Property Crime Rate', data=sub1, kind='bar', ci=False, color='steelblue')
plt.title('Avg. Property Crime Rate by Grouping', loc='left', fontweight='bold', y=1.02)
plt.savefig('avg_pcr_grouping.png')
```

## Hypothesis Test of Average Property Crime Rate by Poverty Group

I want to determine whether or not the difference in averages between these two poverty groups is statistically significant. I will use a significance level (alpha) of 0.05. My hypothesis statements, where m1 and m2 are the mean property crime rates of the <= 16% and > 16% poverty groups, are as follows:

H0 (null hypothesis): m1 = m2

H1 (alternative hypothesis): m1 != m2

In :
```
# assign variables
group1_pcr = sub1[sub1['Poverty Group']=='<= 16%']['Property Crime Rate']
group2_pcr = sub1[sub1['Poverty Group']=='> 16%']['Property Crime Rate']

# run test; ztest returns a (test statistic, p-value) tuple
pcr_group_ztest = ztest(x1=group1_pcr, x2=group2_pcr)

# print results
print('Test Statistic')
print(pcr_group_ztest[0])
print('\n' + 'P-Value')
print('{:.25f}'.format(float(pcr_group_ztest[1])))
```
Out:
```
Test Statistic
-7.8852829078641635

P-Value
0.0000000000000031382371909
```

## Property Crime Rate by Poverty Group Z-Test Results: Reject the Null Hypothesis

Since the p-value is less than 0.05, we reject the null hypothesis. The difference in means between poverty groups for property crime rate is statistically significant.

## Employee Rate by Violent Crime and Property Crime Groups

In :
```
# add violent crime rate groups (split by median) to sub1
vcr_med = sub1['Violent Crime Rate'].median()

def vcr_group(row):
    if row['Violent Crime Rate'] <= vcr_med:
        return 'lower half'
    else:
        return 'upper half'

sub1['Violent Crime Group'] = sub1.apply(vcr_group, axis=1)

# check
print(vcr_med)
print(sub1[['County','Violent Crime Rate','Violent Crime Group']].head(10))
```
Out:
```
64.85
      County  Violent Crime Rate Violent Crime Group
0    Baldwin               58.94          lower half
1       Bibb               30.97          lower half
2    Bullock              196.67          upper half
3       Clay               66.48          upper half
4   Cleburne               39.99          lower half
5  Covington               95.02          upper half
6       Dale               68.18          upper half
7     Dallas              192.15          upper half
8     Elmore              115.15          upper half
9     Etowah              108.90          upper half
```
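
The median split can be checked by hand against the rows printed above. The sketch below applies the same rule to those ten rates using the reported median of 64.85. The `rates` dictionary is copied from the output, and the `label` helper is a hypothetical stand-in for `vcr_group`.

```python
# Violent crime rates for the ten counties printed above
rates = {
    'Baldwin': 58.94, 'Bibb': 30.97, 'Bullock': 196.67, 'Clay': 66.48,
    'Cleburne': 39.99, 'Covington': 95.02, 'Dale': 68.18, 'Dallas': 192.15,
    'Elmore': 115.15, 'Etowah': 108.90,
}
vcr_med = 64.85  # median of the full dataset, as printed above

def label(rate, cutoff):
    # values equal to the cutoff fall in the lower half, matching the <= rule
    return 'lower half' if rate <= cutoff else 'upper half'

groups = {county: label(rate, vcr_med) for county, rate in rates.items()}
for county, group in groups.items():
    print('{:<10} {:>7.2f}  {}'.format(county, rates[county], group))
```

Because ties go to the lower half, the two groups need not be exactly equal in size when several counties share the median value.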

In :

```
# add property crime rate groups (split by median) to sub1
pcr_med = sub1['Property Crime Rate'].median()

def pcr_group(row):
    if row['Property Crime Rate'] <= pcr_med:
        return 'lower half'
    else:
        return 'upper half'

sub1['Property Crime Group'] = sub1.apply(pcr_group, axis=1)

# check
print(pcr_med)
print(sub1[['County','Property Crime Rate','Property Crime Group']].head(10))
```
Out:
```
551.62
      County  Property Crime Rate Property Crime Group
0    Baldwin               332.10           lower half
1       Bibb               181.38           lower half
2    Bullock               486.98           lower half
3       Clay               110.81           lower half
4   Cleburne               953.21           upper half
5  Covington               612.36           upper half
6       Dale               419.12           lower half
7     Dallas              1060.40           upper half
8     Elmore              1412.78           upper half
9     Etowah               540.64           lower half
```

In :

```
# plot avg. employee rate by violent crime group
sns.factorplot(x='Violent Crime Group', y='Employee Rate', data=sub1, kind='bar', ci=False, color='steelblue')
plt.title('Avg. Employee Rate by Grouping', loc='left', fontweight='bold', y=1.02)
plt.savefig('avg_er_vcr_grouping.png')
```

## Hypothesis Test of Average Law Enforcement Employee Rate by Violent Crime Group

I want to determine whether or not the difference in means between these two violent crime rate groups is statistically significant. I will use a significance level (alpha) of 0.05. My hypothesis statements, where m1 and m2 are the mean employee rates of the lower-half and upper-half violent crime groups, are as follows:

H0 (null hypothesis): m1 = m2

H1 (alternative hypothesis): m1 != m2

In :
```
# assign variables
group1_vcr_er = sub1[sub1['Violent Crime Group']=='lower half']['Employee Rate']
group2_vcr_er = sub1[sub1['Violent Crime Group']=='upper half']['Employee Rate']

# run test; ztest returns a (test statistic, p-value) tuple
vcr_er_group_ztest = ztest(x1=group1_vcr_er, x2=group2_vcr_er)

# print results
print('Test Statistic')
print(vcr_er_group_ztest[0])
print('\n' + 'P-Value')
print('{:.25f}'.format(float(vcr_er_group_ztest[1])))
```
Out:
```
Test Statistic
-10.959943777874312

P-Value
0.0000000000000000000000000
```

## Law Enforcement Employee Rate by Violent Crime Rate Group Z-Test Results: Reject the Null Hypothesis

Since the p-value is well below 0.05, we reject the null hypothesis. The difference between the means of the two groups is statistically significant.
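
A p-value that prints as all zeros is not literally zero; it is just smaller than 25 decimal places can show. As a rough check, the sketch below recomputes a two-sided p-value from the reported test statistic with the standard normal tail formula and prints it in scientific notation. The variable names are illustrative, and the result is an approximation rather than the library's own output.

```python
import math

z = -10.959943777874312  # test statistic reported above

# two-sided p-value under the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

fixed = '{:.25f}'.format(p_value)  # fixed-point, as in the notebook
sci = '{:.3e}'.format(p_value)     # scientific notation keeps the magnitude

print(fixed)
print(sci)
```

Scientific notation is usually the safer choice for reporting very small p-values, since it never rounds a nonzero value down to zero.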

In :
```
# plot avg. employee rate by property crime group
sns.factorplot(x='Property Crime Group', y='Employee Rate', data=sub1, kind='bar', ci=False, color='steelblue')
plt.title('Avg. Employee Rate by Grouping', loc='left', fontweight='bold', y=1.02)
plt.savefig('avg_er_pcr_grouping.png')
```

## Hypothesis Test of Average Law Enforcement Employee Rate by Property Crime Group

I want to determine whether or not the difference in means between these two property crime rate groups is statistically significant. I will use a significance level (alpha) of 0.05. My hypothesis statements, where m1 and m2 are the mean employee rates of the lower-half and upper-half property crime groups, are as follows:

H0 (null hypothesis): m1 = m2

H1 (alternative hypothesis): m1 != m2

In :
```
# assign variables
group1_pcr_er = sub1[sub1['Property Crime Group']=='lower half']['Employee Rate']
group2_pcr_er = sub1[sub1['Property Crime Group']=='upper half']['Employee Rate']

# run test; ztest returns a (test statistic, p-value) tuple
pcr_er_group_ztest = ztest(x1=group1_pcr_er, x2=group2_pcr_er)

# print results
print('Test Statistic')
print(pcr_er_group_ztest[0])
print('\n' + 'P-Value')
print('{:.25f}'.format(float(pcr_er_group_ztest[1])))
```
Out:
```
Test Statistic
-12.87329802934987

P-Value
0.0000000000000000000000000
```

## Law Enforcement Employee Rate by Property Crime Rate Group Z-Test Results: Reject the Null Hypothesis

Since the p-value is well below 0.05, we reject the null hypothesis. The difference between the means of the two groups is statistically significant.

## Summary of Analysis

This week we saw that, on average, counties with a higher percentage of poverty have higher violent crime and property crime rates. We also saw that, on average, counties with higher crime rates have higher rates of employment for full-time law enforcement.
