Data Analysis & Interpretation 1.2: Reading In and Cleaning Data; Visualizing Distributions

Data Management & Visualization

This is the first course in the Data Analysis & Interpretation Specialization program provided by Wesleyan University on Coursera. This first course is focused on using and developing a data codebook, formulating research questions and hypotheses, selecting key variables from a dataset, and exploring those variables through frequency tables, visuals, and statistical analysis. To view the full series of blog posts associated with this program, click here.

Week 1

This week I read in the data, clean it, and explore the relevant variables with frequency tables. Because most of the variables are continuous, I also include histograms and boxplots to better understand their distributions.

 

In [1]:
# setup environment
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
# read in datasets
census = pd.read_csv('acs2015_county_data.csv')
crime = pd.read_csv('combined_states_crime.csv')
employees = pd.read_csv('full_time_le_employees.csv')
In [3]:
# summary of census
census.head()
Out[3]:
CensusId State County TotalPop Men Women Hispanic White Black Native Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment
0 1001 Alabama Autauga 55221 26745 28476 2.6 75.8 18.5 0.4 0.5 1.3 1.8 26.5 23986 73.6 20.9 5.5 0.0 7.6
1 1003 Alabama Baldwin 195121 95314 99807 4.5 83.1 9.5 0.6 1.0 1.4 3.9 26.4 85953 81.5 12.3 5.8 0.4 7.5
2 1005 Alabama Barbour 26932 14497 12435 4.6 46.2 46.7 0.2 1.8 1.5 1.6 24.1 8597 71.8 20.8 7.3 0.1 17.6
3 1007 Alabama Bibb 22604 12073 10531 2.2 74.5 21.4 0.4 0.6 1.5 0.7 28.8 8294 76.8 16.1 6.7 0.4 8.3
4 1009 Alabama Blount 57710 28512 29198 8.6 87.9 1.5 0.3 0.9 0.4 2.3 34.9 22189 82.0 13.5 4.2 0.4 7.7

5 rows × 37 columns

 

In [4]:
print('Census data has ' + str(census.shape[0]) + ' rows and ' + str(census.shape[1]) + ' columns.')
Census data has 3220 rows and 37 columns.
In [5]:
# summary of crime
crime.head()
Out[5]:
State County Violent Crime Murder and Nonnegligent Manslaughter Rape (revised) Rape (legacy) Robbery Aggravated Assault Property Crime Burglary Larceny-Theft Motor Vehicle Theft Arson
0 Alabama Autauga 69 0 13.0 NaN 6 50 344 111 187 46 NaN
1 Alabama Baldwin 115 0 9.0 NaN 33 73 648 225 390 33 NaN
2 Alabama Bibb 7 0 2.0 NaN 1 4 41 20 18 3 NaN
3 Alabama Blount 204 5 16.0 NaN 5 178 832 247 503 82 NaN
4 Alabama Calhoun 16 0 4.0 NaN 1 11 413 181 225 7 NaN
In [6]:
print('Crime data has ' + str(crime.shape[0]) + ' rows and ' + str(crime.shape[1]) + ' columns.')
Crime data has 2587 rows and 13 columns.
In [7]:
# summary of employees
employees.head()
Out[7]:
State Metropolitan County Total Employees Officers Civilians
0 Alabama 1 Baldwin 288 109 179
1 Alabama 1 Bibb 12 11 1
2 Alabama 1 Colbert 47 31 16
3 Alabama 1 Elmore 105 48 57
4 Alabama 1 Etowah 161 67 94
In [8]:
print('Employees data has ' + str(employees.shape[0]) + ' rows and ' + str(employees.shape[1]) + ' columns.')
Employees data has 2359 rows and 6 columns.

Merging the Datasets

I need to merge the three datasets into one. Each dataset contains state and county columns, so those are the fields to merge on. However, each dataset has a different number of rows, and just from the summaries above we can see that they don't all cover the same counties. I will use an inner join, so the result contains only the rows where all three datasets match on state and county. Some data will be lost (the census data, for example, includes counties that are absent from the other two datasets), but the result is still a large sample of the overall population.
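Before committing to the inner join, it can help to see how many rows from each side actually match. A minimal sketch on toy stand-in frames (not the real data): passing `indicator=True` to an outer merge adds a `_merge` column that flags where each row came from, and only the `'both'` rows would survive an inner join.

```python
import pandas as pd

# toy stand-ins for the real census and crime tables (illustrative values)
census = pd.DataFrame({'State': ['Alabama', 'Alabama', 'Alaska'],
                       'County': ['Autauga', 'Baldwin', 'Anchorage'],
                       'TotalPop': [55221, 195121, 298695]})
crime = pd.DataFrame({'State': ['Alabama', 'Alabama'],
                      'County': ['Baldwin', 'Bibb'],
                      'Violent Crime': [115, 7]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
check = census.merge(crime, on=['State', 'County'], how='outer', indicator=True)
print(check['_merge'].value_counts())
```

Here only Baldwin appears in both frames, so an inner join would keep one row and drop the other three.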

 

In [9]:
# merge census and crime
data = census.merge(crime, on=['State','County'])

# merge data and employees
data = data.merge(employees, on=['State','County'])
In [10]:
# summary of data
data.head()
Out[10]:
CensusId State County TotalPop Men Women Hispanic White Black Native Aggravated Assault Property Crime Burglary Larceny-Theft Motor Vehicle Theft Arson Metropolitan Total Employees Officers Civilians
0 1003 Alabama Baldwin 195121 95314 99807 4.5 83.1 9.5 0.6 73 648 225 390 33 NaN 1 288 109 179
1 1007 Alabama Bibb 22604 12073 10531 2.2 74.5 21.4 0.4 4 41 20 18 3 NaN 1 12 11 1
2 1011 Alabama Bullock 10678 5660 5018 4.4 22.2 70.7 1.2 13 52 34 17 1 NaN 0 13 6 7
3 1027 Alabama Clay 13537 6671 6866 3.2 79.9 14.4 0.7 9 15 4 7 4 NaN 0 25 9 16
4 1029 Alabama Cleburne 15002 7334 7668 2.3 92.5 2.9 0.2 4 143 50 90 3 NaN 0 27 11 16

5 rows × 52 columns

 

In [11]:
print('Data has ' + str(data.shape[0]) + ' rows and ' + str(data.shape[1]) + ' columns.')
Data has 2086 rows and 52 columns.
In [12]:
# view columns
data.columns
Out[12]:
Index(['CensusId', 'State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic',
       'White', 'Black', 'Native', 'Asian', 'Pacific', 'Citizen', 'Income',
       'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
       'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
       'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
       'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
       'SelfEmployed', 'FamilyWork', 'Unemployment', 'Violent Crime',
       'Murder and Nonnegligent Manslaughter', 'Rape (revised)',
       'Rape (legacy)', 'Robbery', 'Aggravated Assault', 'Property Crime',
       'Burglary', 'Larceny-Theft', 'Motor Vehicle Theft', 'Arson',
       'Metropolitan', 'Total Employees', 'Officers', 'Civilians'],
      dtype='object')

Only Keeping Relevant Variables

I don’t need all of these variables for my analysis. Although I don’t plan to use every variable from the census data, many of them could be useful for subsetting or grouping calculations, so I will keep them. From the crime data I only need ‘Violent Crime’ and ‘Property Crime’, and from the employees data I only need ‘Metropolitan’ (potentially useful for segmentation) and ‘Total Employees’.

 

In [13]:
# remove unnecessary variables
keep_cols = ['CensusId', 'State','County','TotalPop','Men','Women','Hispanic','White','Black','Native', 
             'Asian', 'Pacific','Citizen','Income','IncomeErr', 'IncomePerCap','IncomePerCapErr','Poverty', 
             'ChildPoverty','Professional','Service', 'Office','Construction','Production','Drive','Carpool', 
             'Transit','Walk','OtherTransp','WorkAtHome','MeanCommute','Employed','PrivateWork','PublicWork', 
             'SelfEmployed','FamilyWork','Unemployment','Violent Crime','Property Crime','Metropolitan','Total Employees']

df = data[keep_cols].copy()
In [14]:
# summary of df
df.head()
Out[14]:
CensusId State County TotalPop Men Women Hispanic White Black Native Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment Violent Crime Property Crime Metropolitan Total Employees
0 1003 Alabama Baldwin 195121 95314 99807 4.5 83.1 9.5 0.6 85953 81.5 12.3 5.8 0.4 7.5 115 648 1 288
1 1007 Alabama Bibb 22604 12073 10531 2.2 74.5 21.4 0.4 8294 76.8 16.1 6.7 0.4 8.3 7 41 1 12
2 1011 Alabama Bullock 10678 5660 5018 4.4 22.2 70.7 1.2 3865 79.5 15.1 5.4 0.0 18.0 21 52 0 13
3 1027 Alabama Clay 13537 6671 6866 3.2 79.9 14.4 0.7 5180 77.5 14.7 7.8 0.0 9.4 9 15 0 25
4 1029 Alabama Cleburne 15002 7334 7668 2.3 92.5 2.9 0.2 6065 76.3 15.3 8.4 0.0 8.3 6 143 0 27

5 rows × 41 columns

 

In [15]:
# how many nulls are there?
df.isna().sum()
Out[15]:
CensusId           0
State              0
County             0
TotalPop           0
Men                0
Women              0
Hispanic           0
White              0
Black              0
Native             0
Asian              0
Pacific            0
Citizen            0
Income             1
IncomeErr          1
IncomePerCap       0
IncomePerCapErr    0
Poverty            0
ChildPoverty       0
Professional       0
Service            0
Office             0
Construction       0
Production         0
Drive              0
Carpool            0
Transit            0
Walk               0
OtherTransp        0
WorkAtHome         0
MeanCommute        0
Employed           0
PrivateWork        0
PublicWork         0
SelfEmployed       0
FamilyWork         0
Unemployment       0
Violent Crime      0
Property Crime     2
Metropolitan       0
Total Employees    0
dtype: int64

Dealing With Nulls

Since so few rows contain nulls, I will simply drop those rows.

 

In [16]:
# remove rows with at least one null value
df = df.dropna()
print('The final dataset has ' + str(df.shape[0]) + ' rows (i.e. counties) and ' + str(df.shape[1]) + ' columns.')
The final dataset has 2083 rows (i.e. counties) and 41 columns.

Data Types

Some variables have a data type of object when they should be integers. These columns contain values with commas (e.g. '1,234'), which forces pandas to read them as strings. I need to remove the commas by replacing them with an empty string ('') and then convert the columns to int.

 

In [17]:
# display dataset info (note the data types)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2083 entries, 0 to 2085
Data columns (total 41 columns):
CensusId           2083 non-null int64
State              2083 non-null object
County             2083 non-null object
TotalPop           2083 non-null int64
Men                2083 non-null int64
Women              2083 non-null int64
Hispanic           2083 non-null float64
White              2083 non-null float64
Black              2083 non-null float64
Native             2083 non-null float64
Asian              2083 non-null float64
Pacific            2083 non-null float64
Citizen            2083 non-null int64
Income             2083 non-null float64
IncomeErr          2083 non-null float64
IncomePerCap       2083 non-null int64
IncomePerCapErr    2083 non-null int64
Poverty            2083 non-null float64
ChildPoverty       2083 non-null float64
Professional       2083 non-null float64
Service            2083 non-null float64
Office             2083 non-null float64
Construction       2083 non-null float64
Production         2083 non-null float64
Drive              2083 non-null float64
Carpool            2083 non-null float64
Transit            2083 non-null float64
Walk               2083 non-null float64
OtherTransp        2083 non-null float64
WorkAtHome         2083 non-null float64
MeanCommute        2083 non-null float64
Employed           2083 non-null int64
PrivateWork        2083 non-null float64
PublicWork         2083 non-null float64
SelfEmployed       2083 non-null float64
FamilyWork         2083 non-null float64
Unemployment       2083 non-null float64
Violent Crime      2083 non-null object
Property Crime     2083 non-null object
Metropolitan       2083 non-null int64
Total Employees    2083 non-null object
dtypes: float64(27), int64(9), object(5)
memory usage: 683.5+ KB
In [18]:
# remove commas from Violent Crime, Property Crime, and Total Employees
# convert to integers
df['Violent Crime'] = df['Violent Crime'].str.replace(',', '').astype(int)
df['Property Crime'] = df['Property Crime'].str.replace(',', '').astype(int)
df['Total Employees'] = df['Total Employees'].str.replace(',', '').astype(int)

Adding Rates

I need to calculate the violent crime rate and property crime rate for each county. I also want to calculate the full-time law enforcement employees rate for each county. Each rate is computed by dividing the count (crimes, or employees for the employee rate) by the county population and multiplying the quotient by 100,000, giving a rate per 100,000 residents.

 

In [19]:
# create violent crime and property crime rates
df['Violent Crime Rate'] = round((df['Violent Crime'] / df['TotalPop']) * 100000, 2)
df['Property Crime Rate'] = round((df['Property Crime'] / df['TotalPop']) * 100000, 2)
df['Employee Rate'] = round((df['Total Employees'] / df['TotalPop']) * 100000, 2)

# check
df[['State','County','TotalPop','Violent Crime','Violent Crime Rate','Property Crime','Property Crime Rate',
    'Total Employees','Employee Rate']].head()
Out[19]:
State County TotalPop Violent Crime Violent Crime Rate Property Crime Property Crime Rate Total Employees Employee Rate
0 Alabama Baldwin 195121 115 58.94 648 332.10 288 147.60
1 Alabama Bibb 22604 7 30.97 41 181.38 12 53.09
2 Alabama Bullock 10678 21 196.67 52 486.98 13 121.75
3 Alabama Clay 13537 9 66.48 15 110.81 25 184.68
4 Alabama Cleburne 15002 6 39.99 143 953.21 27 179.98

States Represented

In [20]:
# how many states are represented in the data?
states = df['State'].unique()
states
Out[20]:
array(['Alabama', 'Arizona', 'Arkansas', 'California', 'Florida',
       'Georgia', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas',
       'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Michigan',
       'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska',
       'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin'], dtype=object)
In [21]:
print(str(len(states)) + ' of the 50 states (' + str(len(states)/50 * 100) + '%) are represented in the data.')
42 of the 50 states (84.0%) are represented in the data.
In [22]:
# how many counties are represented for each state?
df['State'].value_counts()
Out[22]:
Texas             188
Missouri          107
Kentucky           94
Georgia            92
Tennessee          90
Minnesota          86
Virginia           85
Michigan           79
Oklahoma           73
Wisconsin          71
Iowa               70
North Carolina     67
Nebraska           67
Arkansas           65
California         57
North Dakota       52
Kansas             49
Florida            48
New York           48
Montana            47
Idaho              44
South Dakota       42
Ohio               41
South Carolina     38
Illinois           37
Washington         37
Indiana            32
Pennsylvania       30
West Virginia      29
Louisiana          28
Utah               25
Oregon             25
New Jersey         21
Alabama            21
Maryland           20
Maine              16
Nevada             15
Mississippi        15
Arizona            13
Vermont            10
New Mexico          5
New Hampshire       4
Name: State, dtype: int64

Frequencies/Distributions

I am interested in looking at frequencies of the following variables:

  1. Metropolitan
  2. Poverty (hist)
  3. Child Poverty (hist)
  4. Violent Crime Rate (hist)
  5. Property Crime Rate (hist)
  6. Employee Rate (hist)

Due to the nature of the dataset, most of these variables are better viewed as histograms rather than frequency tables. Frequency tables are best utilized on discrete variables with relatively few unique values or on categorical variables.

 

In [23]:
# metropolitan frequency table
# 1 = True (metropolitan county); 0 = False (non-metropolitan county)
df['Metropolitan'].value_counts()
Out[23]:
0    1316
1     767
Name: Metropolitan, dtype: int64
In [24]:
# metropolitan proportion
df['Metropolitan'].value_counts(normalize=True)
Out[24]:
0    0.631781
1    0.368219
Name: Metropolitan, dtype: float64

63.2% (1,316) of the counties are non-metropolitan. 36.8% (767) are metropolitan. A metropolis is a very large industrialized/commercialized city with a dense population. It would be interesting to compare crime rates between metropolitan and non-metropolitan counties.
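That metro vs. non-metro comparison could be sketched with a simple groupby. A minimal example on a toy frame (illustrative values, not the real data) with the same column names used above:

```python
import pandas as pd

# toy frame with the columns the comparison would use (made-up values)
df = pd.DataFrame({'Metropolitan': [1, 1, 0, 0, 0],
                   'Violent Crime Rate': [320.0, 280.0, 90.0, 110.0, 70.0],
                   'Property Crime Rate': [2100.0, 1900.0, 600.0, 750.0, 500.0]})

# mean crime rates by county type (0 = non-metropolitan, 1 = metropolitan)
rates_by_type = df.groupby('Metropolitan')[['Violent Crime Rate',
                                            'Property Crime Rate']].mean()
print(rates_by_type)
```

On the real dataset the same two lines would show whether metropolitan counties have higher average rates.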

 

In [25]:
# histogram function
def plot_hist(col, title, savefigtitle, bins=15, color='steelblue'):
    plt.hist(col, bins=bins, color=color, edgecolor='white', lw=0.25)
    plt.title(title + ' Distribution', loc='left', fontweight='bold', y=1.02)
    sns.despine(left=True, bottom=True)
    plt.tick_params(left=False, bottom=False)
    plt.xlabel(title)
    plt.ylabel('Frequency')
    plt.savefig(savefigtitle)

# poverty distribution (hist)
plot_hist(df['Poverty'], 'Poverty (%)', 'poverty_hist.png')
[Figure: Poverty (%) Distribution histogram (poverty_hist.png)]
In [26]:
# boxplot function
def plot_boxplot(data, col, title, x_range, savefigtitle, color='steelblue', linewidth=1):
    fig = plt.figure(figsize=(15,4))
    sns.boxplot(x=col, data=data, color=color, linewidth=linewidth)
    sns.despine(left=True, bottom=True)
    plt.title(title, loc='left', fontweight='bold', y=1.02)
    plt.xticks(x_range)
    plt.savefig(savefigtitle)
    plt.show()

# poverty distribution (boxplot)
plot_boxplot(df, 'Poverty', 'Poverty', range(0,60,10), 'poverty_box.png')
[Figure: Poverty boxplot (poverty_box.png)]

Fifty percent of the counties have a poverty rate roughly between 12% and 20%. The distribution is slightly right-skewed, meaning it has a longer tail to the right. The dots in the boxplot represent outliers: counties with an exceptionally high poverty rate compared to most other counties.
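The quartiles and outlier fences behind a boxplot can also be computed directly. A minimal sketch on a toy series (made-up poverty percentages), using the standard 1.5 × IQR rule that matplotlib and seaborn boxplots apply by default:

```python
import pandas as pd

# toy poverty percentages (illustrative, not the real data)
poverty = pd.Series([8.0, 12.0, 14.0, 16.0, 18.0, 20.0, 24.0, 45.0])

q1, q3 = poverty.quantile(0.25), poverty.quantile(0.75)
iqr = q3 - q1

# points beyond 1.5 * IQR from the quartiles are drawn as outlier dots
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr
outliers = poverty[(poverty > upper_fence) | (poverty < lower_fence)]
print(outliers.tolist())  # only the extreme value is flagged
```

Running the same computation on `df['Poverty']` would list the specific counties drawn as dots in the plot above.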

 

In [27]:
# child poverty distribution (hist)
plot_hist(df['ChildPoverty'], 'Child Poverty (%)', 'child_poverty_hist.png')
[Figure: Child Poverty (%) Distribution histogram (child_poverty_hist.png)]
In [28]:
# child poverty distribution (boxplot)
plot_boxplot(df, 'ChildPoverty', 'Child Poverty', range(0,80,10), 'child_poverty_box.png')
[Figure: Child Poverty boxplot (child_poverty_box.png)]

Fifty percent of the counties have a child poverty rate roughly between 16% and 29%. The distribution is close to normal, with only a slight right skew.

 

In [29]:
# violent crime rate distribution (hist)
plot_hist(df['Violent Crime Rate'], 'Violent Crime Rate', 'vcr_hist.png', bins=15)
[Figure: Violent Crime Rate Distribution histogram (vcr_hist.png)]
In [30]:
# violent crime rate distribution (boxplot)
plot_boxplot(df, 'Violent Crime Rate', 'Violent Crime Rate', range(0,1100,100), 'vcr_box.png')
[Figure: Violent Crime Rate boxplot (vcr_box.png)]

The violent crime rate has a strongly right-skewed distribution, with the majority of counties having a violent crime rate under roughly 135 per 100,000 people. In fact, the bin with the greatest frequency looks to be less than 70.

 

In [31]:
# property crime rate distribution (hist)
plot_hist(df['Property Crime Rate'], 'Property Crime Rate', 'pcr_hist.png')
[Figure: Property Crime Rate Distribution histogram (pcr_hist.png)]
In [32]:
# property crime rate distribution (boxplot)
plot_boxplot(df, 'Property Crime Rate', 'Property Crime Rate', range(0,6250,250), 'pcr_box.png')
[Figure: Property Crime Rate boxplot (pcr_box.png)]

The property crime rate also has a strongly right-skewed distribution, with the majority of counties having a rate under roughly 900 per 100,000 people. Comparing the violent crime rate to the property crime rate, we see that property crimes are much more common, as they have a much higher rate of occurrence.

 

In [33]:
# full-time law enforcement employees rate (hist)
plot_hist(df['Employee Rate'], 'Employee Rate', 'er_hist.png')
[Figure: Employee Rate Distribution histogram (er_hist.png)]
In [34]:
# full-time law enforcement employees rate (boxplot)
plot_boxplot(df, 'Employee Rate', 'Full-Time Law Enforcement Rate', range(0,3500,250), 'le_box.png')
[Figure: Full-Time Law Enforcement Rate boxplot (le_box.png)]

The employee rate also has a strongly right-skewed distribution, with the majority of counties having a rate under roughly 250 per 100,000 people.

 

 
