Data Analysis & Interpretation 4.1: Predicting County Poverty Group with Decision Tree

Course 4: Machine Learning for Data Analysis

This course covers several machine learning algorithms: Decision Trees, Random Forests, Lasso Regression, and K-Means Cluster Analysis.

Week 1

Run a Classification Tree.

You will need to perform a decision tree analysis to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable.

The Features and Target of My Model

The features of my model will be Crime Rate, Region, Metropolitan (1 = Metropolitan; 0 = Non-metropolitan), Employed Rate (calculated below), White, Black, and Hispanic. I will use these features to predict the Poverty Group of a county. Recall that the Poverty Group is either <= 16% or > 16%. I will recode the target so that 1 equals > 16% and 0 equals <= 16%. In other words, I want to predict whether a county's poverty percentage is greater than the median poverty rate. I will also have to create dummy variables for Region. The ethnic variables are already percentages, so I don't need to create dummy variables for those.

Prep

In [155]:
# create subset
sub14 = sub1.loc[:, ['Poverty Group', 'Metropolitan', 'Region', 'Crime Rate']].copy()
sub14['White'] = df.loc[:, 'White']
sub14['Black'] = df.loc[:, 'Black']
sub14['Hispanic'] = df.loc[:, 'Hispanic']
# employed persons per 100,000 residents
sub14['Employed Rate'] = round(df['Employed'] / df['TotalPop'] * 100000, 2)
sub14.rename(columns={'Poverty Group': 'Target'}, inplace=True)
In [156]:
sub14.head()
Out[156]:
Target Metropolitan Region Crime Rate White Black Hispanic Employed Rate
0 <= 16% 1 South 391.04 83.1 9.5 4.5 44051.13
1 > 16% 1 South 212.35 74.5 21.4 2.2 36692.62
2 > 16% 0 South 683.65 22.2 70.7 4.4 36195.92
3 > 16% 0 South 177.29 79.9 14.4 3.2 38265.49
4 > 16% 0 South 993.20 92.5 2.9 2.3 40427.94
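
Before recoding, it's worth a quick check that the columns pulled from the two different frames lined up without introducing missing values; a one-line sanity check:

# count missing values per column (all zeros indicates a clean merge)
print(sub14.isnull().sum())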
In [157]:
# recode the target variable: 1 = poverty > 16%, 0 = poverty <= 16%
def recode_target(row):
    if row['Target'] == '> 16%':
        return 1
    else:
        return 0

sub14['Target'] = sub14.apply(recode_target, axis=1)
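
As an aside, the same recode can be done without a row-wise apply; a vectorized one-liner (run on the sub14 frame above, before the recode) would be:

# compare to the string label, then cast True/False to 1/0
sub14['Target'] = (sub14['Target'] == '> 16%').astype(int)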
In [158]:
sub14.head()
Out[158]:
Target Metropolitan Region Crime Rate White Black Hispanic Employed Rate
0 0 1 South 391.04 83.1 9.5 4.5 44051.13
1 1 1 South 212.35 74.5 21.4 2.2 36692.62
2 1 0 South 683.65 22.2 70.7 4.4 36195.92
3 1 0 South 177.29 79.9 14.4 3.2 38265.49
4 1 0 South 993.20 92.5 2.9 2.3 40427.94
In [159]:
# create dummy variables for Region
region_dum = pd.get_dummies(sub14['Region'])

# add dummy variables to dataframe
sub14 = pd.concat([sub14, region_dum], axis=1)

sub14.head()
Out[159]:
Target Metropolitan Region Crime Rate White Black Hispanic Employed Rate Midwest Northeast South West
0 0 1 South 391.04 83.1 9.5 4.5 44051.13 0 0 1 0
1 1 1 South 212.35 74.5 21.4 2.2 36692.62 0 0 1 0
2 1 0 South 683.65 22.2 70.7 4.4 36195.92 0 0 1 0
3 1 0 South 177.29 79.9 14.4 3.2 38265.49 0 0 1 0
4 1 0 South 993.20 92.5 2.9 2.3 40427.94 0 0 1 0

Now that I have the region dummy variables added to the data, I can remove the original Region variable.

 

In [160]:
# remove Region
sub14.drop('Region', axis=1, inplace=True)

sub14.head()
Out[160]:
Target Metropolitan Crime Rate White Black Hispanic Employed Rate Midwest Northeast South West
0 0 1 391.04 83.1 9.5 4.5 44051.13 0 0 1 0
1 1 1 212.35 74.5 21.4 2.2 36692.62 0 0 1 0
2 1 0 683.65 22.2 70.7 4.4 36195.92 0 0 1 0
3 1 0 177.29 79.9 14.4 3.2 38265.49 0 0 1 0
4 1 0 993.20 92.5 2.9 2.3 40427.94 0 0 1 0

Scaling

Since Crime Rate has a large range, I want to scale the variable, and the same goes for the ethnic variables and Employed Rate. Strictly speaking, a decision tree is insensitive to monotonic transformations like this, so scaling is not required for the model to work, but it puts every feature on the same 0 to 1 scale, which makes them easier to compare. I will use the min-max approach, x' = (x - min) / (max - min), which is often recommended over z-score standardization for data that is not normally distributed.

 

In [161]:
# scale Crime Rate
cr_min = sub14['Crime Rate'].min()
cr_max = sub14['Crime Rate'].max()
sub14['Crime Rate'] = (sub14['Crime Rate'] - cr_min) / (cr_max - cr_min)

sub14.head()
Out[161]:
Target Metropolitan Crime Rate White Black Hispanic Employed Rate Midwest Northeast South West
0 0 1 0.066137 83.1 9.5 4.5 44051.13 0 0 1 0
1 1 1 0.035915 74.5 21.4 2.2 36692.62 0 0 1 0
2 1 0 0.115626 22.2 70.7 4.4 36195.92 0 0 1 0
3 1 0 0.029985 79.9 14.4 3.2 38265.49 0 0 1 0
4 1 0 0.167981 92.5 2.9 2.3 40427.94 0 0 1 0
In [162]:
# scale ethnic variables
w_min = sub14['White'].min()
w_max = sub14['White'].max()
sub14['White'] = (sub14['White'] - w_min) / (w_max - w_min)

b_min = sub14['Black'].min()
b_max = sub14['Black'].max()
sub14['Black'] = (sub14['Black'] - b_min) / (b_max - b_min)

h_min = sub14['Hispanic'].min()
h_max = sub14['Hispanic'].max()
sub14['Hispanic'] = (sub14['Hispanic'] - h_min) / (h_max - h_min)

sub14.head()
Out[162]:
Target Metropolitan Crime Rate White Black Hispanic Employed Rate Midwest Northeast South West
0 0 1 0.066137 0.831143 0.125828 0.045593 44051.13 0 0 1 0
1 1 1 0.035915 0.744186 0.283444 0.022290 36692.62 0 0 1 0
2 1 0 0.115626 0.215369 0.936424 0.044580 36195.92 0 0 1 0
3 1 0 0.029985 0.798787 0.190728 0.032421 38265.49 0 0 1 0
4 1 0 0.167981 0.926188 0.038411 0.023303 40427.94 0 0 1 0
In [163]:
# scale Employed Rate
emp_min = sub14['Employed Rate'].min()
emp_max = sub14['Employed Rate'].max()
sub14['Employed Rate'] = (sub14['Employed Rate'] - emp_min) / (emp_max - emp_min)

sub14.head()
Out[163]:
Target Metropolitan Crime Rate White Black Hispanic Employed Rate Midwest Northeast South West
0 0 1 0.066137 0.831143 0.125828 0.045593 0.604179 0 0 1 0
1 1 1 0.035915 0.744186 0.283444 0.022290 0.440855 0 0 1 0
2 1 0 0.115626 0.215369 0.936424 0.044580 0.429831 0 0 1 0
3 1 0 0.029985 0.798787 0.190728 0.032421 0.475765 0 0 1 0
4 1 0 0.167981 0.926188 0.038411 0.023303 0.523762 0 0 1 0
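
The three cells above repeat the same transform column by column. As a sketch of an equivalent shortcut, scikit-learn's MinMaxScaler can rescale several columns in one pass (assuming the unscaled sub14 frame):

from sklearn.preprocessing import MinMaxScaler

# rescale all of the continuous columns to the [0, 1] range at once
cols = ['Crime Rate', 'White', 'Black', 'Hispanic', 'Employed Rate']
sub14[cols] = MinMaxScaler().fit_transform(sub14[cols])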

Train-Test Split…Run Model

In [164]:
# import requisites
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics as metrics
In [165]:
# create X (features) and y (target) variables
X = sub14.drop('Target', axis=1)
y = sub14['Target']

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# print the number of rows in each set
print('X_train:', len(X_train))
print('y_train:', len(y_train))
print('X_test:', len(X_test))
print('y_test:', len(y_test))
X_train: 1458
y_train: 1458
X_test: 625
y_test: 625
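
One refinement not used here is a stratified split, which preserves the target's class balance in both sets; the call would look like:

# stratify=y keeps the proportion of 0s and 1s equal across train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)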
In [166]:
# implement model
dtc = DecisionTreeClassifier()  # instantiate the model (no random_state, so repeat runs can vary slightly)
dtc = dtc.fit(X_train, y_train)  # train the model
y_pred = dtc.predict(X_test)  # predict on the test features

# view confusion matrix
# scikit-learn orders rows and columns by label value: row 0 / column 0 is
# Target = 0 (<= 16%) and row 1 / column 1 is Target = 1 (> 16%)
dtc_matrix = metrics.confusion_matrix(y_test, y_pred)
dtc_matrix = pd.DataFrame(dtc_matrix,
                          index=['Actual <= 16%', 'Actual > 16%'],
                          columns=['Pred. <= 16%', 'Pred. > 16%'])
print(dtc_matrix)
               Pred. <= 16%  Pred. > 16%
Actual <= 16%           252           74
Actual > 16%             89          210
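
Beyond the raw counts, metrics.classification_report summarizes per-class precision, recall, and F1 from the same predictions:

# per-class precision, recall, and F1 for the test predictions
print(metrics.classification_report(y_test, y_pred,
                                    target_names=['<= 16%', '> 16%']))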
In [167]:
# view accuracy
dtc_accuracy = metrics.accuracy_score(y_test, y_pred)
dtc_accuracy
Out[167]:
0.7392
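
To see which features drive the splits, the fitted tree can be drawn. A sketch using scikit-learn's plot_tree, truncating the plot for readability (the full, unpruned tree is too deep to display):

from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
tree.plot_tree(dtc, feature_names=list(X.columns),
               class_names=['<= 16%', '> 16%'],
               filled=True, max_depth=2)  # draw only the top of the tree
plt.show()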

Summary of Predicting County Poverty Group

From the confusion matrix, we see that the model correctly predicted 210 counties as having poverty > 16% (true positives) and 252 counties as having poverty <= 16% (true negatives). It incorrectly predicted 89 counties with poverty > 16% as being at or below 16% (false negatives) and 74 counties with poverty <= 16% as being above 16% (false positives). The accuracy on the test data is (210 + 252) / 625 = 0.7392, or about 74%. In other words, the decision tree classifier correctly predicted the poverty group of roughly 74% of the counties.
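
Since a single train/test split (and an unseeded tree) can make this figure vary between runs, a cross-validated estimate is a useful sanity check; a minimal sketch using the X and y defined above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for a seeded tree
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print('mean accuracy: %.4f (+/- %.4f)' % (scores.mean(), scores.std()))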

 

 
