Data Analysis & Interpretation 4.2: Predicting County Poverty Group with Random Forest

Week 2

Run a Random Forest.

You will need to perform a random forest analysis to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable.

The Features and Target of My Model

These are the same as in the previous week. The target is Poverty Group (1 = above 16%, 0 = 16% or below). The features are Crime Rate, Metropolitan, Region, Employed Rate, White, Black, and Hispanic; Region enters the model as four indicator columns (Midwest, Northeast, South, and West). The data is already prepped from the previous model.

 

In [168]:
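# import requisites
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split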

Run Model

In [169]:
# loop through number of trees (1 to 50) and record the accuracy score for each
trees = range(1, 51)
accuracy_scores = []

for n_trees in trees:
    # instantiate and fit the model
    rfc = RandomForestClassifier(n_estimators=n_trees)
    rfc = rfc.fit(X_train, y_train)

    # prediction
    y_pred = rfc.predict(X_test)

    # append accuracy to accuracy_scores
    accuracy = metrics.accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)
In [170]:
# plot accuracy_scores against the number of trees
plt.plot(trees, accuracy_scores)
plt.xlabel('Number of trees')
plt.ylabel('Accuracy score')

[Figure: rfc_accuracy_scores (accuracy score by number of trees)]
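Because no `random_state` is passed to `RandomForestClassifier`, the accuracy scan gives slightly different numbers on every run. The following is a minimal sketch of a repeatable version of the same scan. It assumes synthetic data from `make_classification` as a stand-in for the county data, which is not reproduced in this post; the names `X_demo`, `y_demo`, `tree_counts`, and `scan_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the county data (assumption: 10 numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10)

tree_counts = list(range(1, 26))
scan_scores = []
for n_trees in tree_counts:
    # A fixed random_state makes each forest, and so each score, repeatable
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(Xd_train, yd_train)
    scan_scores.append(accuracy_score(yd_test, model.predict(Xd_test)))

# Smallest number of trees that reaches the best score in this scan
best_n = tree_counts[scan_scores.index(max(scan_scores))]
print(best_n, max(scan_scores))
```

With the seed fixed, a change in the curve reflects a change in the data or the settings rather than a different random draw.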

In [171]:

# run model with n_estimators = 25
rfc25 = RandomForestClassifier(n_estimators=25)
rfc25 = rfc25.fit(X_train, y_train)
y_pred25 = rfc25.predict(X_test)

# confusion matrix (rows are actual classes, columns are predicted classes, in label order 0, 1)
rfc25_matrix = metrics.confusion_matrix(y_test, y_pred25)
rfc25_matrix = pd.DataFrame(rfc25_matrix, index=['Actual 0','Actual 1'], columns=['Pred. 0','Pred. 1'])
print(rfc25_matrix)
          Pred. 0  Pred. 1
Actual 0      261       65
Actual 1       68      231
In [172]:
# accuracy score
rfc25_accuracy = metrics.accuracy_score(y_test, y_pred25)
rfc25_accuracy
Out[172]:
0.7872
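The accuracy score can be checked by hand against the confusion matrix printed for the 25-tree model. scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, sorted by label, so the first row holds the counties whose actual class is 0. This small sketch redoes the arithmetic in plain Python; the per-class recall figures are an addition for illustration and are not computed in the original notebook.

```python
# Counts copied from the confusion matrix printed for the 25-tree model
# (rows: actual class 0, actual class 1; columns: predicted class 0, predicted class 1)
matrix = [[261, 65],
          [68, 231]]

total = sum(sum(row) for row in matrix)   # all test counties
correct = matrix[0][0] + matrix[1][1]     # diagonal = correct predictions
accuracy = correct / total

# Share of each actual class that the model classified correctly
recall_class0 = matrix[0][0] / sum(matrix[0])
recall_class1 = matrix[1][1] / sum(matrix[1])

print(total, correct, accuracy)
print(round(recall_class0, 4), round(recall_class1, 4))
```

The diagonal holds 492 of the 625 test counties, which gives the 0.7872 reported by `metrics.accuracy_score`.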
In [173]:
# run feature importance
etc = ExtraTreesClassifier()
etc = etc.fit(X_train, y_train)
feat_importance = etc.feature_importances_
feat_importance = pd.DataFrame(feat_importance, index=X.columns, columns=['Importance'])
feat_importance
Out[173]:
               Importance
Metropolitan     0.047471
Crime Rate       0.117026
White            0.156405
Black            0.126645
Hispanic         0.123432
Employed Rate    0.295857
Midwest          0.028891
Northeast        0.007053
South            0.086769
West             0.010451
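The importances above come from a separate `ExtraTreesClassifier`, not from the random forest fitted earlier. A fitted `RandomForestClassifier` exposes the same `feature_importances_` attribute, so the ranking could also be read from the forest itself. The following is a minimal sketch of that alternative, assuming synthetic data and hypothetical column names (`feature_0` to `feature_5`) in place of the county features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the county data; the column names are hypothetical
feature_names = [f"feature_{i}" for i in range(6)]
X_arr, y_arr = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

# Fit the forest, then read the importances from the forest itself
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_arr)
importance = pd.Series(
    forest.feature_importances_, index=X_demo.columns, name="Importance"
).sort_values(ascending=False)

print(importance)
```

As in the table above, the importances are normalized to sum to 1, so each value reads as a share of the total.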

Re-run Model with Key Features

The features with the lowest importance scores are Northeast and West. Let’s remove these two features from the data and re-run the model to see whether the accuracy score improves.

 

In [174]:
# drop the two least important features
X = X.drop(columns=['Northeast', 'West'])

# train, test, split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
In [175]:
X.head()
Out[175]:
   Metropolitan  Crime Rate     White     Black  Hispanic  Employed Rate  Midwest  South
0             1    0.066137  0.831143  0.125828  0.045593       0.604179        0      1
1             1    0.035915  0.744186  0.283444  0.022290       0.440855        0      1
2             0    0.115626  0.215369  0.936424  0.044580       0.429831        0      1
3             0    0.029985  0.798787  0.190728  0.032421       0.475765        0      1
4             0    0.167981  0.926188  0.038411  0.023303       0.523762        0      1
In [176]:
# re-run model
rfc2 = RandomForestClassifier(n_estimators=25)
rfc2 = rfc2.fit(X_train, y_train)
y_pred2 = rfc2.predict(X_test)
In [177]:
# confusion matrix (rows are actual classes, columns are predicted classes, in label order 0, 1)
rfc2_matrix = metrics.confusion_matrix(y_test, y_pred2)
rfc2_matrix = pd.DataFrame(rfc2_matrix, index=['Actual 0','Actual 1'], columns=['Pred. 0','Pred. 1'])
print(rfc2_matrix)
          Pred. 0  Pred. 1
Actual 0      253       74
Actual 1       50      248
In [178]:
# accuracy score
rfc2_accuracy = metrics.accuracy_score(y_test, y_pred2)
rfc2_accuracy
Out[178]:
0.8016

Summary of Predicting County Poverty Group with Random Forest Classifier

The initial random forest model improved on the accuracy of our previous decision tree classifier, from 74% to 79%. After checking the importance of the features used in the model, we removed the two least important features (Northeast and West). After re-running the model, the accuracy score was 80% (0.8016).
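Both accuracy scores come from a single forest fitted without a fixed seed, so part of the gap between them may be run-to-run variation. One way to gauge this is to refit the model under several seeds and compare the spread of scores for each feature set. The following is a minimal sketch under stated assumptions: the data is synthetic, the helper `accuracy_over_seeds` is hypothetical, and dropping the last two columns only stands in for removing Northeast and West.

```python
import statistics

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the county data (assumption: 8 numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=1)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10)


def accuracy_over_seeds(columns, seeds=range(10)):
    """Fit one 25-tree forest per seed on the given columns; return the test accuracies."""
    scores = []
    for seed in seeds:
        forest = RandomForestClassifier(n_estimators=25, random_state=seed)
        forest.fit(Xd_train[:, columns], yd_train)
        scores.append(forest.score(Xd_test[:, columns], yd_test))
    return scores


all_cols = list(range(8))
reduced_cols = list(range(6))  # hypothetical: drop the last two columns

for label, cols in [("all features", all_cols), ("reduced", reduced_cols)]:
    s = accuracy_over_seeds(cols)
    print(label, round(statistics.mean(s), 4), round(statistics.stdev(s), 4))
```

If the difference between the two means is small relative to the standard deviations, the change in feature set has not clearly changed accuracy.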

 
