Run a Random Forest.
You will need to perform a random forest analysis to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable.
The Features and Target of My Model
These are the same as the previous week. The target is Poverty Group (1 = poverty rate > 16%, 0 = poverty rate <= 16%). The features are Crime Rate, Metropolitan, Region, Employed Rate, White, Black, and Hispanic. The data was already prepped for the previous model; a rough sketch of that prep is included below for reference.
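For anyone following along, the prep looked roughly like this. This is a hypothetical sketch: the file name county_data.csv and raw column names such as poverty_rate are placeholders, not the actual dataset fields.

import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical sketch -- file and raw column names are placeholders
df = pd.read_csv('county_data.csv')

# binary target: 1 if the county poverty rate is above 16%, else 0
df['Poverty Group'] = (df['poverty_rate'] > 16).astype(int)

# dummy-code the categorical Region variable into four indicator columns
df = df.join(pd.get_dummies(df['Region']))

y = df['Poverty Group']
X = df[['Metropolitan', 'Crime Rate', 'White', 'Black', 'Hispanic',
        'Employed Rate', 'Northeast', 'Midwest', 'South', 'West']]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)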
# import requisites
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
# loop through number of trees for accuracy scores
trees = range(50)
accuracy_scores = []
for i in trees:
    # instantiate and fit a forest with i + 1 trees
    rfc = RandomForestClassifier(n_estimators=i + 1)
    rfc = rfc.fit(X_train, y_train)
    # prediction
    y_pred = rfc.predict(X_test)
    # append accuracy to accuracy_scores
    accuracy = metrics.accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)
# plot accuracy_scores against the actual number of trees (i + 1)
plt.plot([t + 1 for t in trees], accuracy_scores)
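Rather than eyeballing the curve, the peak can be read off directly; a small sketch (the numpy import and axis labels are additions):

import numpy as np

# label the axes and locate the best-scoring forest size
plt.xlabel('Number of trees (n_estimators)')
plt.ylabel('Accuracy score')
plt.show()

best = int(np.argmax(accuracy_scores)) + 1  # index i corresponds to i + 1 trees
print('Best accuracy %.3f at n_estimators=%d' % (max(accuracy_scores), best))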
# run model with n_estimators = 25
rfc25 = RandomForestClassifier(n_estimators=25)
rfc25 = rfc25.fit(X_train, y_train)
y_pred25 = rfc25.predict(X_test)

# confusion matrix -- sklearn orders classes ascending,
# so the 0 (False) class comes first in rows and columns
rfc25_matrix = metrics.confusion_matrix(y_test, y_pred25)
rfc25_matrix = pd.DataFrame(rfc25_matrix, index=['False', 'True'],
                            columns=['Pred. False', 'Pred. True'])
print(rfc25_matrix)
       Pred. False  Pred. True
False          261          65
True            68         231
# accuracy score
rfc25_accuracy = metrics.accuracy_score(y_test, y_pred25)
rfc25_accuracy
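Accuracy alone hides the per-class behavior; a quick look at per-class precision and recall can be printed alongside it (a small sketch using sklearn's classification_report):

# per-class precision, recall, and f1 for the 25-tree model
print(metrics.classification_report(y_test, y_pred25))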
# run feature importance with an extra-trees classifier
etc = ExtraTreesClassifier()
etc = etc.fit(X_train, y_train)
feat_importance = etc.feature_importances_
feat_importance = pd.DataFrame(feat_importance, index=X.columns,
                               columns=['Importance'])
feat_importance
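To make the weakest features easy to spot, the importance table can be sorted and charted; a minimal sketch:

# sort importances so the weakest features appear at the bottom of the chart
feat_importance.sort_values('Importance').plot(kind='barh', legend=False)
plt.xlabel('Importance')
plt.show()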
Re-run Model with Key Features
The features with the lowest importance scores are Northeast and West. Let’s remove these two features from the data and re-run the model to see whether the accuracy score improves.
# remove the two least important features
X = X.drop(columns=['Northeast', 'West'])

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
Remaining features: Metropolitan, Crime Rate, White, Black, Hispanic, Employed Rate, Midwest, South.
# re-run model
rfc2 = RandomForestClassifier(n_estimators=25)
rfc2 = rfc2.fit(X_train, y_train)
y_pred2 = rfc2.predict(X_test)
# confusion matrix (again with the 0/False class first)
rfc2_matrix = metrics.confusion_matrix(y_test, y_pred2)
rfc2_matrix = pd.DataFrame(rfc2_matrix, index=['False', 'True'],
                           columns=['Pred. False', 'Pred. True'])
print(rfc2_matrix)
       Pred. False  Pred. True
False          253          74
True            50         248
# accuracy score
rfc2_accuracy = metrics.accuracy_score(y_test, y_pred2)
rfc2_accuracy
Summary of Predicting County Poverty Group with Random Forest Classifier
The initial random forest improved on our previous decision tree classifier, raising the accuracy score from 74% to 79%. After testing the importance of the features used in the model, we removed the two least important (Northeast and West) and re-ran the model, which raised the accuracy score to roughly 80%, consistent with the confusion matrix above.
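Because a single 70/30 split can be noisy, a quick cross-validation check would make the comparison more robust; a sketch, assuming the reduced X and y are still in scope:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for the reduced 25-tree model
scores = cross_val_score(RandomForestClassifier(n_estimators=25), X, y, cv=5)
print('Mean CV accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))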