 # Data Analysis & Interpretation 4.2: Predicting County Poverty Group with Random Forest

## Week 2

Run a random forest.

You will need to perform a random forest analysis to evaluate the importance of a series of explanatory variables in predicting a binary categorical response variable.

## The Features and Target of My Model

These are the same as the previous week. The target is Poverty Group (1 = poverty rate > 16%, 0 = poverty rate <= 16%). The features are Crime Rate, Metropolitan, Region, Employed Rate, White, Black, and Hispanic. The data was already prepped for the previous model.
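The prep carried over from the previous week is assumed to include one-hot encoding of Region, which is why the importance table later lists Midwest, Northeast, South, and West as separate features. A minimal sketch of that step, using made-up values and a subset of the columns for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical stand-in for the county dataset (illustrative values only)
df = pd.DataFrame({
    'Crime Rate': [0.05, 0.12, 0.03, 0.09],
    'Metropolitan': [1, 0, 1, 0],
    'Region': ['South', 'West', 'Midwest', 'Northeast'],
    'Employed Rate': [0.60, 0.43, 0.52, 0.48],
    'Poverty Group': [0, 1, 0, 1],
})

# one-hot encode Region into Midwest/Northeast/South/West dummy columns
df = pd.get_dummies(df, columns=['Region'], prefix='', prefix_sep='')

X = df.drop(columns=['Poverty Group'])
y = df['Poverty Group']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)
```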

In :
```# import requisites
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
```

## Run Model

In :
```# loop through number of trees for accuracy scores
trees = range(50)
accuracy_scores = []

for i in trees:
    # instantiate and fit model with i + 1 trees
    rfc = RandomForestClassifier(n_estimators=i + 1)
    rfc = rfc.fit(X_train, y_train)

    # prediction
    y_pred = rfc.predict(X_test)

    # append accuracy to accuracy_scores
    accuracy = metrics.accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)
```
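Rather than eyeballing the plot, the best tree count can be read directly from the scores. A sketch, using illustrative values in place of the real `accuracy_scores` list:

```python
import numpy as np

# illustrative accuracy scores from a loop like the one above
accuracy_scores = [0.74, 0.77, 0.785, 0.787, 0.786]

# np.argmax returns the index of the first maximum; the loop fit
# i + 1 trees at index i, so add 1 to recover the tree count
best_idx = int(np.argmax(accuracy_scores))
best_n_estimators = best_idx + 1
print(best_n_estimators)  # 4
```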
In :
```# plot accuracy_scores
plt.plot(trees, accuracy_scores)
```
In :
```# run model with n_estimators = 25
rfc25 = RandomForestClassifier(n_estimators=25)
rfc25 = rfc25.fit(X_train, y_train)
y_pred25 = rfc25.predict(X_test)

# confusion matrix
rfc25_matrix = metrics.confusion_matrix(y_test, y_pred25)
rfc25_matrix = pd.DataFrame(rfc25_matrix, index=['True','False'], columns=['Pred. True','Pred. False'])
print(rfc25_matrix)
```
```       Pred. True  Pred. False
True          261           65
False          68          231
```
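Beyond accuracy, precision and recall for the high-poverty class can be computed straight from these counts (the first row is actual high-poverty counties):

```python
# counts read off the confusion matrix above
tp, fn = 261, 65   # actual high-poverty counties (row 'True')
fp, tn = 68, 231   # actual low-poverty counties (row 'False')

precision = tp / (tp + fp)                   # 261 / 329
recall = tp / (tp + fn)                      # 261 / 326
accuracy = (tp + tn) / (tp + fn + fp + tn)   # 492 / 625

print(round(precision, 4), round(recall, 4), round(accuracy, 4))
# 0.7933 0.8006 0.7872
```

The accuracy recovered from the matrix matches the score reported below.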
In :
```# accuracy score
rfc25_accuracy = metrics.accuracy_score(y_test, y_pred25)
rfc25_accuracy
```
Out:
`0.7872`
In :
```# run feature importance
etc = ExtraTreesClassifier()
etc = etc.fit(X_train, y_train)
feat_importance = etc.feature_importances_
feat_importance = pd.DataFrame(feat_importance, index=X.columns, columns=['Importance'])
feat_importance
```
Out:
```
               Importance
Metropolitan     0.047471
Crime Rate       0.117026
White            0.156405
Black            0.126645
Hispanic         0.123432
Employed Rate    0.295857
Midwest          0.028891
Northeast        0.007053
South            0.086769
West             0.010451
```
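The ranking is easier to read sorted. A sketch that rebuilds the frame from the values reported above and sorts it:

```python
import pandas as pd

# importance scores as reported above
feat_importance = pd.DataFrame(
    {'Importance': [0.047471, 0.117026, 0.156405, 0.126645, 0.123432,
                    0.295857, 0.028891, 0.007053, 0.086769, 0.010451]},
    index=['Metropolitan', 'Crime Rate', 'White', 'Black', 'Hispanic',
           'Employed Rate', 'Midwest', 'Northeast', 'South', 'West'])

# rank features from most to least important
ranked = feat_importance.sort_values('Importance', ascending=False)
print(ranked.index[0], ranked.index[-1])  # Employed Rate Northeast
```

Employed Rate dominates, while Northeast and West contribute almost nothing.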

## Re-run Model with Key Features

The features with the lowest importance scores are Northeast and West. Let's remove these two features from the data and re-run the model to see whether the accuracy score improves.

In :
```# remove the two least-important features
X = X.drop(columns=['Northeast','West'])

# train, test, split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
```
In :
```X.head()
```
Out:
```
   Metropolitan  Crime Rate     White     Black  Hispanic  Employed Rate  Midwest  South
0             1    0.066137  0.831143  0.125828  0.045593       0.604179        0      1
1             1    0.035915  0.744186  0.283444  0.022290       0.440855        0      1
2             0    0.115626  0.215369  0.936424  0.044580       0.429831        0      1
3             0    0.029985  0.798787  0.190728  0.032421       0.475765        0      1
4             0    0.167981  0.926188  0.038411  0.023303       0.523762        0      1
```
In :
```# re-run model
rfc2 = RandomForestClassifier(n_estimators=25)
rfc2 = rfc2.fit(X_train, y_train)
y_pred2 = rfc2.predict(X_test)
```
In :
```# confusion matrix
rfc2_matrix = metrics.confusion_matrix(y_test, y_pred2)
rfc2_matrix = pd.DataFrame(rfc2_matrix, index=['True','False'], columns=['Pred. True','Pred. False'])
print(rfc2_matrix)
```
```       Pred. True  Pred. False
True          253           74
False          50          248
```
In :
```# accuracy score
rfc2_accuracy = metrics.accuracy_score(y_test, y_pred2)
rfc2_accuracy
```
Out:
`0.8016`
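As a quick sanity check, the reported accuracy matches the counts in the confusion matrix above:

```python
# counts from the second confusion matrix
tp, fn = 253, 74
fp, tn = 50, 248

# accuracy = correct predictions / all predictions
accuracy = (tp + tn) / (tp + fn + fp + tn)   # 501 / 625
print(accuracy)  # 0.8016
```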

## Summary of Predicting County Poverty Group with Random Forest Classifier

The initial model improved on the accuracy score of our previous decision tree classifier model, from 74% to roughly 79%. After testing the importance of the features used in the model, we removed the two features of least importance (Northeast and West). Re-running the model without them produced an accuracy score of roughly 80%.
