Data Analysis & Interpretation 4.3: Predicting Employed Rate with Lasso Regression

Week 3

Run a lasso regression analysis using k-fold cross validation to identify a subset of predictors from a larger pool of predictor variables that best predicts a quantitative response variable.

The Features and Target of My Model

The target variable (what’s being predicted) is Employed Rate. Lasso performs its own feature selection by shrinking the regression coefficients of the features (any feature whose coefficient shrinks to 0 is excluded from the model), so it works best with a large pool of candidates. For that reason, I want to use a larger number of features than I have in previous weeks.
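To make the shrinkage idea concrete, here is a minimal sketch on synthetic data (not this project's data; the sample sizes and `alpha` value are illustrative). Only the first three columns actually drive the response, and lasso's L1 penalty zeroes out most of the rest:

```python
# Illustrative sketch: lasso zeroes out coefficients of irrelevant features.
# Synthetic data, not the project's dataset; alpha chosen for demonstration.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only the first three features influence y; the other seven are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.5).fit(X, y)
n_kept = int(np.sum(model.coef_ != 0))
print(f'{n_kept} of {X.shape[1]} features kept by lasso')
```

The surviving coefficients are also shrunk toward zero, which is the price lasso pays for the automatic feature selection.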

Prep

In [179]:
# create subset of data
cols = ['Metropolitan', 'White', 'Black', 'Hispanic', 'Native', 'Asian', 'Pacific',
        'Poverty', 'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
        'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome',
        'MeanCommute', 'PrivateWork', 'PublicWork', 'SelfEmployed', 'FamilyWork',
        'Unemployment', 'Employee Rate']
sub15 = df.loc[:, cols].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
sub15['Crime Rate'] = sub1['Crime Rate']
sub15['South'] = sub14['South']
sub15['Northeast'] = sub14['Northeast']
sub15['Midwest'] = sub14['Midwest']
sub15['West'] = sub14['West']
sub15['Employed Rate'] = sub14['Employed Rate']
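The four region indicators copied over from sub14 are effectively one-hot dummies. If they had to be rebuilt from a single categorical column (a hypothetical `'Region'` column here, used only for illustration), `pd.get_dummies` would produce the same layout:

```python
# Sketch: building region indicator columns with pandas get_dummies.
# The 'Region' column here is hypothetical, standing in for however
# sub14's South/Northeast/Midwest/West columns were originally derived.
import pandas as pd

demo = pd.DataFrame({'Region': ['South', 'South', 'Midwest', 'West', 'Northeast']})
dummies = pd.get_dummies(demo['Region']).astype(int)  # one 0/1 column per region
print(dummies.columns.tolist())
```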
In [180]:
sub15.head()
Out[180]:
Metropolitan White Black Hispanic Native Asian Pacific Poverty ChildPoverty Professional SelfEmployed FamilyWork Unemployment Employee Rate Crime Rate South Northeast Midwest West Employed Rate
0 1 83.1 9.5 4.5 0.6 0.7 0.0 13.4 19.2 33.1 5.8 0.4 7.5 147.60 391.04 1 0 0 0 44051.13
1 1 74.5 21.4 2.2 0.4 0.1 0.0 16.8 27.9 21.5 6.7 0.4 8.3 53.09 212.35 1 0 0 0 36692.62
2 0 22.2 70.7 4.4 1.2 0.2 0.0 24.6 38.4 18.8 5.4 0.0 18.0 121.75 683.65 1 0 0 0 36195.92
3 0 79.9 14.4 3.2 0.7 0.0 0.0 16.7 22.5 21.5 7.8 0.0 9.4 184.68 177.29 1 0 0 0 38265.49
4 0 92.5 2.9 2.3 0.2 0.4 0.0 17.0 26.3 28.9 8.4 0.0 8.3 179.98 993.20 1 0 0 0 40427.94

5 rows × 33 columns

 

In [181]:
# separate features and target
features = sub15.drop('Employed Rate', axis=1)
y = sub15['Employed Rate']
In [182]:
# standardize the features
from sklearn import preprocessing

X_scaled = features.copy()
X_scaled = X_scaled.apply(lambda x: preprocessing.scale(x).astype('float64'), axis=0)
X_scaled.head()
Metropolitan White Black Hispanic Native Asian Pacific Poverty ChildPoverty Professional PublicWork SelfEmployed FamilyWork Unemployment Employee Rate Crime Rate South Northeast Midwest West
0 1.309876 0.209502 0.182603 -0.316518 -0.162429 -0.205435 -0.34949 -0.489944 -0.376973 0.360065 -0.811076 -0.589856 0.213812 -0.029628 -0.208476 -0.600790 1.088912 -0.256941 -0.712205 -0.41019
1 1.309876 -0.258333 1.171681 -0.488479 -0.195342 -0.476472 -0.34949 0.065290 0.525477 -1.545652 -0.166732 -0.362675 0.213812 0.211680 -0.773872 -0.886156 1.088912 -0.256941 -0.712205 -0.41019
2 -0.763431 -3.103421 5.269289 -0.323995 -0.063692 -0.431299 -0.34949 1.339064 1.614641 -1.989224 -0.336296 -0.690826 -0.657753 3.137533 -0.363121 -0.133496 1.088912 -0.256941 -0.712205 -0.41019
3 -0.763431 0.035424 0.589871 -0.413714 -0.145973 -0.521644 -0.34949 0.048960 -0.034665 -1.545652 -0.404122 -0.085010 -0.657753 0.543478 0.013351 -0.942146 1.088912 -0.256941 -0.712205 -0.41019
4 -0.763431 0.720856 -0.365961 -0.481003 -0.228254 -0.340953 -0.34949 0.097951 0.359509 -0.329936 -0.302383 0.066444 -0.657753 0.211680 -0.014766 0.360851 1.088912 -0.256941 -0.712205 -0.41019

5 rows × 32 columns
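The column-wise `preprocessing.scale` call above can also be written with `StandardScaler`, which keeps the fitted means and standard deviations around for transforming new data later. A small sketch (the two-column frame below is a stand-in for the real feature table, with values borrowed from the first rows shown earlier):

```python
# Equivalent standardization via StandardScaler; the tiny frame here is a
# stand-in for the full feature table used in the post.
import pandas as pd
from sklearn.preprocessing import StandardScaler

demo = pd.DataFrame({'Poverty': [13.4, 16.8, 24.6, 16.7, 17.0],
                     'Unemployment': [7.5, 8.3, 18.0, 9.4, 8.3]})

scaler = StandardScaler()
demo_scaled = pd.DataFrame(scaler.fit_transform(demo),
                           columns=demo.columns, index=demo.index)
# each column now has mean 0 and (population) standard deviation 1
print(demo_scaled.mean().round(6).tolist())
```

Standardizing matters for lasso because the L1 penalty is applied uniformly: features on larger scales would otherwise be penalized less per unit of effect.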

Train, Test, Split…Run Model

In [183]:
# split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=123)
In [184]:
# import lasso regression
from sklearn.linear_model import LassoLarsCV
In [185]:
# run lasso with 10 cross-validations
lasso = LassoLarsCV(cv=10, precompute=False).fit(X_train, y_train)
In [186]:
# view scores for train and test sets and the number of features used
import numpy as np

print('Training set score: {:.2f}'.format(lasso.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(lasso.score(X_test, y_test)))
print('Number of features used: {} out of {}'.format(np.sum(lasso.coef_ != 0), len(X_scaled.columns)))
Training set score: 0.75
Test set score: 0.71
Number of features used: 20 out of 32
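`LassoLarsCV` also records the penalty strength it settled on during cross-validation in its `alpha_` attribute, which is worth checking. A sketch on synthetic data (in the real notebook one would simply inspect the fitted `lasso` object):

```python
# Sketch: inspecting the CV-selected penalty of a LassoLarsCV fit.
# Synthetic data for illustration; the post's model would expose the same
# attributes on its own fitted `lasso` object.
import numpy as np
from sklearn.linear_model import LassoLarsCV

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 8))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.3, size=150)

model = LassoLarsCV(cv=10).fit(X, y)
print('chosen alpha:', model.alpha_)                     # CV-selected penalty
print('features kept:', int(np.sum(model.coef_ != 0)))   # nonzero coefficients
```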
In [187]:
# print variable names and regression coefficients; 0's are excluded from model
dict(zip(X_scaled.columns, lasso.coef_))
Out[187]:
{'Asian': 0.0,
 'Black': 314.37091751912607,
 'Carpool': 0.0,
 'ChildPoverty': -974.9024345131245,
 'Construction': -359.3528086639546,
 'Crime Rate': 0.0,
 'Drive': 0.0,
 'Employee Rate': -44.09934195571289,
 'FamilyWork': 0.0,
 'Hispanic': 0.0,
 'MeanCommute': -291.2754936955053,
 'Metropolitan': 448.9349619545055,
 'Midwest': 790.0667412728335,
 'Native': 86.17558901400935,
 'Northeast': 262.71167899048606,
 'Office': 118.51988528170489,
 'OtherTransp': 0.0,
 'Pacific': 0.0,
 'Poverty': -804.4795943620829,
 'PrivateWork': 298.9441368836553,
 'Production': 0.0,
 'Professional': 764.7625190669488,
 'PublicWork': -1006.8078771461334,
 'SelfEmployed': 0.0,
 'Service': -415.5981058864657,
 'South': -252.5886384929512,
 'Transit': 216.48681387753945,
 'Unemployment': -2210.6706416611964,
 'Walk': 713.5344485085071,
 'West': 0.0,
 'White': -403.3474763837396,
 'WorkAtHome': 0.0}
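The alphabetical dictionary above is easier to interpret when the surviving coefficients are ranked by absolute size. A small sketch using a handful of the values from the output (rounded; dropping the zeroed-out features first):

```python
# Sketch: ranking lasso coefficients by magnitude. The values below are a
# rounded subset copied from the coefficient dictionary above.
import pandas as pd

coefs = pd.Series({'Unemployment': -2210.67, 'PublicWork': -1006.81,
                   'ChildPoverty': -974.90, 'Poverty': -804.48,
                   'Midwest': 790.07, 'Professional': 764.76,
                   'Walk': 713.53, 'Asian': 0.0, 'Carpool': 0.0})

nonzero = coefs[coefs != 0]                                      # drop excluded features
ranked = nonzero.reindex(nonzero.abs().sort_values(ascending=False).index)
print(ranked)
```

Because the features were standardized, the magnitudes are comparable: Unemployment dominates, with roughly twice the influence of the next-strongest predictors.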
In [188]:
# calculate mean squared error of train and test sets
from sklearn import metrics

train_mse = metrics.mean_squared_error(y_train, lasso.predict(X_train))
test_mse = metrics.mean_squared_error(y_test, lasso.predict(X_test))
print('Training MSE: ' + str(train_mse))
print('Test MSE: ' + str(test_mse))
Training MSE: 9713781.41543768
Test MSE: 10653487.948970526
In [189]:
# calculate the root mean squared error (RMSE)
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
print('Training RMSE: ' + str(train_rmse))
print('Test RMSE: ' + str(test_rmse))
Training RMSE: 3116.6939880966306
Test RMSE: 3263.9681292822893
In [190]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.scatter(range(len(y_test)), abs(y_test - lasso.predict(X_test)), s=15)
plt.axhline(test_rmse, color='black')
sns.despine()
plt.title('Abs. Error of Employed Rate w/ RMSE Line', loc='left', fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('abs_err_employed_rate.png')

[Figure: absolute test-set errors of the Employed Rate predictions, with a horizontal line at the test RMSE]

Summary of Lasso Regression for Predicting Employed Rate

The lasso regression model retained 20 of the 32 candidate features to predict Employed Rate. The training set scored 0.75 (the model explains 75% of the variation in the response variable) and the test set scored 0.71 (71% of the variation explained), so performance drops only slightly on unseen data. The root mean squared error, which converts the error back to the target's original units, is about 3,264 for the test set: on average, the model's prediction of the number of people employed per 100,000 was off by roughly 3,264.

 
