Predicting CO2 Emissions from Vehicles with Multivariate Linear Regression

The data is provided by the Government of Canada and gives model-specific fuel consumption ratings and estimated carbon dioxide (CO2) emissions for new light-duty vehicles offered for retail sale in Canada. For this project I used the model-year 2010 to 2018 datasets. I want to see how well CO2 emissions from these vehicles can be predicted with multivariate linear regression, since, as the exploration below shows, the predictors have a roughly linear relationship with the target.

 

In [1]:
# setup environment
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Data Prep

In [2]:
# read in datasets
data1 = pd.read_csv('MY2010 Fuel Consumption Ratings 5-cycle.csv', encoding='1252')
data2 = pd.read_csv('MY2011 Fuel Consumption ratings 5-cycle.csv', encoding='1252')
data3 = pd.read_csv('MY2012 Fuel Consumption Ratings 5-cycle.csv', encoding='1252')
data4 = pd.read_csv('MY2013 Fuel Consumption Ratings (5-cycle).csv', encoding='1252')
data5 = pd.read_csv('MY2014 Fuel Consumption Ratings (5-cycle).csv', encoding='1252')
data6 = pd.read_csv('MY2015 Fuel Consumption Ratings (5-cycle).csv', encoding='1252')
data7 = pd.read_csv('MY2016 Fuel Consumption Ratings.csv', encoding='1252')
data8 = pd.read_csv('MY2017 Fuel Consumption Ratings.csv', encoding='1252')
data9 = pd.read_csv('MY2018 Fuel Consumption Ratings.csv', encoding='1252')
In [3]:
# combine datasets into one
df = pd.concat([data1,data2,data3,data4,data5,data6,data7,data8,data9])
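Note that each yearly file keeps its own 0-based index, so the combined frame carries duplicate index values (visible in the tail output below). As an aside, here is a minimal alternative sketch for loading and stacking the files with a fresh index; the glob pattern is hypothetical and assumes the nine CSVs sit in the working directory:

import glob

files = sorted(glob.glob('MY201* Fuel Consumption*.csv'))  # hypothetical pattern for the nine files
df_all = pd.concat((pd.read_csv(f, encoding='1252') for f in files),
                   ignore_index=True)                      # renumber rows 0..n-1 instead of keeping per-file indices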
In [4]:
# view head of df
df.head()
Out[4]:
   | Model Yr | Make  | Model         | Vehicle Class | Engine Size | Cylinders | Transmission | Fuel Type | CITY (L/100 km) | HWY (L/100 km) | COMB (L/100 km) | COMB (mpg) | CO2 Emissions (g/km)
 0 | 2010     | ACURA | CSX           | COMPACT       | 2.0         | 4         | AS5          | X         | 10.9            | 7.8            | 9.5             | 30         | 219
 1 | 2010     | ACURA | CSX           | COMPACT       | 2.0         | 4         | M5           | X         | 10.0            | 7.6            | 8.9             | 32         | 205
 2 | 2010     | ACURA | CSX           | COMPACT       | 2.0         | 4         | M6           | Z         | 11.6            | 8.1            | 10.0            | 28         | 230
 3 | 2010     | ACURA | MDX AWD       | SUV           | 3.7         | 6         | AS6          | Z         | 14.8            | 11.3           | 13.2            | 21         | 304
 4 | 2010     | ACURA | RDX AWD TURBO | SUV           | 2.3         | 4         | AS5          | Z         | 13.2            | 10.3           | 11.9            | 24         | 274
In [5]:
# view tail of df
df.tail()
Out[5]:
     | Model Yr | Make  | Model         | Vehicle Class            | Engine Size | Cylinders | Transmission | Fuel Type | CITY (L/100 km) | HWY (L/100 km) | COMB (L/100 km) | COMB (mpg) | CO2 Emissions (g/km)
1062 | 2018     | VOLVO | V90 CC T6 AWD | STATION WAGON – MID-SIZE | 2.0         | 4         | AS8          | Z         | 10.9            | 8.0            | 9.6             | 29         | 224
1063 | 2018     | VOLVO | XC60 T5 AWD   | SUV – SMALL              | 2.0         | 4         | AS8          | Z         | 10.7            | 8.5            | 9.8             | 29         | 228
1064 | 2018     | VOLVO | XC60 T6 AWD   | SUV – SMALL              | 2.0         | 4         | AS8          | Z         | 11.4            | 8.7            | 10.2            | 28         | 240
1065 | 2018     | VOLVO | XC90 T5 AWD   | SUV – STANDARD           | 2.0         | 4         | AS8          | Z         | 10.9            | 8.3            | 9.7             | 29         | 227
1066 | 2018     | VOLVO | XC90 T6 AWD   | SUV – STANDARD           | 2.0         | 4         | AS8          | Z         | 11.5            | 8.8            | 10.3            | 27         | 240
In [6]:
# nulls
df.isnull().sum()
Out[6]:
Model Yr                0
Make                    0
Model                   0
Vehicle Class           0
Engine Size             0
Cylinders               0
Transmission            0
Fuel Type               0
CITY (L/100 km)         0
HWY (L/100 km)          0
COMB (L/100 km)         0
COMB (mpg)              0
CO2 Emissions (g/km)    0
dtype: int64
In [7]:
# only keep relevant columns
# CITY, HWY, and COMB overlap; either CITY and HWY together or COMB alone is enough
# I will use COMB
df_clean = df[['Engine Size','Cylinders','Fuel Type','COMB (L/100 km)','CO2 Emissions (g/km)']]
df_clean.head()
Out[7]:
  | Engine Size | Cylinders | Fuel Type | COMB (L/100 km) | CO2 Emissions (g/km)
0 | 2.0         | 4         | X         | 9.5             | 219
1 | 2.0         | 4         | X         | 8.9             | 205
2 | 2.0         | 4         | Z         | 10.0            | 230
3 | 3.7         | 6         | Z         | 13.2            | 304
4 | 2.3         | 4         | Z         | 11.9            | 274
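Before encoding, it can be worth confirming which fuel-type categories actually occur in the data (a quick optional check; in this dataset the codes are single letters such as X, Z, D, E, and N):

# count rows per fuel type before one-hot encoding
df_clean['Fuel Type'].value_counts()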
In [8]:
# create dummy variables for Fuel Type
# drop_first=True drops the alphabetically first category (here D) to avoid perfect multicollinearity
fuel_dummies = pd.get_dummies(df_clean['Fuel Type'], drop_first=True)

# add to df_clean
df_clean = pd.concat([df_clean, fuel_dummies], axis=1)

# drop Fuel Type
df_clean.drop('Fuel Type', axis=1, inplace=True)

# summarize
df_clean.head()
Out[8]:
  | Engine Size | Cylinders | COMB (L/100 km) | CO2 Emissions (g/km) | E | N | X | Z
0 | 2.0         | 4         | 9.5             | 219                  | 0 | 0 | 1 | 0
1 | 2.0         | 4         | 8.9             | 205                  | 0 | 0 | 1 | 0
2 | 2.0         | 4         | 10.0            | 230                  | 0 | 0 | 0 | 1
3 | 3.7         | 6         | 13.2            | 304                  | 0 | 0 | 0 | 1
4 | 2.3         | 4         | 11.9            | 274                  | 0 | 0 | 0 | 1
In [9]:
# shape of df
df_clean.shape
Out[9]:
(9720, 8)
In [10]:
# data types
df_clean.dtypes
Out[10]:
Engine Size             float64
Cylinders                 int64
COMB (L/100 km)         float64
CO2 Emissions (g/km)      int64
E                         uint8
N                         uint8
X                         uint8
Z                         uint8
dtype: object

Data Exploration

In [11]:
# plot pairplot, exclude the dummy variables
sns.pairplot(df_clean.drop(['E','N','X','Z'], axis=1), palette='steelblue', 
             diag_kws=dict(color='gray',alpha=0.5, edgecolor='white',lw=0.3))
plt.savefig('co2emissions_pairplot.png')

[co2emissions_pairplot.png: pairplot of Engine Size, Cylinders, COMB (L/100 km), and CO2 Emissions (g/km)]
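The pairplot suggests a roughly linear relationship between the features and CO2 emissions, especially for COMB fuel consumption. To back the visual impression with numbers, the pairwise correlations against the target could also be printed (an optional sketch; DataFrame.corr() computes Pearson correlations by default):

# correlations of the numeric features with the target, dummy columns excluded as in the pairplot
corr = df_clean.drop(['E','N','X','Z'], axis=1).corr()
print(corr['CO2 Emissions (g/km)'].sort_values(ascending=False))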

Train, Test, Split

In [12]:
# import train, test, split
from sklearn.model_selection import train_test_split

# define X and y variables
X = df_clean.drop('CO2 Emissions (g/km)', axis=1)
y = df_clean['CO2 Emissions (g/km)']

# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
In [13]:
X_train.shape
Out[13]:
(6804, 7)
In [14]:
X_test.shape
Out[14]:
(2916, 7)

Model

In [15]:
# import model
from sklearn.linear_model import LinearRegression
In [16]:
# instantiate and fit model
reg = LinearRegression().fit(X_train, y_train)
In [17]:
# score model
print('Training set score: {:.3f}'.format(reg.score(X_train, y_train)))
print('Test set score: {:.3f}'.format(reg.score(X_test, y_test)))
Training set score: 0.989
Test set score: 0.988

Model Score

About 99% of the variation in the response variable (CO2 Emissions) is explained by the independent variables in the model.
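For interpretation, the fitted coefficients can be paired with the feature names (an optional sketch reusing the fitted reg from above; the exact values depend on the data and the split):

# pair each coefficient with its feature name
coefs = pd.Series(reg.coef_, index=X_train.columns)
print(coefs)
print('Intercept: {:.3f}'.format(reg.intercept_))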

 

In [18]:
# prediction
y_pred = reg.predict(X_test)
In [19]:
# plot distribution of residuals (computed here as predicted minus actual)
y_diff = y_pred - y_test

plt.hist(y_diff, color='steelblue', edgecolor='white', lw=0.25)
plt.title('Residuals Distribution', loc='left', fontweight='bold', y=1.02)
sns.despine()
plt.savefig('residuals_dist.png')

[residuals_dist.png: histogram of the model 1 residuals]
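To express the prediction error in the target's own units (g/km), error metrics such as MAE and RMSE could also be reported (an optional sketch using scikit-learn's metrics module):

from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('MAE:  {:.2f} g/km'.format(mae))
print('RMSE: {:.2f} g/km'.format(rmse))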

Model 2

The first model had a much higher score than I expected. I wonder whether this is due to the feature engineering on Fuel Type, i.e. converting it to dummy variables. Let's re-run the model, this time without the dummy variables.

 

In [20]:
# define variables
X2 = df_clean.drop(['E','N','X','Z','CO2 Emissions (g/km)'], axis=1)
y2 = df_clean['CO2 Emissions (g/km)']

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, random_state=1)
In [21]:
# instantiate and fit model
reg2 = LinearRegression().fit(X2_train, y2_train)
In [22]:
# print scores
print('Training set 2 score: {:.3f}'.format(reg2.score(X2_train, y2_train)))
print('Test set 2 score: {:.3f}'.format(reg2.score(X2_test, y2_test)))
Training set 2 score: 0.864
Test set 2 score: 0.863
In [23]:
# prediction
y2_pred = reg2.predict(X2_test)
In [24]:
# plot distribution of residuals (predicted minus actual)
y2_diff = y2_pred - y2_test

plt.hist(y2_diff, color='steelblue', edgecolor='white', lw=0.25)
plt.title('Residuals Distribution', loc='left', fontweight='bold', y=1.02)
sns.despine()
plt.savefig('residuals_dist2.png')

[residuals_dist2.png: histogram of the model 2 residuals]
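To quantify how much more spread out the second model's residuals are, the standard deviations of the two residual series can be compared directly (an optional sketch, assuming both models were fit as above):

# compare residual spread between the two models
print('Model 1 residual std: {:.2f} g/km'.format(y_diff.std()))
print('Model 2 residual std: {:.2f} g/km'.format(y2_diff.std()))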

Summary

It turns out that converting Fuel Type to dummy variables strengthened the model's performance. Without this feature engineering, the model explained only about 86% of the variation in the response variable, and its residuals distribution is much more spread out than in the first model. With the dummy variables included, the model explained about 99% of the variation in the response variable.

 
