Predicting Boston House Prices Using a Linear Regression Model

This machine learning project uses a real-world dataset of Boston housing statistics from the 1970s. I used a linear regression model to predict home prices from key features.

In [1]:
# Setup environment
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [2]:
# Import the dataset
from sklearn.datasets import load_boston
boston = load_boston()
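(Note: load_boston shipped with scikit-learn when this was written, but it was deprecated in scikit-learn 1.0 and removed in 1.2. To follow along on a current version, the data can be pulled from the original StatLib source instead; this is the workaround scikit-learn's own deprecation notice suggested, and the resulting data and target match boston['data'] and boston['target']:)

# Alternative loading path for scikit-learn >= 1.2
data_url = 'http://lib.stat.cmu.edu/datasets/boston'
raw_df = pd.read_csv(data_url, sep='\s+', skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 features
target = raw_df.values[1::2, 2]  # MEDV, median home value in $1000's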
In [3]:
# View dataset information
boston.keys()
Out[3]:
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
In [4]:
boston['feature_names']
Out[4]:
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
In [5]:
# Print DESCR to understand the feature names and the dimensions of the data
print(boston['DESCR'])
Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

In [6]:
# Create the list of feature names to use as dataframe columns
# (equivalent to list(boston['feature_names']))
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT']
In [7]:
# Create dataframe
boston_df = pd.DataFrame(boston['data'], columns=cols)

# Add prices to the dataframe; prices are in thousands of dollars
boston_df['Price'] = boston['target']
boston_df.head()
Out[7]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT Price
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
In [8]:
# Summary stats of df
boston_df.describe()
Out[8]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT Price
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.593761 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.596783 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.647423 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000
In [9]:
# Plot pairplot of data
import seaborn as sns
sns.pairplot(boston_df)
plt.savefig('boston_pairplot.png')

[Figure: boston_pairplot.png, pairplot of every feature pair in the dataframe]

In [10]:
# Plot correlation heatmap; plt.subplots returns a (figure, axes) tuple
fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(boston_df.corr(), annot=True, cmap='PiYG', center=0)
plt.savefig('boston_corr.png')

[Figure: boston_corr.png, annotated correlation heatmap of all features]
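The heatmap motivates the feature selection below. As a companion step (not in the original run), ranking features by the absolute value of their correlation with price makes the strongest linear predictors easy to spot:

# Rank features by |correlation| with price
corr_with_price = boston_df.corr()['Price'].drop('Price')
print(corr_with_price.abs().sort_values(ascending=False))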

In [11]:

# Limit the data to the following features
boston_ft = boston_df[['LSTAT','RM','AGE','NOX','INDUS','RAD','TAX','PTRATIO','CRIM','ZN']]
boston_ft.head()
Out[11]:
LSTAT RM AGE NOX INDUS RAD TAX PTRATIO CRIM ZN
0 4.98 6.575 65.2 0.538 2.31 1.0 296.0 15.3 0.00632 18.0
1 9.14 6.421 78.9 0.469 7.07 2.0 242.0 17.8 0.02731 0.0
2 4.03 7.185 61.1 0.469 7.07 2.0 242.0 17.8 0.02729 0.0
3 2.94 6.998 45.8 0.458 2.18 3.0 222.0 18.7 0.03237 0.0
4 5.33 7.147 54.2 0.458 2.18 3.0 222.0 18.7 0.06905 0.0
In [12]:
# Split the data into train and test
from sklearn.model_selection import train_test_split
X = boston_ft
y = boston_df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
In [13]:
# Output train and test shapes
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(354, 10)
(354,)
(152, 10)
(152,)
In [14]:
# Instantiate the linear regression model
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
In [15]:
# Fit the model
lm.fit(X_train, y_train)
Out[15]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
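Before predicting, the fitted model itself can be inspected. This step wasn't in the original run, but it shows each feature's coefficient, i.e., the estimated change in price (in thousands) per unit increase in that feature, holding the others fixed:

# Inspect the fitted coefficients and intercept
coef_df = pd.DataFrame(lm.coef_, index=X.columns, columns=['Coefficient'])
print(coef_df)
print('Intercept:', lm.intercept_)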
In [16]:
# Predict prices
predictions = lm.predict(X_test)
In [17]:
# Plot y_test and predictions
plt.scatter(y_test, predictions, s=4)
plt.savefig('boston_pred.png')

[Figure: boston_pred.png, actual vs. predicted prices]

In [18]:
# Plot distribution of residuals
residuals = y_test - predictions
plt.hist(residuals, edgecolor='white', linewidth=0.25)
sns.despine(left=True, top=True, right=True, bottom=True)
plt.title('Residuals Distribution', loc='left', fontweight='bold', y=1.02)
plt.text(8, 35, '(Actual - Predicted)\nPrice residuals primarily\ncenter around -5 to 2.', fontsize=12)
plt.savefig('boston_residuals.png')
plt.show()

[Figure: boston_residuals.png, distribution of the residuals]

In [19]:

# Look at error metrics
from sklearn import metrics
mae = metrics.mean_absolute_error(y_test, predictions)
mae
Out[19]:
3.783076861768958
In [20]:
mse = metrics.mean_squared_error(y_test, predictions)
mse
Out[20]:
31.04256862172285
In [21]:
rmse = np.sqrt(mse)
rmse
Out[21]:
5.57158582647013

Results

Splitting the data 70/30 into train and test sets and limiting the features to LSTAT, RM, AGE, NOX, INDUS, RAD, TAX, PTRATIO, CRIM, and ZN resulted in an MAE = 3.78, an MSE = 31.04, and an RMSE = 5.57. What kind of results would we get if the features were limited to just LSTAT, RM, DIS, and CRIM?
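For reference, all three metrics are simple functions of the residuals. A minimal NumPy sketch, equivalent to the sklearn.metrics calls above:

errors = y_test - predictions   # residuals
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root of MSE, in the same units as price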


In [22]:
# Limit features
boston_ft4 = boston_df[['LSTAT','RM','DIS','CRIM']]
boston_ft4.head()
Out[22]:
LSTAT RM DIS CRIM
0 4.98 6.575 4.0900 0.00632
1 9.14 6.421 4.9671 0.02731
2 4.03 7.185 4.9671 0.02729
3 2.94 6.998 6.0622 0.03237
4 5.33 7.147 6.0622 0.06905
In [23]:
# Split the data; the target y is unchanged
X2 = boston_ft4
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y, test_size=0.3, random_state=10)
In [24]:
# Output train and test shapes
print(X_train2.shape)
print(y_train2.shape)
print(X_test2.shape)
print(y_test2.shape)
(354, 4)
(354,)
(152, 4)
(152,)
In [25]:
# Instantiate new linear regression model
lm2 = LinearRegression()
In [26]:
# Fit the model
lm2.fit(X_train2, y_train2)
Out[26]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [27]:
# Predict prices
predictions2 = lm2.predict(X_test2)
In [28]:
# Plot y_test2 and predictions2
plt.scatter(y_test2, predictions2, s=4)
plt.savefig('boston_pred2.png')

[Figure: boston_pred2.png, actual vs. predicted prices, four-feature model]

In [29]:
# Plot distribution of residuals
residuals2 = y_test2 - predictions2
plt.hist(residuals2, edgecolor='white', linewidth=0.25)
sns.despine(left=True, top=True, right=True, bottom=True)
plt.title('Residuals2 Distribution', loc='left', fontweight='bold', y=1.02)
plt.text(8, 35, '(Actual - Predicted)\nPrice residuals primarily\ncenter around -3 to 0,\nbut a little more spread\nthan previous residuals.', fontsize=12)
plt.savefig('boston_residuals2.png')
plt.show()

[Figure: boston_residuals2.png, distribution of the residuals, four-feature model]

In [30]:

# Look at error metrics
mae2 = metrics.mean_absolute_error(y_test2, predictions2)
mae2
Out[30]:
4.1560903472578135
In [31]:
mse2 = metrics.mean_squared_error(y_test2, predictions2)
mse2
Out[31]:
34.819946398328284
In [32]:
rmse2 = np.sqrt(mse2)
rmse2
Out[32]:
5.900842854908804

Results 2

Keeping the same 70/30 train/test split but limiting the model to four key features (LSTAT, RM, DIS, and CRIM) resulted in an MAE = 4.16, an MSE = 34.82, and an RMSE = 5.90. All three error metrics are slightly worse than the ten-feature model's, so the dropped features evidently carried some predictive signal.
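One caveat: a single 70/30 split can be noisy, so part of the gap between the two models may come down to which rows landed in the test set. A quick sketch (not part of the original notebook) using 5-fold cross-validation gives a more stable comparison:

from sklearn.model_selection import cross_val_score

for name, features in [('10 features', boston_ft), ('4 features', boston_ft4)]:
    scores = -cross_val_score(LinearRegression(), features, y,
                              scoring='neg_mean_absolute_error', cv=5)
    print(name, 'MAE: %.2f +/- %.2f' % (scores.mean(), scores.std()))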
