Predicting Survival on the Titanic with Logistic Regression

This project uses two datasets, train and test, both from the introductory Kaggle competition Titanic: Machine Learning from Disaster. I fit a logistic regression model to predict whether or not a passenger survived. My accuracy ended up at 75.6%, but I plan on revisiting this dataset to improve it.

 

In [1]:
# Set up environment
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
# Read in the train and test files
train = pd.read_csv('titanic_train.csv')
test = pd.read_csv('titanic_test.csv')

Exploring the Train Data

In [3]:
train.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [4]:
train.shape
Out[4]:
(891, 12)
In [5]:
test.shape
Out[5]:
(418, 11)

Train vs. Test Data

The train and test data are split roughly 70/30 (891 and 418 rows, or about 68%/32%), a common way of dividing data in machine learning.
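A quick sketch of that check, using the dataframes loaded above:

# Rough check of the train/test proportions from the row counts
n_train, n_test = len(train), len(test)
total = n_train + n_test
print('Train: {} rows ({:.1%}), Test: {} rows ({:.1%})'.format(n_train, n_train / total, n_test, n_test / total))
# Prints roughly: Train: 891 rows (68.1%), Test: 418 rows (31.9%)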

 

In [6]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [7]:
train.describe()
Out[7]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [8]:
# Where are we missing a lot of data?
sns.heatmap(train.isnull(), cbar=False, cmap='gray', yticklabels=False)
plt.title('White space signifies missing data', loc='left', fontweight='bold', y=1.02)
plt.savefig('missing_data_heatmap.png')

missing_data_heatmap

Missing Data

We are missing a lot of data for Cabin and a decent amount for Age. I will deal with Age later on by replacing the missing ages with the average age for that passenger's Pclass. The cabins, however, are too sparse to reconstruct reliably. Let's remove Cabin from the data, along with Ticket and Name, as these are not helpful for predicting our target, Survived. PassengerId is simply an id number assigned to each passenger; it is potentially useful for reporting, but the algorithm does not need it, so I will drop it as well. Keep in mind that whatever we do to the train dataset we will have to do to the test dataset as well. I will apply each change to both at the same time to keep them consistent.
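As a numeric complement to the heatmap, a small sketch of checking the share of missing values per column directly:

# Fraction of missing values per column, largest first
train.isnull().mean().sort_values(ascending=False).head()
# Cabin is roughly 77% missing, Age roughly 20%, Embarked about 0.2%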

 

In [9]:
drop_cols = ['Name','Ticket','Cabin']
train = train.drop(drop_cols, axis=1)
test = test.drop(drop_cols, axis=1)
In [10]:
train.columns
Out[10]:
Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Fare', 'Embarked'],
      dtype='object')
In [11]:
# The test dataset does not have Survived because that is the target variable
test.columns
Out[11]:
Index(['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='object')
In [12]:
# Print survival counts
train['Survived'].value_counts()
Out[12]:
0    549
1    342
Name: Survived, dtype: int64
In [13]:
# Print survival proportions
train['Survived'].value_counts(normalize=True)
Out[13]:
0    0.616162
1    0.383838
Name: Survived, dtype: float64
In [14]:
# Plot count of survival by sex; 0 = died and 1 = survived
sns.countplot(x='Survived', hue='Sex', data=train, palette='RdBu_r')
sns.despine()
plt.title('Survival by Sex', loc='left', fontweight='bold', y=1.02)
plt.savefig('survival_by_sex.png')

survival_by_sex

In [15]:
# Plot pct survived by sex
sns.barplot(x='Sex', y='Survived', data=train, estimator=lambda x: sum(x==1)/len(x) * 100, palette='RdBu_r')
sns.despine()
plt.title('Pct Survived by Sex', loc='left', fontweight='bold', y=1.02)
plt.ylabel('% Survived')
plt.xlabel('')
plt.savefig('pct_survived_by_sex.png')

pct_survival_by_sex

In [16]:
# Plot survival by passenger class (Pclass)
sns.countplot(x='Survived', hue='Pclass', data=train, palette='Set2')
sns.despine()
plt.title('Survival by Passenger Class', loc='left', fontweight='bold', y=1.02)
plt.savefig('survival_by_pclass.png')

survival_by_pclass

In [17]:
# Plot pct survived by pclass
sns.barplot(x='Pclass', y='Survived', data=train, estimator=lambda x: sum(x==1)/len(x) * 100, palette='Set2')
sns.despine()
plt.title('Pct Survived by Passenger Class', loc='left', fontweight='bold', y=1.02)
plt.ylabel('% Survived')
plt.savefig('pct_survived_by_pclass.png')

pct_survived_by_pclass

In [18]:
# Plot survival by Embarked
sns.countplot(x='Embarked', hue='Survived', data=train, palette=['orange','lightgray'])
sns.despine()
plt.title('Survival by Place of Embarkment', loc='left', fontweight='bold', y=1.02)
plt.savefig('survival_by_embarked.png')

survival_by_embarked

In [19]:

# Create function to replace missing ages with avg age for that Pclass
def replace_age(cols):
    # cols is a row with 'Age' and 'Pclass'; a missing Age is filled with the
    # mean age of passengers in the same class (computed from train)
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        return train['Age'][train['Pclass'] == Pclass].mean()
    else:
        return Age
In [20]:
# Replace missing ages for both the train and test datasets
train['Age'] = train[['Age','Pclass']].apply(replace_age, axis=1)
test['Age'] = test[['Age','Pclass']].apply(replace_age, axis=1)
In [21]:
# Make Age int
train['Age'] = train['Age'].astype(int)
test['Age'] = test['Age'].astype(int)
In [22]:
train.isnull().sum()
Out[22]:
PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       2
dtype: int64
In [23]:
test.isnull().sum()
Out[23]:
PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           1
Embarked       0
dtype: int64

Missing Data Update

We no longer have missing ages. There are still 2 rows in the train dataset missing Embarked and 1 row in the test dataset missing Fare. Let's drop those two rows from train. I can't drop rows from test, because the Kaggle submission must contain exactly 418 rows (plus header). Instead, I will replace the missing Fare in test with the overall mean fare. If there were many missing fares, I would compute the mean by Pclass instead (sketched below).
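For reference, a sketch of what the class-based fare imputation would look like if it were needed (it is not applied below; the fare_by_class and test_alt names are just for illustration):

# Hypothetical alternative: fill missing Fares with the mean Fare of that row's Pclass
fare_by_class = test.groupby('Pclass')['Fare'].transform('mean')  # mean fare of each row's class
test_alt = test.copy()
test_alt['Fare'] = test_alt['Fare'].fillna(fare_by_class)  # fill on a copy, leaving test unchanged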

 

In [24]:
train = train.dropna()
In [25]:
train.isnull().sum()
Out[25]:
PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64
In [26]:
# Replace null Fare in test with the overall mean Fare
test.loc[test['Fare'].isnull(), 'Fare'] = test['Fare'].mean()
test.isnull().sum()
In [27]:
# Plot distribution of ages by Survived
sns.distplot(train['Age'][train['Survived']==1], hist=False, bins=range(0,80,5), color='gray', label='Survived')
sns.distplot(train['Age'][train['Survived']==0], hist=False, bins=range(0,80,5), color='orange', label='Died')
plt.legend()
sns.despine()
plt.title('Density Plot of Age by Survived', loc='left', fontweight='bold', y=1.02)
plt.savefig('density_age_survived.png')

density_age_survived

In [28]:
# Plot distribution of fares by Survived
sns.distplot(train['Fare'][train['Survived']==1], hist=False, color='gray', label='Survived')
sns.distplot(train['Fare'][train['Survived']==0], hist=False, color='orange', label='Died')
plt.legend()
sns.despine()
plt.title('Density Plot of Fare by Survived', loc='left', fontweight='bold', y=1.02)
plt.savefig('density_fare_survived.png')

density_fare_survived

Data Exploration Summary

From my analysis of the data, a passenger had a higher probability of dying if three things were true: the passenger was 1) male; 2) between 20 and 30 years old; and 3) in 3rd class. The majority of those who embarked at 'S' were likely 3rd class, as they had more passengers who died than survived. Women and children were more likely to survive, likely because they were given preference for the lifeboats. Additionally, 1st class had a higher percentage of survival than 2nd and 3rd class, and 2nd class had a higher percentage of survival than 3rd class. This is likely due to the layering of the cabin decks, with 1st class near the top of the ship, 3rd class near the bottom, and 2nd class in between.

Machine learning algorithms generally can't handle categorical text features directly, so next I create dummy (0/1) variables for the Sex and Embarked features.
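Since the dummy columns for train and test need to line up, here is a hedged sketch of one way to guard against a category that appears in only one of the two files (the example_* names are just for illustration; in this dataset Sex and Embarked happen to share the same categories in both files):

# Sketch: build dummies for both files, then reindex the test dummies to the train columns
example_train_dummies = pd.get_dummies(train['Embarked'], drop_first=True)
example_test_dummies = pd.get_dummies(test['Embarked'], drop_first=True)
example_test_dummies = example_test_dummies.reindex(columns=example_train_dummies.columns, fill_value=0)
example_test_dummies.head()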

 

In [29]:
# Create dummy variables of the Sex and Embarked features in the train dataset
dummy_sex = pd.get_dummies(train['Sex'], drop_first=True) # drop one sex due to collinearity
dummy_embarked = pd.get_dummies(train['Embarked'], drop_first=True)
In [30]:
# Concatenate the dummy features to the train dataset; remove the Sex and Embarked columns
train = pd.concat([train, dummy_sex, dummy_embarked], axis=1)
train = train.drop(['Sex','Embarked'], axis=1)
train.head()
Out[30]:
PassengerId Survived Pclass Age SibSp Parch Fare male Q S
0 1 0 3 22 1 0 7.2500 1 0 1
1 2 1 1 38 1 0 71.2833 0 0 0
2 3 1 3 26 0 0 7.9250 0 0 1
3 4 1 1 35 1 0 53.1000 0 0 1
4 5 0 3 35 0 0 8.0500 1 0 1
In [31]:
# Create dummy variables of the Sex and Embarked features in the test dataset
dummy_sex = pd.get_dummies(test['Sex'], drop_first=True) # drop one sex due to collinearity
dummy_embarked = pd.get_dummies(test['Embarked'], drop_first=True)
In [32]:
# Concatenate the dummy features to the test dataset; remove the Sex and Embarked columns
test = pd.concat([test, dummy_sex, dummy_embarked], axis=1)
test = test.drop(['Sex','Embarked'], axis=1)
test.head()
Out[32]:
PassengerId Pclass Age SibSp Parch Fare male Q S
0 892 3 34 0 0 7.8292 1 1 0
1 893 3 47 1 0 7.0000 0 0 1
2 894 2 62 0 0 9.6875 1 1 0
3 895 3 27 0 0 8.6625 1 0 1
4 896 3 22 1 1 12.2875 0 0 1
In [33]:
# Save test PassengerId, then remove PassengerId from both datasets
test_passengerids = test['PassengerId'] # I need this for the submission process
train = train.drop('PassengerId', axis=1)
test = test.drop('PassengerId', axis=1)

Logistic Regression Model

In [34]:
# Import the model from sklearn
from sklearn.linear_model import LogisticRegression
In [35]:
# Instantiate the logistic regression model
logmodel = LogisticRegression()
In [36]:
# Fit the model
train_feat = train.drop('Survived', axis=1)
train_label = train['Survived']
logmodel.fit(train_feat, train_label)
Out[36]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
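To get a rough feel for which features the fitted model leans on, one quick inspection (a sketch, not part of the original notebook) is to pair the coefficients with the feature names:

# Sketch: fitted log-odds coefficients by feature; negative pushes toward died, positive toward survived
pd.Series(logmodel.coef_[0], index=train_feat.columns).sort_values()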
In [37]:
# Predict survival
test_pred = logmodel.predict(test)
In [38]:
test_pred = pd.Series(test_pred, name='pred')

Sanity Check

I would expect the survival proportions in the test predictions to be similar to those in the train data. A quick summary of the predictions shows whether they line up with the train statistics.

 

In [39]:
test_pred.value_counts()
Out[39]:
0    264
1    154
Name: pred, dtype: int64
In [40]:
test_pred.value_counts(normalize=True)
Out[40]:
0    0.631579
1    0.368421
Name: pred, dtype: float64

Sanity Check Findings

The train dataset had a survival split of 62% died and 38% survived. My predictions are very close to this, with 63% predicted to have died and 37% to have survived. This is an encouraging sign, though matching the overall proportions does not by itself prove the individual predictions are accurate.

Submission

The actual Survived values for the test dataset are not provided, so I cannot measure my model's accuracy on it directly. I need to submit my predictions to Kaggle to learn how accurate they are; assuming the train dataset is a representative sample, the test accuracy should be comparable to performance on the training data.
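Before submitting, a rough local accuracy estimate is still possible with cross-validation on the train data (a hedged sketch, not part of the original submission workflow):

# Sketch: 5-fold cross-validated accuracy on the train data as a rough local estimate
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(LogisticRegression(), train_feat, train_label, cv=5, scoring='accuracy')
print(cv_scores.mean())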

 

In [41]:
# Create the CSV file to submit to Kaggle
submission_dict = {
    'PassengerId':test_passengerids,
    'Survived':test_pred
}
submission = pd.DataFrame(submission_dict)
submission.head()
Out[41]:
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 1
In [42]:
submission.tail()
Out[42]:
PassengerId Survived
413 1305 0
414 1306 1
415 1307 0
416 1308 0
417 1309 0
In [43]:
len(submission)
Out[43]:
418
In [44]:
# Save submission to csv
submission.to_csv('submission.csv', index=False) # Remove index for proper submission to Kaggle

Submission Results

The accuracy of this initial submission, using a logistic regression model, was 0.75598, or 75.6%. I plan to revisit this dataset as I continue to improve my machine learning skills, with the aim of improving the model's accuracy.

 

 

 
