Predicting Survival on the Titanic with Logistic Regression

This project uses two datasets, train and test. Both are from the Kaggle machine learning training competition, Titanic: Machine Learning from Disaster. I use a logistic regression model to predict whether or not a passenger survived. My submission scored 75.6% accuracy, and I plan on revisiting this dataset to improve on that.

In [1]:
```# Set up environment
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```
In [2]:
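```# Read in the train and test files
# (assumes Kaggle's train.csv and test.csv are in the working directory)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
```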

Exploring the Train Data

In [3]:
```train.head()
```
Out[3]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [4]:
```train.shape
```
Out[4]:
`(891, 12)`
In [5]:
```test.shape
```
Out[5]:
`(418, 11)`

Train vs. Test Data

The train and test data are split roughly 70/30 (891 vs. 418 rows, or about 68/32), which is a common way of dividing the data in machine learning.
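
The split can be checked directly from the two shapes reported above. A quick sketch of the arithmetic, using only the row counts from `train.shape` and `test.shape`:

```python
# Row counts taken from train.shape and test.shape above
n_train, n_test = 891, 418
n_total = n_train + n_test

train_share = n_train / n_total
test_share = n_test / n_total
print(f"train: {train_share:.1%}, test: {test_share:.1%}")
# prints "train: 68.1%, test: 31.9%"
```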

In [6]:
```train.info()
```
```<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
```
In [7]:
```train.describe()
```
Out[7]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
In [8]:
```# Where are we missing a lot of data?
sns.heatmap(train.isnull(), cbar=False, cmap='gray', yticklabels=False)
plt.title('White space signifies missing data', loc='left', fontweight='bold', y=1.02)
plt.savefig('missing_data_heatmap.png')
```

Missing Data

We are missing a lot of data for Cabin (687 of 891 rows, about 77%) and a decent amount for Age (177 rows, about 20%). I will deal with Age later on by replacing the missing ages with the average age by Pclass. However, it’s really difficult to determine the cabins. Let’s remove Cabin from the data, along with Ticket and Name, as these are not helpful for predicting our target, Survived. Also, PassengerId is simply an id number assigned to each passenger. It is potentially useful for reporting but not necessary for the algorithm, so I will drop it as well, later on, once the test ids have been saved for the submission. Keep in mind that whatever we do to the train dataset we will have to do to the test dataset as well. I will do this simultaneously to keep the two consistent.
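
The heatmap gives a visual impression; the same information can be read off as a fraction per column. A minimal sketch on a made-up frame (the `toy` data is illustrative, not the Titanic data), relying on the fact that the mean of a boolean column is the share of `True` values:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train (values are made up for illustration)
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.93, 53.10],
})

# isnull() gives booleans; their mean is the fraction of missing values
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same expression would be `train.isnull().mean()`.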

In [9]:
```drop_cols = ['Name','Ticket','Cabin']
train = train.drop(drop_cols, axis=1)
test = test.drop(drop_cols, axis=1)
```
In [10]:
```train.columns
```
Out[10]:
```Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
'Fare', 'Embarked'],
dtype='object')```
In [11]:
```# The test dataset does not have Survived because that is the target variable
test.columns
```
Out[11]:
```Index(['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
'Embarked'],
dtype='object')```
In [12]:
```# Print survival counts
train['Survived'].value_counts()
```
Out[12]:
```0    549
1    342
Name: Survived, dtype: int64```
In [13]:
```# Print survival proportions
train['Survived'].value_counts(normalize=True)
```
Out[13]:
```0    0.616162
1    0.383838
Name: Survived, dtype: float64```
In [14]:
```# Plot count of survival by sex; 0 = died and 1 = survived
sns.countplot(x='Survived', hue='Sex', data=train, palette='RdBu_r')
sns.despine()
plt.title('Survival by Sex', loc='left', fontweight='bold', y=1.02)
plt.savefig('survival_by_sex.png')
```

In [15]:
```# Plot pct survived by sex
sns.barplot(x='Sex', y='Survived', data=train, estimator=lambda x: sum(x==1)/len(x) * 100, palette='RdBu_r')
sns.despine()
plt.title('Pct Survived by Sex', loc='left', fontweight='bold', y=1.02)
plt.ylabel('% Survived')
plt.xlabel('')
plt.savefig('pct_survived_by_sex.png')```

In [16]:
```# Plot survival by passenger class (Pclass)
sns.countplot(x='Survived', hue='Pclass', data=train, palette='Set2')
sns.despine()
plt.title('Survival by Passenger Class', loc='left', fontweight='bold', y=1.02)
plt.savefig('survival_by_pclass.png')```

In [17]:
```# Plot pct survived by pclass
sns.barplot(x='Pclass', y='Survived', data=train, estimator=lambda x: sum(x==1)/len(x) * 100, palette='Set2')
sns.despine()
plt.title('Pct Survived by Passenger Class', loc='left', fontweight='bold', y=1.02)
plt.ylabel('% Survived')
plt.savefig('pct_survived_by_pclass.png')```

In [18]:
```# Plot survival by Embarked
sns.countplot(x='Embarked', hue='Survived', data=train, palette=['orange','lightgray'])
sns.despine()
plt.title('Survival by Place of Embarkment', loc='left', fontweight='bold', y=1.02)
plt.savefig('survival_by_embarked.png')```

In [19]:

```# Create function to replace missing ages with avg age for that Pclass
def replace_age(cols):
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        # Use the mean age of the train passengers in the same class
        return train.loc[train['Pclass'] == Pclass, 'Age'].mean()
    else:
        return Age
```
In [20]:
```# Replace missing ages for both the train and test datasets
train['Age'] = train[['Age','Pclass']].apply(replace_age, axis=1)
test['Age'] = test[['Age','Pclass']].apply(replace_age, axis=1)
```
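
The row-by-row `apply` above works, but the same fill can be written without a custom function. The following is a sketch of one alternative, not the approach used in this notebook: compute the per-class means once with `groupby`, then `map` them onto each row's Pclass and pass the result to `fillna`. The `toy_train` and `toy_test` frames are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train and test (made-up values)
toy_train = pd.DataFrame({'Pclass': [1, 1, 3, 3, 3],
                          'Age': [40.0, 50.0, 20.0, np.nan, 30.0]})
toy_test = pd.DataFrame({'Pclass': [1, 3], 'Age': [np.nan, 18.0]})

# Mean age per class, computed from the training data only
class_means = toy_train.groupby('Pclass')['Age'].mean()

# Fill each missing age with the mean for that row's class
toy_train['Age'] = toy_train['Age'].fillna(toy_train['Pclass'].map(class_means))
toy_test['Age'] = toy_test['Age'].fillna(toy_test['Pclass'].map(class_means))
```

As in the original function, the test ages are filled with means from the train data, so no information from the test set leaks into the fill values.
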
In [21]:
```# Make Age int
train['Age'] = train['Age'].astype(int)
test['Age'] = test['Age'].astype(int)
```
In [22]:
```train.isnull().sum()
```
Out[22]:
```PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       2
dtype: int64```
In [23]:
```test.isnull().sum()
```
Out[23]:
```PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           1
Embarked       0
dtype: int64```

Missing Data Update

We no longer have missing ages. There are still 2 rows in the train dataset that are missing Embarked and only 1 row in the test dataset that is missing Fare. Let’s remove the rows from train. I can’t remove rows from test because there must be 418 rows (plus header) for the submission to Kaggle. For that reason, I will replace the missing value for Fare in test with the overall mean value. If there were a lot of missing Fares then I would subset the mean by Pclass.

In [24]:
```train = train.dropna()
```
In [25]:
```train.isnull().sum()
```
Out[25]:
```PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64```
In [26]:
```# Replace null Fare in test with the overall mean Fare
test_fare_null = test['Fare'].isnull()
test.loc[test_fare_null, 'Fare'] = test['Fare'].mean()  # .loc avoids chained assignment
test.isnull().sum()
```
In [27]:
```# Plot distribution of ages by Survived
sns.distplot(train['Age'][train['Survived']==1], hist=False, bins=range(0,80,5), color='gray', label='Survived')
sns.distplot(train['Age'][train['Survived']==0], hist=False, bins=range(0,80,5), color='orange', label='Died')
plt.legend()
sns.despine()
plt.title('Density Plot of Age by Survived', loc='left', fontweight='bold', y=1.02)
plt.savefig('density_age_survived.png')```

In [28]:
```# Plot distribution of fares by Survived
sns.distplot(train['Fare'][train['Survived']==1], hist=False, color='gray', label='Survived')
sns.distplot(train['Fare'][train['Survived']==0], hist=False, color='orange', label='Died')
plt.legend()
sns.despine()
plt.title('Density Plot of Fare by Survived', loc='left', fontweight='bold', y=1.02)
plt.savefig('density_fare_survived.png')```

Data Exploration Summary

From my analysis of the data, a passenger had a higher probability of dying if three conditions were met: the passenger was 1) male; 2) between 20 and 30 years old; and 3) in 3rd class. More passengers who embarked from ‘S’ died than survived, so the majority of them were likely in 3rd class. Women and children were more likely to survive, likely because they were given preference for the lifeboats. Additionally, 1st class had a higher percentage of survival than 2nd and 3rd class, and 2nd class had a higher percentage of survival than 3rd class. This is likely due to the layering of the cabin floors, with 1st class being near the top of the ship, 3rd class near the bottom, and 2nd in between.
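
The survival percentages behind these observations can also be tabulated rather than read off the plots. A small sketch on made-up rows (the `toy` frame is illustrative only): because Survived is coded 0/1, its mean within a group is that group's survival rate.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'female', 'male', 'female'],
    'Pclass': [3, 3, 1, 3, 1, 1],
    'Survived': [0, 0, 1, 0, 1, 1],
})

# Survived is 0/1, so its mean within a group is the survival rate
rates = toy.groupby(['Sex', 'Pclass'])['Survived'].mean()
print(rates)
```

On the real data the equivalent call would be `train.groupby(['Sex', 'Pclass'])['Survived'].mean()`.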

Most machine learning algorithms, including logistic regression, need numeric inputs and can’t use text categories directly. Next, I create dummy variables for the Sex and Embarked features.
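
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame. It uses the `columns=` form of `pd.get_dummies`, which prefixes each dummy with its source column name, so the resulting names differ from the bare `male`, `Q`, `S` produced by the cells in this notebook; `dtype=int` is passed so the dummies are 0/1 integers.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Embarked': ['S', 'C', 'Q']})

# drop_first=True drops the first category in sorted order
# ('female' for Sex, 'C' for Embarked), which becomes the baseline
dummies = pd.get_dummies(toy, columns=['Sex', 'Embarked'],
                         drop_first=True, dtype=int)
print(dummies.columns.tolist())
# prints ['Sex_male', 'Embarked_Q', 'Embarked_S']
```

A row with zeros in every Embarked dummy therefore means the baseline port, 'C'.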

In [29]:
```# Create dummy variables of the Sex and Embarked features in the train dataset
dummy_sex = pd.get_dummies(train['Sex'], drop_first=True) # drop one sex due to collinearity
dummy_embarked = pd.get_dummies(train['Embarked'], drop_first=True)
```
In [30]:
```# Concatenate the dummy features to the train dataset; remove the Sex and Embarked columns
train = pd.concat([train, dummy_sex, dummy_embarked], axis=1)
train = train.drop(['Sex','Embarked'], axis=1)
train.head()
```
Out[30]:
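| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | male | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
| 1 | 2 | 1 | 1 | 38 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
| 2 | 3 | 1 | 3 | 26 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 35 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | 35 | 0 | 0 | 8.0500 | 1 | 0 | 1 |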
In [31]:
```# Create dummy variables of the Sex and Embarked features in the test dataset
dummy_sex = pd.get_dummies(test['Sex'], drop_first=True) # drop one sex due to collinearity
dummy_embarked = pd.get_dummies(test['Embarked'], drop_first=True)
```
In [32]:
```# Concatenate the dummy features to the test dataset; remove the Sex and Embarked columns
test = pd.concat([test, dummy_sex, dummy_embarked], axis=1)
test = test.drop(['Sex','Embarked'], axis=1)
test.head()
```
Out[32]:
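| | PassengerId | Pclass | Age | SibSp | Parch | Fare | male | Q | S |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | 34 | 0 | 0 | 7.8292 | 1 | 1 | 0 |
| 1 | 893 | 3 | 47 | 1 | 0 | 7.0000 | 0 | 0 | 1 |
| 2 | 894 | 2 | 62 | 0 | 0 | 9.6875 | 1 | 1 | 0 |
| 3 | 895 | 3 | 27 | 0 | 0 | 8.6625 | 1 | 0 | 1 |
| 4 | 896 | 3 | 22 | 1 | 1 | 12.2875 | 0 | 0 | 1 |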
In [33]:
```# Save test PassengerId, then remove PassengerId from both datasets
test_passengerids = test['PassengerId'] # I need this for the submission process
train = train.drop('PassengerId', axis=1)
test = test.drop('PassengerId', axis=1)
```

Logistic Regression Model

In [34]:
```# Import the model from sklearn
from sklearn.linear_model import LogisticRegression
```
In [35]:
```# Instantiate the logistic regression model
logmodel = LogisticRegression()
```
In [36]:
```# Fit the model
train_feat = train.drop('Survived', axis=1)
train_label = train['Survived']
logmodel.fit(train_feat, train_label)
```
Out[36]:
```LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)```
In [37]:
```# Predict survival
test_pred = logmodel.predict(test)
```
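
For intuition about what `predict` does: a binary logistic regression computes a linear score from the features, passes it through the logistic (sigmoid) function to get a probability, and in effect labels the row 1 when that probability is at least 0.5. The sketch below illustrates this with hypothetical numbers; the `intercept` and `coefs` values are invented for the example and are not the coefficients fitted above, and `predict_survival` is a helper written only for this illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and coefficients, NOT the fitted values
intercept = 2.0
coefs = {'Pclass': -1.0, 'male': -2.5}

def predict_survival(passenger):
    # Linear score, then logistic transform, then a 0.5 cutoff
    z = intercept + sum(coefs[k] * passenger[k] for k in coefs)
    prob = sigmoid(z)
    return prob, int(prob >= 0.5)

prob, label = predict_survival({'Pclass': 1, 'male': 0})
print(round(prob, 3), label)
```

The fitted values for the real model are available as `logmodel.intercept_` and `logmodel.coef_`.
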
In [38]:
```test_pred = pd.Series(test_pred, name='pred')
```

Sanity Check

I would expect the survival counts and proportions in the test data to be similar to those in the train data. A quick summary of the predictions shows whether they are in line with the train statistics.

In [39]:
```# Print prediction counts
test_pred.value_counts()
```
Out[39]:
```0    264
1    154
Name: pred, dtype: int64```
In [40]:
```# Print prediction proportions
test_pred.value_counts(normalize=True)
```
Out[40]:
```0    0.631579
1    0.368421
Name: pred, dtype: float64```

Sanity Check Findings

The train dataset had a survival proportion of 62% died and 38% survived. My predictions are very close to this proportion, with 63% died and 37% survived. This is an encouraging sign, although matching proportions alone do not show that the individual predictions are correct.
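
A caveat on this check, illustrated with made-up labels: two series can have identical class proportions and still disagree on most rows, so the comparison can flag a badly off model but cannot confirm a good one. The `actual` and `pred` values below are invented for the example.

```python
import pandas as pd

# Made-up labels and predictions with identical class proportions
actual = pd.Series([0, 0, 0, 1, 1], name='actual')
pred = pd.Series([0, 1, 1, 0, 0], name='pred')

# Same 60/40 mix of 0s and 1s in both series
same_mix = (actual.value_counts(normalize=True).to_dict()
            == pred.value_counts(normalize=True).to_dict())

# Share of rows where the prediction matches the label
accuracy = (actual == pred).mean()
print(same_mix, accuracy)
```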

Submission

The actual Survived values for the test dataset were not provided, so I cannot measure the accuracy of my model on it myself. To learn the accuracy of my predictions, I need to submit them to Kaggle. How well the model does there depends on the train dataset being a representative sample.
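
Although the test labels are hidden, accuracy can still be estimated locally from the train data before submitting. This was not done in the notebook; the following is a sketch of one common option, k-fold cross-validation with scikit-learn's `cross_val_score`. It runs on synthetic data from `make_classification` as a stand-in for `train_feat` and `train_label`, and `max_iter=1000` is an assumption chosen to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for train_feat / train_label (not the Titanic data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: fit on 4/5 of the rows, score on the rest
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(scores.mean())
```

With the real data, `X` and `y` would be replaced by `train_feat` and `train_label`.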

In [41]:
```# Create the CSV file to submit to Kaggle
submission_dict = {
'PassengerId':test_passengerids,
'Survived':test_pred
}
submission = pd.DataFrame(submission_dict)
submission.head()
```
Out[41]:
| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
In [42]:
```submission.tail()
```
Out[42]:
| | PassengerId | Survived |
|---|---|---|
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |
| 417 | 1309 | 0 |
In [43]:
```len(submission)
```
Out[43]:
`418`
In [44]:
```# Save submission to csv
submission.to_csv('submission.csv', index=False) # Remove index for proper submission to Kaggle
```
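
The manual checks above (head, tail, length) could be bundled into a small helper that is run before each upload. `check_submission` is a hypothetical function written for this sketch, not part of pandas or Kaggle; it encodes only what the text above states or shows: two columns named PassengerId and Survived, 418 rows, no missing values, and 0/1 predictions.

```python
import pandas as pd

def check_submission(df, expected_rows=418):
    """Return a list of problems found in a Kaggle-style submission frame."""
    if list(df.columns) != ['PassengerId', 'Survived']:
        return ['unexpected columns']
    problems = []
    if len(df) != expected_rows:
        problems.append('wrong row count')
    if df.isnull().any().any():
        problems.append('missing values')
    if not df['Survived'].isin([0, 1]).all():
        problems.append('Survived must be 0 or 1')
    return problems

# Made-up frame with the same shape as the real submission
toy_submission = pd.DataFrame({
    'PassengerId': range(892, 892 + 418),
    'Survived': [0] * 418,
})
problems = check_submission(toy_submission)
print(problems)  # an empty list means no problems were found
```
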

Submission Results

The accuracy of this initial submission, using a logistic regression model, was 0.75598, or 75.6%. I plan on revisiting this dataset as I continue to improve my machine learning skills, with the aim of improving the accuracy of my model.