The Titanic Survival dataset

Michèle Srour, Data Scientist

This post is part I of a walkthrough of how I built and improved my submission to the Titanic Machine Learning competition on Kaggle. The goal of the competition is to create a machine learning model that predicts which passengers survived the Titanic shipwreck.

In this post and the next, I will walk through the process of creating a machine learning classification model using the Titanic dataset, which provides various information on the passengers of the Titanic and their survival.

Part I covers data exploration, cleansing and transformation. At the end of this post, we’ll have a set of features ready to be fed into our machine learning models.

The dataset can be found on Kaggle. It is split into two groups: the training set (train.csv) and the test set (test.csv).

The training set, which is meant to be used to build machine learning models, comes with the outcome (survived or not) for each passenger. The test set has the same features as the training set, apart from the outcome.

Loading and exploring the data


import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

The output of the cell below shows the first few lines of our training set. The dataset contains the following variables: PassengerId, Pclass (ticket class), Name, Sex, Age, SibSp (number of siblings/spouses aboard), Parch (number of parents/children aboard), Ticket, Fare, Cabin and Embarked (port of embarkation), as well as the outcome “Survived”.

train_data = pd.read_csv('train.csv')
train_data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
test_data = pd.read_csv('test.csv')
test_data.head()
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

Using pandas’ describe function shows us that 38% of passengers in the training data survived the Titanic. We can also see that the passengers’ ages range from 0.42 to 80, and that some features have missing data, such as “Age”, “Cabin” and “Embarked”.

train_data.describe(include = "all")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
count 891.000000 891.000000 891.000000 891 891 714.000000 891.000000 891.000000 891 891.000000 204 889
unique NaN NaN NaN 891 2 NaN NaN NaN 681 NaN 147 3
top NaN NaN NaN Sadlier, Mr. Matthew male NaN NaN NaN CA. 2343 NaN C23 C25 C27 S
freq NaN NaN NaN 1 577 NaN NaN NaN 7 NaN 4 644
mean 446.000000 0.383838 2.308642 NaN NaN 29.699118 0.523008 0.381594 NaN 32.204208 NaN NaN
std 257.353842 0.486592 0.836071 NaN NaN 14.526497 1.102743 0.806057 NaN 49.693429 NaN NaN
min 1.000000 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 NaN 0.000000 NaN NaN
25% 223.500000 0.000000 2.000000 NaN NaN 20.125000 0.000000 0.000000 NaN 7.910400 NaN NaN
50% 446.000000 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 NaN 14.454200 NaN NaN
75% 668.500000 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 NaN 31.000000 NaN NaN
max 891.000000 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 NaN 512.329200 NaN NaN

Exploring correlation between different features of our data set and “Survival”


In this section, we explore every feature in our training set and check whether it is correlated to survival. This will help us decide whether to use it as a feature in our machine learning model, and whether we need to transform it beforehand.

We will start by exploring the gender and age features - it is common knowledge that women and children were evacuated first from the sinking ship.

Gender

The bar plot below shows that around 75% of women on board survived, whereas only 19% of men did! It looks like women had a much higher chance of surviving the shipwreck, so gender should be a helpful feature for predicting survival.

women = train_data[train_data['Sex'] == "female"]
perc_women_survived = round(women['Survived'].sum()/len(women)*100)

men = train_data[train_data['Sex'] == "male"]
perc_men_survived = round(men['Survived'].sum()/len(men)*100)

title = f"{perc_women_survived}% of women passengers survived, \nwhereas only {perc_men_survived}% of men did"

sns.set_palette('Pastel1')
sns.set_style('whitegrid')
sns.barplot(data = train_data, x = "Sex", y = "Survived")
plt.title(title)
plt.show()

[Figure: bar plot of survival rate by gender]
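As a cross-check, the two percentages above can be computed in one line with groupby; a minimal sketch using a tiny synthetic frame standing in for train_data (illustration only):

```python
import pandas as pd

# Tiny synthetic sample standing in for train_data (illustration only)
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0],
})

# Mean of a 0/1 column is exactly the survival rate per group
rates = df.groupby("Sex")["Survived"].mean().round(2)
print(rates)
```

On the real training set this reproduces the 0.75 / 0.19 figures quoted above.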

Age

The figure below shows the spread of age for passengers who survived and those who did not, for both men and women. We can see that the spread is quite different for men and women. Also, it looks like there are more children in the group who survived, and more older people in the group who did not survive.

ax = sns.violinplot(data = train_data, y = "Survived", x = "Age", hue = "Sex", orient= "h", split = True)

[Figure: violin plot of age distribution by survival and gender]

Passenger class

If you’ve seen the Titanic movie, you know that socioeconomic status played a role in deciding which passengers were given the priority to evacuate the ship. Survival was not only based on gender or age but also on class, as can be seen on the bar plot below: more than 60% of first class passengers survived the shipwreck, while no more than 25% of third class passengers did.

ax = sns.barplot(data = train_data, x = "Pclass", y = "Survived")

[Figure: bar plot of survival rate by passenger class]

Fare

The price a passenger paid for their ticket is also an indicator of their socioeconomic status. The violin plot below shows that more people paid a higher rate for their ticket in the group who survived. Most fare amounts sit on the lower end of the x axis; the long tail is probably due to a small number of outliers.

plt.figure(figsize = (10, 6))
ax = sns.violinplot(data = train_data, y = "Survived", x = "Fare", orient= "h")

[Figure: violin plot of fare by survival]

Relatives

The dataset provides information on the number of siblings, spouses, children and parents accompanying each passenger. Let us explore whether the number of relatives a passenger had on board affected their chance of survival. The bar plot shows that passengers accompanied by up to 3 relatives had a higher chance of survival than passengers travelling alone. The chance of survival decreases beyond this point.

relatives = train_data["SibSp"] + train_data["Parch"]

plt.figure(figsize = (12, 6))
ax = sns.barplot(x= relatives, y = train_data["Survived"])
plt.xlabel("Number of relatives")

[Figure: bar plot of survival rate by number of relatives]

Port of embarkation

One of the variables provided in the dataset is the port of embarkation “Embarked”, which takes one of three values: Cherbourg (C), Queenstown (Q) or Southampton (S). It looks like passengers who embarked at Cherbourg had a higher survival rate.

ax = sns.barplot(data = train_data, x = "Embarked", y = "Survived")

[Figure: bar plot of survival rate by port of embarkation]
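One possible (unverified here) explanation worth checking is that Cherbourg passengers skewed towards first class; pd.crosstab makes this easy to inspect. A sketch on a tiny synthetic sample standing in for train_data:

```python
import pandas as pd

# Synthetic sample (illustration only): counts of passengers per port and class
df = pd.DataFrame({
    "Embarked": ["C", "C", "S", "S", "S", "Q"],
    "Pclass": [1, 1, 3, 2, 3, 3],
})

ct = pd.crosstab(df["Embarked"], df["Pclass"])
print(ct)
```

Running the same crosstab on the real training set would show whether class composition differs by port.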

So far, we’ve explored all variables aside from PassengerId, Name, Ticket, and Cabin.

Filling out missing data


Let’s have a look at the missing data in each of the data sets. It looks like the “Cabin” variable has a huge number of missing entries in both datasets. “Age” comes next with quite a few missing values. We also have a couple of missing values for “Embarked” and “Fare”.

In this section we are going to deal with the missing data, discard the data we don’t need, and fill out the missing values with sensible data where relevant.

We are going to create a list containing both data sets, training and test, so we can perform the same operations on both.

pd.concat([train_data.isnull().sum(), test_data.isnull().sum()], axis=1).rename(columns = {0 : "Train_data", 1: "Test_data"})
Train_data Test_data
PassengerId 0 0.0
Survived 0 NaN
Pclass 0 0.0
Name 0 0.0
Sex 0 0.0
Age 177 86.0
SibSp 0 0.0
Parch 0 0.0
Ticket 0 0.0
Fare 0 1.0
Cabin 687 327.0
Embarked 2 0.0
combined_data = [train_data, test_data]

Age

Let’s start with age. We are going to compute the mean age for every gender and class combination, and use these values to fill out missing “Age” data - this will be stored in a new column “Age_full”.

for data_set in combined_data:
    # mean age for every (gender, class) combination
    mean_age_sex_class = data_set.groupby(["Sex", "Pclass"])["Age"].mean()

    data_set['Age_full'] = data_set.apply(
        lambda row: row['Age'] if not pd.isnull(row['Age'])
        else mean_age_sex_class[row['Sex']][row['Pclass']], axis=1)

train_data.head(10)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Age_full
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 22.000000
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 38.000000
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 26.000000
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 35.000000
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 35.000000
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q 26.507589
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S 54.000000
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S 2.000000
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S 27.000000
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C 14.000000
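An equivalent, arguably more idiomatic way to fill the missing ages is groupby().transform("mean") combined with fillna; a small sketch on synthetic data standing in for the real frame:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the real data (illustration only)
df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female"],
    "Pclass": [3, 3, 1, 1],
    "Age": [22.0, np.nan, 38.0, np.nan],
})

# transform("mean") broadcasts each group's mean back to the original rows,
# so fillna replaces each missing age with its (Sex, Pclass) group mean
df["Age_full"] = df["Age"].fillna(
    df.groupby(["Sex", "Pclass"])["Age"].transform("mean"))
print(df["Age_full"].tolist())
```

This avoids the row-by-row apply, which can be slow on larger datasets.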

Port of embarkation

We have a couple of missing “Embarked” values in the training set. We are going to fill them with the most common value, which happens to be “S” for Southampton.

for data_set in combined_data:
    embarked_mode = data_set['Embarked'].mode()
    data_set['Embarked'] = data_set['Embarked'].fillna(embarked_mode[0])

Fare

We have one missing “Fare” value in the test set for a third class passenger who embarked at Southampton, which we are going to replace with the average fare for third class passengers who embarked there in the test set.

mean_fare_port_class = test_data.groupby(["Embarked", "Pclass"])["Fare"].mean()

test_data['Fare'] = test_data['Fare'].fillna(mean_fare_port_class["S"][3])

Cabin

As mentioned above, the “Cabin” variable has a huge number of missing values. We could drop this feature completely; however, cabin numbers start with a letter which could denote the deck, or a particular section of the ship. Let’s extract this info and store it in a new “Deck” variable.

deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8, "T": 9}

for data_set in combined_data:
    data_set['Cabin'] = data_set['Cabin'].fillna("U0")
    data_set['Deck'] = data_set['Cabin'].apply(lambda x: x[0])
    
    data_set['Deck'] = data_set['Deck'].map(deck)
    data_set['Deck'] = data_set['Deck'].astype(int)
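A quick way to sanity-check whether the new “Deck” code carries any signal is to look at the survival rate per deck; a sketch on a tiny synthetic frame (8 being the code used above for unknown cabins):

```python
import pandas as pd

# Synthetic frame with the already-encoded Deck column (illustration only)
df = pd.DataFrame({
    "Deck": [2, 2, 8, 8, 8, 3],
    "Survived": [1, 1, 0, 0, 1, 1],
})

# Survival rate per deck code
deck_rates = df.groupby("Deck")["Survived"].mean()
print(deck_rates)
```

On the real data, comparing deck 8 (unknown cabin) against the known decks shows whether having a recorded cabin is itself informative.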

Transforming variables


Now that we are done dealing with missing data, we have 2 new columns: “Age_full” which replaces “Age”, and “Deck” that we’ve extracted from the “Cabin” data.

In the following section, we are going to transform the features we are using for our model into integer variables to feed into our models.

Numerical variables such as “Age” and “Fare” will be transformed into categorical variables by splitting the data in groups/intervals. We will then transform all our categorical variables into integer variables.

Age

Let’s start by splitting passengers into age groups and looking at the survival rate of each group. It looks like children (0-18) are the age group with the highest survival rate, whereas 70+ year olds have a survival rate of around 15%.

for data_set in combined_data:
    data_set['Age_group'] = data_set['Age_full'].map(lambda x: 0 if (x >= 0) & (x < 18) else 
                                                 (1 if (x >= 18) & (x < 30) else (
                                                  2 if (x >= 30) & (x < 40) else (
                                                  3 if (x >= 40) & (x < 50) else (
                                                  4 if (x >= 50) & (x < 60) else (
                                                  5 if (x >= 60) & (x < 70) else (
                                                  6 if x >= 70 else "NaN")))))))
df = pd.DataFrame((train_data.groupby(["Age_group"])["Survived"].sum())*100/(train_data.groupby(["Age_group"])["Survived"].count()))

df['Not Survived'] = df["Survived"].apply(lambda x : 100 - x)

ax = plt.subplot()
plt.bar(range(len(df)), df["Survived"], label = "Survived")
plt.bar(range(len(df)), df["Not Survived"], bottom = df["Survived"], label = "Not Survived", alpha = 0.3)
plt.legend()
ax.set_xticks(range(0, 7))
ax.set_xticklabels(['0-18', '18-30', '30-40', '40-50', '50-60', '60-70', '70+'])
ax.set_yticks(range(0, 105, 10))
ax.set_yticklabels([str(x) + "%" for x in range(0, 105, 10)])
plt.title('Percentage of survival in each age group')
plt.xlabel('Age groups')
plt.show()

[Figure: stacked bar chart of survival percentage by age group]
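The nested lambda above can also be expressed with pd.cut, which takes the bucket edges directly; a sketch covering the same seven intervals:

```python
import pandas as pd

# One sample age per bucket (illustration only)
ages = pd.Series([5.0, 25.0, 35.0, 45.0, 55.0, 65.0, 75.0])

# Same seven buckets as the nested conditional: [0,18), [18,30), ..., [70, inf)
bins = [0, 18, 30, 40, 50, 60, 70, float("inf")]
age_group = pd.cut(ages, bins=bins, labels=range(7), right=False).astype(int)
print(age_group.tolist())
```

right=False makes each interval closed on the left, matching the `x >= lower and x < upper` logic of the lambda.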

Fare

We will now create a categorical variable out of the “Fare” variable. Using pandas’ qcut, we’ll split fare values into 4 categories, each corresponding to a quartile of the variable. The bar plot shows that the higher the fare category (and hence the fare), the higher the survival rate.

fare_labels = [0, 1, 2, 3]

for data_set in combined_data:
    data_set['Fare_cat'] = pd.qcut(data_set['Fare'], 4, fare_labels)
    data_set['Fare_cat'] = data_set['Fare_cat'].astype(int)
    
train_data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Age_full Deck Age_group Fare_cat
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 U0 S 22.0 8 1 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 38.0 3 2 3
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 U0 S 26.0 8 1 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 35.0 3 2 3
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 U0 S 35.0 8 2 1
df = pd.DataFrame((train_data.groupby(["Fare_cat"])["Survived"].sum())*100/(train_data.groupby(["Fare_cat"])["Survived"].count()))

df['Not Survived'] = df["Survived"].apply(lambda x : 100 - x)

ax = plt.subplot()
plt.bar(range(len(df)), df["Survived"], label = "Survived")
plt.bar(range(len(df)), df["Not Survived"], bottom = df["Survived"], label = "Not Survived", alpha = 0.3)
plt.legend()
ax.set_xticks(range(0, 4))
ax.set_xticklabels(['Cat1', 'Cat2', 'Cat3', 'Cat4'])
ax.set_yticks(range(0, 105, 10))
ax.set_yticklabels([str(x) + "%" for x in range(0, 105, 10)])
plt.title('Percentage of survival in each fare category')
plt.xlabel('Fare categories')
plt.show()

[Figure: stacked bar chart of survival percentage by fare category]
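If you want to see the actual quartile edges qcut chose, retbins=True returns them alongside the categories; a sketch on a small synthetic fare series:

```python
import pandas as pd

# Synthetic fares (illustration only); the real column has 891/418 values
fares = pd.Series([7.25, 7.9, 8.05, 12.3, 26.0, 53.1, 71.3, 512.3])

# retbins=True also returns the 5 edges defining the 4 quartile bins
cats, edges = pd.qcut(fares, 4, labels=[0, 1, 2, 3], retbins=True)
print(list(edges))
```

Inspecting the edges on the real data makes the fare categories interpretable (e.g. which fare range ends up in category 3).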

Names and titles

We have not yet explored the “Name” feature. As mentioned earlier, a passenger’s name is unique to each passenger and is hence not likely to give any indication about their survival. However, if we look closer, this variable also holds the passenger’s title.

The value counts of the extracted titles show that the most common titles are “Mr”, “Miss”, “Mrs” and “Master”. Other titles are quite rare and will all be grouped under a fifth category.

for data_set in combined_data:
    # raw string avoids escape warnings; expand=False returns a Series
    data_set['Title'] = data_set['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

df = pd.concat([train_data, test_data])

pd.DataFrame(df['Title'].value_counts()).reset_index()
index Title
0 Mr 757
1 Miss 260
2 Mrs 197
3 Master 61
4 Rev 8
5 Dr 8
6 Col 4
7 Ms 2
8 Major 2
9 Mlle 2
10 Jonkheer 1
11 Countess 1
12 Dona 1
13 Capt 1
14 Don 1
15 Sir 1
16 Lady 1
17 Mme 1
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}


for data_set in combined_data:
    data_set['Title'] = data_set['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr','Major', 'Rev', 'Sir', 'Dona', 'Jonkheer'], 'Rare')
    data_set['Title'] = data_set['Title'].replace(['Mlle', 'Ms'], 'Miss')
    data_set['Title'] = data_set['Title'].replace('Mme', 'Mrs')
 
    data_set['Title'] = data_set['Title'].map(titles)
    
    data_set['Title'] = data_set['Title'].fillna(0)
    data_set['Title'] = data_set['Title'].astype(int)
    
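Once the titles are encoded, it is worth confirming they correlate with survival; a sketch on synthetic names, using the same extraction regex as above:

```python
import pandas as pd

# Synthetic names (illustration only)
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Heikkinen, Miss. Laina", "Palsson, Master. Gosta"],
    "Survived": [0, 1, 1],
})

# Extract the title and compute the survival rate per title
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
title_rates = df.groupby("Title")["Survived"].mean()
print(title_rates)
```

On the real training set this shows, for instance, whether “Master” (young boys) behaves more like the children group than like “Mr”.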

Relatives, gender and ports of embarkation

In the below we create a new column “Relatives”, which represents the total number of siblings, spouses, parents and children accompanying a passenger, as well as a variable “Alone”, which has a value of 1 if the passenger had no relatives accompanying them, and 0 otherwise. We also map the string values in columns “Sex” and “Embarked” to numerical values.

for data_set in combined_data:
    data_set['Relatives'] = data_set['SibSp'] + data_set['Parch']
    data_set['Alone'] = data_set['Relatives'].apply(lambda x : 1 if x == 0 else 0) 
gender = {'female' : 1, 'male': 0}

for data_set in combined_data:
    data_set['Sex'] = data_set['Sex'].map(gender)
ports = {'S' : 0, 'C': 1, 'Q' : 2}

for data_set in combined_data:
    data_set['Embarked'] = data_set['Embarked'].map(ports)
    data_set['Embarked'] = data_set['Embarked'].astype(int)
    
train_data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Age_full Deck Age_group Fare_cat Title Relatives Alone
0 1 0 3 Braund, Mr. Owen Harris 0 22.0 1 0 A/5 21171 7.2500 U0 0 22.0 8 1 0 1 1 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 38.0 1 0 PC 17599 71.2833 C85 1 38.0 3 2 3 3 1 0
2 3 1 3 Heikkinen, Miss. Laina 1 26.0 0 0 STON/O2. 3101282 7.9250 U0 0 26.0 8 1 1 2 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35.0 1 0 113803 53.1000 C123 0 35.0 3 2 3 3 1 0
4 5 0 3 Allen, Mr. William Henry 0 35.0 0 0 373450 8.0500 U0 0 35.0 8 2 1 1 0 1

We are now ready to build our models using the following features, which all have integer values: ‘Sex’, ‘Title’, ‘Age_group’, ‘Fare_cat’, ‘Pclass’, ‘Embarked’, ‘Deck’, ‘SibSp’, ‘Parch’, ‘Relatives’ and ‘Alone’. Stay tuned for part II of this walkthrough, where we’ll build different models and create a submission.
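As a preview of part II, selecting the final feature matrix and target can be sketched as below (the two-row frame is a hypothetical stand-in for the fully transformed train_data):

```python
import pandas as pd

# Hypothetical stand-in for the fully transformed train_data (illustration only)
train_data = pd.DataFrame({
    "Survived": [0, 1], "Sex": [0, 1], "Title": [1, 3], "Age_group": [1, 2],
    "Fare_cat": [0, 3], "Pclass": [3, 1], "Embarked": [0, 1], "Deck": [8, 3],
    "SibSp": [1, 1], "Parch": [0, 0], "Relatives": [1, 1], "Alone": [0, 0],
})

# The eleven integer features listed above, plus the target column
features = ["Sex", "Title", "Age_group", "Fare_cat", "Pclass",
            "Embarked", "Deck", "SibSp", "Parch", "Relatives", "Alone"]
X = train_data[features]
y = train_data["Survived"]
print(X.shape, y.shape)
```

X and y in this shape can be passed directly to any scikit-learn classifier's fit method.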