This post is Part II of a walkthrough of how I built and improved my submission to the Titanic Machine Learning competition on Kaggle. The goal of the competition is to create a machine learning model that predicts which passengers survived the Titanic shipwreck.
Part I covered data exploration, cleansing, and transformation. At the end of my last post, we had a set of features ready to be fed into our machine learning models.
We are going to build models based on several classification algorithms and fit and test them on our training data set. Based on the models’ scores on the training set, we will select the best-performing models and tune their hyperparameters.
Using scikit-learn, we will build a classification model based on each of the following algorithms and compute an average cross-validation score for each one: Logistic Regression, K Nearest Neighbors, Support Vector Machine, Perceptron, Stochastic Gradient Descent, Decision Tree, Random Forest, and Gaussian Naive Bayes.
Let’s first have a look at our features.
features = ['Sex', 'Title', 'Age_group', 'Fare_cat', 'Pclass', 'Embarked', 'Deck', 'SibSp', 'Parch', 'Relatives','Alone']
train_data[features][:5]
| | Sex | Title | Age_group | Fare_cat | Pclass | Embarked | Deck | SibSp | Parch | Relatives | Alone |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 3 | 0 | 8 | 1 | 0 | 1 | 0 |
| 1 | 1 | 3 | 2 | 3 | 1 | 1 | 3 | 1 | 0 | 1 | 0 |
| 2 | 1 | 2 | 1 | 1 | 3 | 0 | 8 | 0 | 0 | 0 | 1 |
| 3 | 1 | 3 | 2 | 3 | 1 | 0 | 3 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 2 | 1 | 3 | 0 | 8 | 0 | 0 | 0 | 1 |
y = train_data["Survived"]
features = ['Sex', 'Title', 'Age_group', 'Fare_cat', 'Pclass', 'Embarked', 'Deck', 'SibSp', 'Parch', 'Relatives','Alone']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
scores = {}
cross_val_scores = {}
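One thing worth noting: because all of these features are already integer-encoded, pd.get_dummies leaves them unchanged by default (it only expands object and categorical columns), so X has the same eleven columns as train_data[features]. A quick check confirms this:
# get_dummies only expands object/categorical columns, so the integer-coded features pass through unchanged
print(X.shape, train_data[features].shape)
print(X.columns.tolist())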
Some of these algorithms - such as K Nearest Neighbors and Support Vector Machine - rely on the distance between different data points in their classification process. For these algorithms, it is best to standardize all our features so they are on a similar scale. It is worth noting that our features are already on a relatively similar scale, as they are all integers ranging from 0 to 8!
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
scaled_X = sc.fit_transform(X)
scaled_X_test = sc.transform(X_test)
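As a quick sanity check, each column of the scaled features should now have a mean of roughly 0 and a standard deviation of roughly 1:
print(scaled_X.mean(axis=0).round(2))
print(scaled_X.std(axis=0).round(2))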
We will now go through each of the algorithms above, building and fitting a model on our training data. For each model, we will compute its score on the training data as well as an average cross-validation score, in order to identify the best-performing model.
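Each of the blocks below follows the same fit / score / cross-validate pattern. As an aside, that repetition could be factored out into a small helper along these lines (just a sketch; the evaluate_model function below is hypothetical and not used in the rest of the post):
from sklearn.model_selection import cross_val_score
def evaluate_model(name, model, features, target):
    # Fit the model, then record its training accuracy and its mean 10-fold cross-validation accuracy
    model.fit(features, target)
    scores[name] = model.score(features, target)
    cross_val_scores[name] = cross_val_score(model, features, target, cv=10, scoring="accuracy").mean()
    return model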
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
model = LogisticRegression()
model.fit(X, y)
scores['Logistic Regression'] = model.score(X, y)
cross_val_scores['Logistic Regression'] = cross_val_score(model, X, y, cv=10, scoring = "accuracy").mean()
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors = 4)
model.fit(scaled_X, y)
scores['K Nearest Neighbors'] = model.score(scaled_X, y)
cross_val_scores['K Nearest Neighbors'] = cross_val_score(model, scaled_X, y, cv=10, scoring = "accuracy").mean()
from sklearn.svm import SVC
model = SVC()
model.fit(scaled_X, y)
scores['Support Vector Machine'] = model.score(scaled_X, y)
cross_val_scores['Support Vector Machine'] = cross_val_score(model, scaled_X, y, cv=10, scoring = "accuracy").mean()
from sklearn.linear_model import Perceptron
model = Perceptron(max_iter=25)
model.fit(X, y)
scores['Perceptron'] = model.score(X, y)
cross_val_scores['Perceptron'] = cross_val_score(model, X, y, cv=10, scoring = "accuracy").mean()
from sklearn.linear_model import SGDClassifier
model = SGDClassifier(max_iter=15, tol=None)
model.fit(X, y)
scores['Stochastic Gradient Descent'] = model.score(X, y)
cross_val_scores['Stochastic Gradient Descent'] = cross_val_score(model, X, y, cv=10, scoring = "accuracy").mean()
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X, y)
scores['Decision Tree'] = model.score(X, y)
cross_val_scores['Decision Tree'] = cross_val_score(model, X, y, cv=10, scoring = "accuracy").mean()
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)
scores['Random Forest'] = model.score(X, y)
cross_val_scores['Random Forest'] = cross_val_score(model, X, y, cv=10, scoring = "accuracy").mean()
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
scores['Gaussian Naive Bayes'] = model.score(X, y)
cross_val_scores['Gaussian Naive Bayes'] = cross_val_score(model, X, y, cv=10, scoring = "accuracy").mean()
cross_val_scores = dict(sorted(cross_val_scores.items(), key=lambda item: item[1], reverse = True))
cross_val_scores_df = pd.DataFrame.from_dict(cross_val_scores, orient='index').reset_index().rename(columns = {"index" : "Model", 0 : "Cross Val Score"})
scores_df = pd.DataFrame.from_dict(scores, orient='index').reset_index().rename(columns = {"index" : "Model", 0 : "Score"})
cross_val_scores_df = cross_val_scores_df.merge(scores_df)
cross_val_scores_df
| | Model | Cross Val Score | Score |
|---|---|---|---|
| 0 | Support Vector Machine | 0.829401 | 0.845118 |
| 1 | Random Forest | 0.821610 | 0.921437 |
| 2 | K Nearest Neighbors | 0.821536 | 0.854097 |
| 3 | Logistic Regression | 0.809213 | 0.817059 |
| 4 | Decision Tree | 0.800275 | 0.921437 |
| 5 | Gaussian Naive Bayes | 0.789001 | 0.790123 |
| 6 | Perceptron | 0.708202 | 0.802469 |
| 7 | Stochastic Gradient Descent | 0.708202 | 0.802469 |
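The gap between a model’s score on the full training set and its cross-validation score gives a rough sense of how much it overfits; the two tree-based models fit the training data far more closely than they generalize. That gap can be computed directly from the frame above:
cross_val_scores_df["Gap"] = cross_val_scores_df["Score"] - cross_val_scores_df["Cross Val Score"]
cross_val_scores_df.sort_values("Gap", ascending=False)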
plt.figure(figsize = (16, 5))
sns.barplot(data = cross_val_scores_df, x = "Model", y = "Cross Val Score")
plt.axis([-0.5, 7.5, 0.69, 0.84])
plt.show()

The SVM, Random Forest, and K Nearest Neighbors models seem to outperform the other models we tried, both in terms of their score on the full training set and their average cross-validation score.
In the following sections we will explore these three models and try to improve them by tuning their hyperparameters. We will then submit our predictions for each improved model and note how it scores on Kaggle.
We will use scikit-learn’s GridSearchCV to search for the best parameter values for our Random Forest model.
from sklearn.model_selection import GridSearchCV
param_grid = { "criterion" : ["gini", "entropy"],
"min_samples_leaf" : [1, 5, 10, 25, 50, 70],
"min_samples_split" : [4, 12, 25],
"n_estimators": [100, 400, 700, 1000],
"max_features" : [3, 5, 9, 10]}
model = RandomForestClassifier(n_estimators=100)
gridsearch = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
gridsearch.fit(X, y)
gridsearch.best_params_
{'criterion': 'gini',
'max_features': 9,
'min_samples_leaf': 1,
'min_samples_split': 12,
'n_estimators': 700}
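Note that GridSearchCV also exposes best_score_ (the mean cross-validation accuracy of the best parameter combination) and, since it refits on the full training data by default, a ready-to-use best_estimator_, which we could use directly instead of re-typing the parameters:
print(round(gridsearch.best_score_, 3))
best_rf = gridsearch.best_estimator_  # equivalent to rebuilding the model with the parameters above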
We will now use the parameter values returned by the search to build our model and fit it to our training data. The Random Forest model with tuned hyperparameters has an average cross-validation score of 0.824 on the training data.
rf_model = RandomForestClassifier(criterion = 'gini', max_features = 9, min_samples_leaf = 1, min_samples_split = 12,
n_estimators = 700)
rf_model.fit(X, y)
print("Score: ", round(rf_model.score(X, y), 3))
print("Average cross validation score: ", round(cross_val_score(rf_model, X, y, cv=10, scoring = "accuracy").mean(), 3))
Score: 0.884
Average cross validation score: 0.824
We predict the outputs for our test data and submit them to the competition on Kaggle. This model scored 0.775 on the test data.
predictions = rf_model.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
We will now perform a search to find the best parameter values for our Support Vector Machine model.
from sklearn.model_selection import GridSearchCV
param_grid = { "kernel" : ["linear", "poly", "rbf", "sigmoid"],
"C" : [0.1, 1, 10, 50, 100],
"gamma" : ["scale", "auto"]}
model = SVC()
gridsearch = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
gridsearch.fit(scaled_X, y)
gridsearch.best_params_
{'C': 1, 'gamma': 'scale', 'kernel': 'poly'}
The SVM model with tuned hyperparameters has an average cross-validation score of 0.828 on the training data and scored 0.787 on the test data when submitted to the competition on Kaggle.
svc_model = SVC(kernel = 'poly', gamma = "scale", C = 1)
svc_model.fit(scaled_X, y)
print("Score: ", round(svc_model.score(scaled_X, y), 3))
print("Average cross validation score: ", round(cross_val_score(svc_model, scaled_X, y, cv=10, scoring = "accuracy").mean(), 3))
predictions = svc_model.predict(scaled_X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
Score: 0.85
Average cross validation score: 0.828
from sklearn.model_selection import GridSearchCV
param_grid = { "n_neighbors" : [3, 5, 8, 10],
"weights" : ["uniform", "distance"],
"algorithm" :["auto", "ball_tree", "kd_tree", "brute"]}
model = KNeighborsClassifier(n_neighbors = 4)
gridsearch = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
gridsearch.fit(scaled_X, y)
gridsearch.best_params_
{'algorithm': 'brute', 'n_neighbors': 10, 'weights': 'uniform'}
The K Nearest Neighbors model with tuned hyperparameters has an average cross-validation score of 0.835 on the training data and scored 0.769 on the test data when submitted to the competition on Kaggle.
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors = 10, algorithm = 'brute', weights = 'uniform' )
knn_model.fit(scaled_X, y)
print("Score: ", round(knn_model.score(scaled_X, y), 3))
print("Average cross validation score: ", round(cross_val_score(knn_model, scaled_X, y, cv=10, scoring = "accuracy").mean(), 3))
predictions = knn_model.predict(scaled_X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
Score: 0.852
Average cross validation score: 0.835
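To keep the tuned results side by side, here are the cross-validation scores and Kaggle test scores reported above collected into a small summary frame:
tuned_results = pd.DataFrame({
    "Model": ["Support Vector Machine", "K Nearest Neighbors", "Random Forest"],
    "Cross Val Score": [0.828, 0.835, 0.824],
    "Kaggle Score": [0.787, 0.769, 0.775]})
tuned_results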
Let’s see what our models have to say about Jack and Rose’s survival. Below we build a DataFrame with Jack and Rose’s features. For example, Jack’s fare category is 0 because he was a ‘poor’ artist (according to Wikipedia), while Rose is in first class because of her family’s upper-class status…
Jack_and_Rose = pd.DataFrame({"Sex" : [0, 1],
"Title": [1, 2],
"Age_group": [0, 0],
"Fare_cat": [0, 2],
"Pclass": [3, 1],
"Embarked": [0, 0],
"Deck": [8, 1],
"SibSp": [0, 1],
"Parch": [0, 1],
"Relatives": [0, 2],
"Alone": [1, 0]})
print(rf_model.predict(Jack_and_Rose))
# The KNN and SVM models were fit on standardized features, so scale the input first
print(knn_model.predict(sc.transform(Jack_and_Rose)))
print(svc_model.predict(sc.transform(Jack_and_Rose)))
[0 1]
[0 1]
[1 1]
Both our Random Forest and K Nearest Neighbors models predicted that Rose survived and Jack did not. Our Support Vector Machine is more optimistic and predicted that both survived!
Written on March 10th, 2021