Tackling the Titanic Problem

I’m not sure how often I’ve said to myself and others that I would learn more about data science and start tackling problems and learning algorithms. Sometimes I would then start watching a tutorial, which I’d quickly lose interest in, and usually I would do nothing at all. In a couple of my more successful attempts, I’d tackle the Iris dataset, use a Nearest Neighbors algorithm (the only one I had any familiarity with) to make predictions on it, and stop there. Part of the reason for this was that, outside of tutorials, I wasn’t really sure how to apply anything I’d learned and would quickly lose motivation.

That’s where Kaggle comes in (hopefully). From time to time I’d hear about Kaggle competitions, where people could compete to find the best algorithm to solve a problem. A competition aspect to something I didn’t know seemed a bit intimidating, but when I was hit with the inspiration once again to tackle the vague idea of “Machine Learning”, I decided to set doing well in a Kaggle competition as a goal, expecting that having some goal would make for a more successful approach. Kaggle also has beginner-focused “competitions” without set end dates, so my approach will be to tackle a few of those and figure out my mistakes on problems that many have tackled already before trying a regular competition.

So far in my approach to learning data science, I’ve read Andreas Muller’s Intro to Machine Learning with Python book, so pretty much all of my current understanding of algorithms comes from that book. I found it to be the best source for explaining concepts in a way that someone without much of a math background (me) could understand.

Approaching The Titanic

First, I’ll tackle Kaggle’s Titanic dataset and try to predict whether or not a passenger survived the Titanic based on 9 given features. I’ll record my approaches until I’ve done the best I can with the current information, at which point I’ll see what more experienced data scientists would have done and what I’m missing in my approach. Additionally, for this challenge I’ll be using Scikit-learn only. At the moment, I don’t know how much of a hindrance (if any) this will be, but I’ve got to start somewhere.

I’ll attempt to try and compare (my attempts at implementing) the following kinds of algorithms:

Nearest Neighbors
Linear
Decision Trees (Gradient Boosted and Random Forest)
SVMs

The Data

The training data consists of 891 passengers and 12 columns:

PassengerId
Survived
PClass – Passenger Class
Name
Sex
Age – Pass
SibSp – Number of siblings and spouses
Parch – Number of parents and children
Ticket – Ticket number
Fare – How much they paid
Cabin
Embarked – Where they embarked: C, S or Q

The test set of 418 passengers contains these columns except for “Survived”, which is what we’re trying to predict.

Cleaning The Data

Before fitting training data to a model, there are a few changes and considerations to be made. First, the data contains null values, which cause models to break. Second, the categorical data needs to be encoded since the models can’t read text. Third, some of the columns should probably be removed, since it’s unlikely the models can do anything with names and IDs that are unique to each passenger.

To start, I’ve separated the features into three categories: ones that are model-ready, ones that need to be encoded and ones that aren’t needed at all.

[code lang=text]
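import pandas as pd
df_raw = pd.read_csv('train.csv')
# Numeric columns usable as-is ("Survived" is the binary target)
safe_columns = ["Survived", "Age", "SibSp", "Parch", "Fare"]
# Categorical columns that need encoding
need_conversion = ["Sex", "Embarked", "Pclass"]
# Columns assumed to be too passenger-specific to help
assumed_irrelevant = ["Name", "Ticket", "PassengerId", "Cabin"]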
[/code]

The safe columns are all ones with continuous values (or binary for “Survived”), with the ones needing conversion being categorical. In the case of the columns I’m dropping, each seems unique to each passenger, which would lead to a large number of encoded columns that are only relevant on the training data and that I expect would lead to over-fitting. My next steps are to drop the unneeded columns and encode the categorical columns.

[code lang=text]
df = df_raw.drop(columns=assumed_irrelevant)
df = pd.get_dummies(df, columns=need_conversion)
[/code]
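To make the encoding step concrete, here is a minimal sketch of what pd.get_dummies does to categorical columns. The rows are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Toy rows, invented for illustration (not from train.csv)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Pclass": [3, 1, 2],
    "Fare": [7.25, 71.28, 8.05],
})

# Each listed column is replaced by one indicator column per category
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked", "Pclass"])
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

Each passenger ends up with exactly one "on" indicator per original categorical column, which is why three text-like columns turn into eight numeric ones here.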

Next, to handle any null values, I’ll use scikit-learn’s SimpleImputer to fill them in with reasonable stand-in values (by default, the mean of each column).

[code lang=text]
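from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()  # default strategy fills each gap with the column mean
# fit_transform returns a NumPy array, so new_df gets integer column labels (0 is "Survived")
new_df = pd.DataFrame(my_imputer.fit_transform(df))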
[/code]
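As a small illustration of the imputation step, the sketch below runs SimpleImputer on a made-up frame with two gaps. Passing columns=toy.columns is an optional variation that keeps the column names; the code above does not do this and uses integer labels instead.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (values are invented)
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0], "Fare": [7.25, 71.28, np.nan]})

imputer = SimpleImputer()  # default strategy="mean"
# Optional: pass columns= so the names survive the round trip through NumPy
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

The missing Age becomes the mean of 22 and 38, and the missing Fare becomes the mean of the two known fares.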

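Lastly, to put all of the features on the same scale, I passed the data through scikit-learn’s MinMaxScaler, which rescales each feature to the range 0 to 1. It doesn’t remove outliers, but it keeps large-valued features like Fare from dominating the others. I added it as the first step in the pipeline for each algorithm.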
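A small sketch of what MinMaxScaler does to a column containing one extreme value, using made-up numbers. Each value is mapped to (x - min) / (max - min), so the extreme value lands at 1 and the rest are compressed toward 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented fares; the last one is deliberately extreme
fares = np.array([[0.0], [10.0], [20.0], [500.0]])

scaled = MinMaxScaler().fit_transform(fares)
print(scaled.ravel())
```

If extreme values turn out to matter, a scaler based on medians and quartiles such as scikit-learn’s RobustScaler is one alternative to compare against.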

Testing The Algorithms
Initially, I created instances of each algorithm type mentioned and ran them on their own, then with cross validation and a grid search, and finally in the following pipeline, with only the algorithm and parameter grid differing.

[code lang=text]
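from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
# ExModel and the "ex" parameter names are placeholders for each algorithm below
param_grid = {'ex__param1': [0.001, 0.01, 0.1, 1, 10, 100],
              'ex__param2': [0.001, 0.01, 0.1, 1, 10, 100]}
pipe = Pipeline([("scaler", MinMaxScaler()), ("ex", ExModel())])
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)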
[/code]

X_train and y_train come from the following split:

[code lang=text]
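cols = new_df.columns[1:]  # assumption: cols (not defined in the original) is every column except 0 ("Survived")
X_train, X_test, y_train, y_test = train_test_split(new_df[cols], new_df[0], random_state=0)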
[/code]
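For reference, here is a self-contained sketch of the whole pattern (split, pipeline, grid search, scoring) that runs without the Titanic files. The data is a synthetic stand-in from make_classification, and reporting accuracy with gs.score is an assumption about how the numbers in the following sections could be produced, since the scoring code isn’t shown.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the cleaned features (assumption, not the real data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())])
# Grid keys are "<step name>__<parameter name>", with a double underscore
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9]}

# The scaler is refit inside every cross-validation fold, so no fold sees
# scaling statistics computed from its own validation rows
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)

print("Best parameters:", gs.best_params_)
print("Mean cross-validation accuracy:", round(gs.best_score_, 2))
print("Training set accuracy:", round(gs.score(X_train, y_train), 2))
print("Test set accuracy:", round(gs.score(X_test, y_test), 2))
```

Putting the scaler inside the pipeline rather than scaling everything up front is what keeps the cross-validation scores honest.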

KNN
Despite being the first algorithm I learned and the only one I was familiar with for the past year, the nearest neighbors algorithm consistently scored among the lowest on the test set, with only the n_neighbors parameter being tuned.

[code lang=text]
from sklearn.neighbors import KNeighborsClassifier
param_grid = {"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())])
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)
[/code]

Training Set accuracy: 0.81
Test Set accuracy: 0.79

SVM
Next I tried scikit-learn’s SVC classifier, tuning the C and gamma parameters. It performed a little better.

[code lang=text]
from sklearn.svm import SVC
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)
[/code]

Training Set accuracy: 0.82
Test Set accuracy: 0.81

Linear
For linear models, I tried the LogisticRegression and LinearSVC classes, tuning the C parameter.

[code lang=text]
from sklearn.linear_model import LogisticRegression
param_grid = {'lr__C': [0.001, 0.01, 0.1, 1, 10, 100]}
pipe = Pipeline([("scaler", MinMaxScaler()), ("lr", LogisticRegression())])
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)
[/code]

Logistic Regression scored slightly lower than the SVM on the training set and slightly higher on the test set.

Training Set accuracy: 0.80
Test Set accuracy: 0.82

LinearSVC performed slightly worse than Logistic Regression on the test set.

[code lang=text]
from sklearn.svm import LinearSVC
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100]}
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", LinearSVC())])
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)
[/code]

Training Set accuracy: 0.80
Test Set accuracy: 0.79

Decision Trees
Even single decision trees looked promising on this dataset, scoring 80 percent accuracy, which was higher than most of the single runs of other algorithms I was trying at the time without cross validation. The random forests and gradient boosted trees I ran through the pipeline had better results as well.

First off, the Random Forest classifier, using n_estimators as the only parameter in the param grid. While changing parameters with grid search seems like a good strategy in general, I think this might be a parameter that it’s wrong to use grid search on, since my understanding is that more estimators will generally not hurt a random forest and the trade-off is computational cost.

[code lang=text]
from sklearn.ensemble import RandomForestClassifier
param_grid = {'rf__n_estimators': [50, 75, 100, 200, 300]}
pipe = Pipeline([("scaler", MinMaxScaler()), ("rf", RandomForestClassifier())])
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)
[/code]

Chosen value of n_estimators: 200
Training Set accuracy: 0.80
Test Set accuracy: 0.83
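One way to probe the n_estimators question is to compare cross-validated accuracy for a few forest sizes directly. This is a rough sketch on synthetic data from make_classification (an assumption, not the Titanic data), so the numbers it prints say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = {}
for n in [10, 50, 200]:
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    # Mean accuracy over 5 folds for this forest size
    scores[n] = cross_val_score(rf, X, y, cv=5).mean()

for n, score in scores.items():
    print(n, round(score, 3))
```

If the scores flatten out after some size, the extra trees mostly cost time, and small differences between sizes are likely to be noise.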

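Going with the assumption that running a grid search on n_estimators is unnecessary, since the trade-off is between cost to run and accuracy as opposed to between over-fitting and generalizing, I decided to fix n_estimators at 300 for the Gradient Boosting Classifier and use the grid search to tune the learning rate and max depth. (That assumption is shakier for boosting than for random forests: each new tree corrects the previous ones, so adding more can eventually over-fit, and the right number depends on the learning rate.)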

[code lang=text]
from sklearn.ensemble import GradientBoostingClassifier
param_grid = {'gb__n_estimators': [300],
              'gb__learning_rate': [0.001, 0.01, 0.1, 1, 10, 100],
              'gb__max_depth': [2, 3, 4, 5, 6, 7, 8]}
pipe = Pipeline([("scaler", MinMaxScaler()), ("gb", GradientBoostingClassifier())])
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)
[/code]

Training Set accuracy: 0.83
Test Set accuracy: 0.83

Lastly, I decided to try out an algorithm called XGBoost, which I’d seen mentioned as quite effective even on its own. I tuned the eta (learning rate) and gamma parameters and ultimately got the best test set results of anything put through the pipeline.

[code lang=text]
from xgboost import XGBClassifier
param_grid = {'xg__eta': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1], 'xg__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
pipe = Pipeline([("scaler", MinMaxScaler()), ("xg", XGBClassifier())])
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)
[/code]

Training Set accuracy: 0.82
Test Set accuracy: 0.85

Submitting the Results
After running the algorithms on the test set from my split, I chose some of the stronger ones to submit to Kaggle and check the results on their test set. This process involved saving the grid search object’s best_estimator_ attribute to a variable,

[code lang=text]
best_estimator = gs.best_estimator_
[/code]

loading the test data and running the same preprocessing on it

[code lang=text]
testset = pd.read_csv("test.csv")
exid = testset['PassengerId']
tdf = testset.drop(["Name", "Ticket", "PassengerId", "Cabin"], axis=1)
tdf = pd.get_dummies(tdf, columns=need_conversion)
imputer_the_second = pd.DataFrame(SimpleImputer().fit_transform(tdf), columns=tdf.columns)
[/code]

then using best_estimator to predict on the new data, making sure the predictions are written as binary results rather than floats, and outputting them to a csv file along with the PassengerId.

[code lang=text]
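# display.float_format only changes how floats are printed, not what to_csv writes
pd.options.display.float_format = '{:,.0f}'.format
result = best_estimator.predict(imputer_the_second)
tdf['PassengerId'] = exid
tdf['Survived'] = result.astype(int)  # cast so the file holds 0/1 rather than 0.0/1.0
endcols = ['PassengerId', 'Survived']
enddf = tdf[endcols]
enddf.to_csv('firstsubmissionx.csv', index=False)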
[/code]
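One thing to watch when repeating the preprocessing on a second file is that pd.get_dummies only creates columns for the categories it actually sees. The sketch below, with made-up rows, shows how a test frame can end up with fewer columns than the training frame and one way to line them up with reindex. On the real Titanic files the categories may well match, so this may not have affected the submissions here.

```python
import pandas as pd

# Invented rows: the "test" frame happens to contain only one port
train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
test = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [9.5, 13.0]})

train_enc = pd.get_dummies(train, columns=["Embarked"])
test_enc = pd.get_dummies(test, columns=["Embarked"])  # lacks Embarked_C and Embarked_Q

# Give the test frame exactly the training columns, in the same order,
# filling indicator columns that never appeared with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

A model fitted on the training columns expects the same number of features in the same order at prediction time, which is what the reindex call guarantees.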

For Kaggle’s scoring, my submissions got the following values:

Random Forest – 0.76555
SVC – 0.77990
Gradient Boosted Trees – 0.79425
Logistic Regression – 0.75598
XGBoost – 0.77033

The gradient boosted trees ended up getting the highest score on Kaggle’s test set, despite XGBoost appearing to be the best on my own test split.

Takeaways
Overall, this felt like a good process to go through for getting started, although I still have questions about my results and how successful this attempt was. Part of it is due to being new and having only a limited understanding of each of the algorithms I used. I can understand the reasons for using a random forest over a single decision tree, and for using a grid search and cross validation over single runs, but I’m confused as to why the trees seemed more effective than a linear model and, with the SVC model being so close in percentage, how much of the difference was just luck or parameter tuning.

Speaking of luck, while cross validation could split up the training and test sets in different ways, for Kaggle I was only evaluating on one data set. There seem to be some people who posted 100 percent scores, but I don’t even know if I should believe an algorithm could consistently predict that from the data given. If I could receive multiple sets to test on and average them, would the results be different? Should I trust my test results over the Kaggle score or vice versa?
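One way to get a feel for how much of a score difference is luck is to repeat the cross validation over many different splits and look at the spread. This is a minimal sketch using scikit-learn’s RepeatedStratifiedKFold on synthetic data (an assumption, not the Titanic data); the model and the number of repeats are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 5 folds repeated 10 times with different shuffles gives 50 accuracy scores
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("mean accuracy:", round(scores.mean(), 3))
print("std across splits:", round(scores.std(), 3))
```

If two models differ by less than the spread across splits, the ranking between them on any single test set is probably not something to read much into.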

From here, I’ll take a look at how others approached the problem and see if there’s an accepted solution for what I should have done. My next step will be either writing an article looking back over the process or beginning on Kaggle’s House Prices dataset to try my hand at a regression problem.
