The AICVS club conducted an online machine learning competition on the Kaggle platform on 14th September 2019, from 12 noon to 12 midnight. The participants worked on the dataset remotely on their own devices. The competition used a very beginner-friendly dataset that is a popular choice for getting started with machine learning: the Titanic dataset. Owing to strong demand from students for hands-on projects, we decided to host this machine learning competition so that even beginners with very little knowledge of ML models and only a rudimentary knowledge of Python could tackle the problem statement using a step-by-step, self-explanatory tutorial shared with the participants. The theme of the dataset takes its inspiration from the devastating Titanic shipwreck. On April 15, 1912, during the Titanic's maiden voyage, the ship sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. Although there was some luck involved, some groups, such as women, children, and the upper class, were more likely to survive than others. Based on the parameters in the dataset (class of the passenger, gender, age, presence of the passenger's siblings, spouse or parents on board, fare, cabin and port of embarkation), the participants had to predict whether a given passenger survived the shipwreck. The participants had to submit their predictions along with a copy of their code as Jupyter notebooks to prevent plagiarism. The response to the competition was overwhelmingly positive, with a total of 109 registered participants. The winners of the competition were: 1st: Mahima Malakwade, 2nd: Aishwarya Keskar, 3rd: Poorva Bhalerao, who received accuracy scores of 0.87444, 0.87082 and 0.85897 respectively on the private (tested on 70% of the dataset) and public (tested on 30% of the dataset) leaderboards. Overall, the contestants found the whole experience empowering and exciting.
The Kaggle competition was the first of its kind in the college, involving hands-on work with real datasets, and it gave beginners valuable practical experience. A tutorial was provided so that even novices could participate, and we ensured that everyone could access it easily. The contestants found the chosen topic very interesting, and we have received several requests to hold Kaggle or similar competitions at least once a month. The exposure to ML and the committee's quick email replies to students' queries were also highly appreciated. Some students stated that, since it was the first time they had used Kaggle, pandas and similar tools, they needed more time for the competition. Most of the issues the students faced were with the submission of files, and the interface was a bit difficult for the freshmen to understand. We will make sure we improve on these issues and provide a better and more exciting event next time. All suggestions are appreciated, and next time we will look into how to make the competitions easier for novices. The contestants clearly liked this concept and we are sure they will participate in the coming contests too. We enjoyed hosting the competition as much as the contestants enjoyed participating in it! The winners' Python notebook submissions are attached below.
Written by Shreya Abhyankar, Aditi Jagdale and Shreya Pawaskar.
In this tutorial, you will learn how to complete an analysis of what sorts of people were likely to survive the Titanic shipwreck. In particular, you are going to apply the tools of machine learning to predict which passengers survived the tragedy. Description of the Dataset The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class. The steps we are going to follow to predict which passengers survived the tragedy: 1. Import the data 2. Analyse the data 3. Data cleaning 4. Feature selection 5. Model selection and training 1] Importing the Data Python Notebook Let us look at how to load the dataset into a Python file. You are probably using a Python IDE like IDLE to run your .py files. However, as serious™ machine learning enthusiasts, we are going to use something that professionals use: Jupyter Notebooks. The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Notebooks are used for data cleaning and transformation, numerical simulation, statistical modelling, data visualization, machine learning, and much more. If you want to load the data right now and start working with it, instead of having to install Python and then Jupyter Notebooks, no fear! You can also use Kaggle Notebooks! The screenshots below will help you open up a Kaggle Notebook in your browser.
To install Jupyter Notebooks, follow the installation guide given on https://jupyter.org/ . Adding the data We are now going to add the dataset we will be using into the Python notebook. If you notice, the dataset comes in two .csv files, training and test data. If you attended our session “How To: ML” by Ravina More, she explained the importance of splitting your dataset into training and test data. If you couldn’t attend the session, read this small article on it. Splitting your data is essential because you want to make sure your model actually generalizes. Why not train your ML algorithm on all the data and test it on the same data? Because to get a real-world estimate of the accuracy of your model, we need to test it on data that it has never seen before. This is why we separate the data into training and test data. What do you think is the ratio of the split? Is it 50-50? Or is it 70-30? There exists another .csv file called gender_submission.csv, which is a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like. Ignore that file for the time being. Loading data in Python Let us now load our dataset into our Python notebook. We are going to use a Python library called Pandas. Pandas is the most popular Python library used for data analysis. We are going to be using a data structure in Pandas called the DataFrame. Don’t worry about whether you know exactly what that is or not. Go ahead and look at the basics of Pandas from this site. Instead of spending all our time going through data frames and how to manipulate them, we’re going to jump right in and start working with them. First, import the pandas library with this statement. import pandas as pd Now let us look at how to load a .csv file in pandas. Google the function that does this and save your loaded .csv files into two variables, train and test.
train_df = #your code here test_df = #your code here Note: If you are getting an error in this step, consider that it might be because your filenames are wrong. If you run the lines underlined in red in the last screenshot, you will see an output that is a list of 3 file paths. Copy the relevant file paths into the function that you are using to load the data and it should work fine. 2] Analyse the Data Looking at the Data We shall now look closer at the dataset that we have just loaded to see what the data looks like. We have to figure out what we need to do with the data, what sort of algorithm to use, etc., but we cannot do that until we can see what the data itself looks like. To do this we will analyse the whole dataset, for which we will temporarily combine the training and test data frames in a new variable called combine. As we said before, we will not be performing our training on this combined dataset. combine = [train_df, test_df] We could perform this analysis on the training data (train_df) only, but let us analyse the entire dataset. Analysis of the Data While analysing the dataset, let us try and answer the following questions: 1. Which features are available in the dataset? 2. Which features are categorical? 3. Which features are numerical? 4. Which features are mixed data types? 5. Which features may contain errors or typos? 6. Which features contain blank, null or empty values? 7. What are the data types for the various features? 8. What is the distribution of numerical feature values across the samples? 9. What is the distribution of categorical features? 1. Which features are available in the dataset? Let us print the entire data frame to see what it looks like. As you already know, to print a value in a notebook, we just type the name of the variable on a new line. Hence, train_df will print the dataset and you will see the entire dataset at a single glance.
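To make the inspection step concrete, here is a hedged sketch on a tiny hand-built frame; the real train_df has 12 columns and many more rows, and this toy frame only mimics its layout. The pandas calls shown (.columns, .head() and .describe()) are the ones the questions in this section nudge you towards, so treat this as a reference, not the answer sheet.

```python
import pandas as pd

# Toy stand-in for train_df (the real frame has 12 columns)
df = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived':    [0, 1, 1, 0],
    'Pclass':      [3, 1, 3, 2],
    'Sex':         ['male', 'female', 'female', 'male'],
    'Age':         [22.0, 38.0, 26.0, None],   # None shows up as NaN
})

print(df.columns.tolist())         # all the feature names
print(df.head(2))                  # the first few samples
print(df.describe())               # count/mean/std/min/max for numeric features
print(df.describe(include=['O']))  # count/unique/top/freq for categorical features
```

Running the same calls on your loaded train_df answers questions 1, 8 and 9 directly.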
The features are the column names shown in bold (PassengerId, Survived, Pclass, Name, etc). These are the features of the dataset and, as you can see, they are defined for every row. Each row represents a passenger on the Titanic; formally, every row is a sample in the dataset and each sample has features that are common across the dataset. NaN stands for “not a number” and can be interpreted as a value that is undefined or unrepresentable. Google a command that shows the columns in a data frame. Your code should print out all the column names: 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'. There is an additional function of data frames that can be used to display a few samples in the dataset. Which function is this? 2. Which features are categorical? In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values. For example, sex (not gender) can be of two types, male and female. A variable that can only take on values “A” or “B”, or True or False, is also an example of categorical data. For this exercise, simply look through the dataset that you have printed, look at the values for each feature and note down the ones you think are categorical. 3. Which features are numerical? Numerical data is data that takes on numeric values. Note down the features that you think have numeric data. 4. Which features are mixed data types? Mixed data is a mix of numeric and alphanumeric data, or simply alphanumeric data, in the same feature. Note down the features that you think have a mixed data type. 5. Which features may contain errors or typos? Think about this question as something to do with human error. Perhaps the person recording all the people who embarked on the Titanic was hard of hearing and sometimes wrote down the wrong thing. Which feature could be affected by this? 6.
Which features contain blank, null or empty values? This information will be helpful to us when we are correcting the data. Look through the data for such missing values and note these features down. 7. What are the data types for the various features? This can be useful when handling the data, for example, when adding in missing values. We might say that the categorical, numeric and mixed data types are already something we know, so why not use that information? However, here we are looking for the specific Python data types. Google ways to get columnar data types for data frames. To check your answer, the data type of PassengerId is int64. 8. What is the distribution of numerical feature values across the samples? Here we try to find various statistical measures of our features that can help us determine whether the training data is actually representative of the problem we are trying to solve. If it isn’t, our predictions will not perform well on the test data. We can apply several statistical measures like count, mean, standard deviation, min and max. We could spend our time writing code to go through each column of the data frame and return the mean of that column, but that would be inefficient. The reason we are using pandas is its fantastic set of built-in functions. Try to find the pandas function that describes the distribution of numerical feature values for a data frame. It is a single function that will return the values of count, mean, standard deviation, min and max, among others, for each column (also called feature) of the dataset. 9. What is the distribution of categorical features? If you noticed, the function used above is called on the entire dataset but only returns statistics for numeric data. To make it return an analysis of the categorical features as well, include this in the parameters of the function: include=['O'] 3] Data Cleaning Missing Values We have seen that a number of features have missing values.
The features that have null values are Age, Embarked and Cabin. Let us now see how we can complete these features. The number of missing values for each of these features can be found by running the code given below (replace 'column name' with the column you are checking). print("Percent of missing 'column name' records is %.2f%%" % ((train_df['column name'].isnull().sum()/train_df.shape[0])*100)) Age Feature One way to complete a feature is by assigning all the null values to randomly generated values. But that would introduce outliers (samples that do not follow the pattern in the data and can cause us to make wrong predictions). So don’t do that. Another thing we can do is replace the missing values with the mean of the sample ages that are already there. This would not change the mean or greatly affect the pattern in the data, so we will go ahead with it. Let us now write the code for this: import math age_mean = #your code here num_rows = #your code here for i in range(num_rows): #basic for loop if math.isnan(#your code here): #your code here = age_mean train_df #to see the updated dataframe The functions needed here can be found by googling what you want to do and looking through the documentation. Search for how to find the number of rows in a dataframe, how to get the mean of a dataframe or of a particular column, etc. Embarked Feature The Embarked feature is a categorical feature that takes the value S, Q or C depending on the port of embarkation. In the Age feature, we replaced NaN values with the mean value. When we check the number of NaN values for the Embarked feature, we see that it is quite low, 0.22% (go ahead and check if you haven’t already). Hence we can simply replace the missing values with the most frequent value, ie, the port from which most people embarked, as this would not change any pattern that the data is currently showing.
If you remember correctly, this is another sort of measure called the mode, ie, the most frequently occurring data sample. As in the example above, we could use a loop to go through the “Embarked” column and replace each NaN value with the mode. However, there are better and more efficient ways to go about this. There is a function that fills only the NaN values in a dataset with a particular value (or set of values). Let us use that now instead. train_df["Embarked"].#your code here(train_df["Embarked"].value_counts().idxmax(), inplace=True) If you go through the dataset now by printing it, you will see that the values that were previously NaN are now filled with the mode of ‘Embarked’. Cabin Feature Now let us look at the Cabin NaN values. When we look at the number of values that are missing in the Cabin feature, we see that it is quite high, greater than 50%. It would be better to simply drop the column entirely than to try to come up with randomly generated values for the missing entries. This is because if we fill it in with random values we may accidentally introduce some pattern into the data that isn’t really there, or increase the variance. If this column is so incomplete, our ‘Survived’ column, ie, the one which says which passengers survived the Titanic disaster, probably does not depend on the ‘Cabin’ feature anyway, so it is safe to drop it from our dataset entirely. Search for how to drop columns in dataframes. train_df.#your code here(#your code here, axis=1, inplace=True) Now that we have cleaned our data, let us ensure that there are no missing values left in the dataset, in case there is some feature with missing values that we have missed. Running the following code will show us if there are any more missing values in the dataset. train_df.isnull().sum() You should get a column of zeros. Now print train_df to look at your cleaned dataset!
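Putting the whole cleaning section together, here is a hedged sketch of one possible solution. It runs on a tiny hand-built frame rather than the real train_df, so the column values are made up, but the steps (mean-fill Age, mode-fill Embarked, drop Cabin) are the same.

```python
import math
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps as the real data
train_df = pd.DataFrame({
    'Age':      [22.0, np.nan, 26.0, np.nan, 35.0],
    'Embarked': ['S', 'C', np.nan, 'S', 'Q'],
    'Cabin':    [np.nan, 'C85', np.nan, np.nan, 'E46'],
})

# Age: fill missing values with the column mean (NaNs are skipped by .mean())
age_mean = train_df['Age'].mean()
num_rows = train_df.shape[0]
for i in range(num_rows):
    if math.isnan(train_df.loc[i, 'Age']):
        train_df.loc[i, 'Age'] = age_mean

# Embarked: fill the few missing ports with the most frequent value (the mode)
train_df['Embarked'] = train_df['Embarked'].fillna(
    train_df['Embarked'].value_counts().idxmax())

# Cabin: mostly empty, so drop the whole column
train_df.drop('Cabin', axis=1, inplace=True)

print(train_df.isnull().sum())  # every count should now be zero
```

Note that the Age loop could also be written as a one-line fillna, exactly like the Embarked step; the loop is kept here only because it mirrors the tutorial's skeleton.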
Categorical to Numeric Data Machine learning models can handle numeric data very well but often find it hard to do the same with categorical data. This is because categorical data usually uses words to represent a certain thing, like ‘M’ for male and ‘F’ for female in the Sex column. Having categorical data in your model will cause an error message like “Your dataset has categorical data while your model requires numeric data” to pop up. This is why we need to convert our categorical data into numeric data. To do this we could simply assign a number to each label, ie, the Sex column could be transformed into a column of 1s and 2s, with ‘F’ assigned 1 and ‘M’ assigned 2. However, it is better still to use a single binary column: you only need to know whether a passenger is female or not, so there is no need for both a male and a female column. Thus, you transform the ‘Sex’ column into a binary column of 1s and 0s indicating whether the passenger is female. Similarly, a column like Pclass, with 3 levels of classes, can be split into 3 binary columns, Pclass_1, Pclass_2 and Pclass_3, each answering a question like “was this passenger in Pclass 1?” This can be done with a function called get_dummies from the pandas library. Think about the feature columns that need to be transformed to numeric data (there are 3) and run the following code. You should also drop the 'Sex_male' column, since we only need one sex indicator. You already know how to drop columns. train_df = pd.get_dummies(train_df,columns=[#your code here]) #drop the 'Sex_male' column here train_df You will now see that your dataset looks like this: Pclass has been replaced with 3 new columns, Pclass_1, Pclass_2 and Pclass_3, and the same goes for Embarked. You have to apply these same transformations to the test dataset as well. This is because your model will be trained on a specific set of features and will expect the same features when testing the data.
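On a toy frame, the one-hot encoding step described above looks like this; a hedged reference sketch, not the competition solution, with made-up rows standing in for the real data.

```python
import pandas as pd

train_df = pd.DataFrame({
    'Pclass':   [1, 3, 2],
    'Sex':      ['female', 'male', 'female'],
    'Embarked': ['C', 'S', 'S'],
})

# One-hot encode the three categorical features
train_df = pd.get_dummies(train_df, columns=['Pclass', 'Sex', 'Embarked'])

# Keep a single sex indicator: Sex_female already answers the question,
# so its mirror-image column Sex_male can go
train_df.drop('Sex_male', axis=1, inplace=True)

print(train_df.columns.tolist())
```

The printed column list shows Pclass and Embarked each replaced by one binary column per level, plus the single Sex_female indicator.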
So apply all the data cleaning and categorical-to-numeric conversions to the test dataset as well. 4] Feature Selection Dropping Features Let us now think about the features that we have in our dataset. There are 3 features that probably don’t affect the survival rate. For example, “Name” would not affect whether a passenger survived the Titanic disaster. Rose didn’t survive because of her name and neither did Jack die because of his name. We can drop the entire column and it will be fine. Figure out the two other features and drop them as well. You have already seen how to drop features. Adding Features Both SibSp and Parch relate to travelling with family. For simplicity's sake (and so that the same effect isn’t ‘counted’ twice, ie, multicollinearity), let us combine the effect of these two variables into one feature that says whether or not that individual was travelling alone. import numpy as np train_df['TravelAlone']=np.where((train_df["SibSp"]+train_df["Parch"])>0, 0, 1) Now drop the ‘SibSp’ and ‘Parch’ features from the dataset. Remember that your model will be trained on a specific set of features and will expect the same features when testing, so apply these transformations to the test dataset too. NOTE: Check for missing values in the test data carefully. There might be new features with missing values, in which case think about what has to be done. Now your data has been cleaned and we can move on to model selection, training and testing. 5] Model Selection and Training While analysing our data we figured out that, given a sample person’s data, we need to predict whether he survives the Titanic disaster or not. Notice that we aren’t being asked to predict the probability that he will survive the disaster.
We simply have to answer “YES” or “NO” to the question, “Did this person survive the Titanic disaster?” There are various types of machine learning algorithms, but the ones that can be used here belong to a class called classification algorithms. Classification algorithms are used to answer yes/no questions, such as: “Is this a human?”, “Is this a pen?”, and “Did this person survive the Titanic?” There are a number of classification algorithms we can use, including Logistic Regression, which we will work with below.
You might be wondering how you will be able to implement these algorithms if you haven’t even heard their names before, especially since AICVS hasn’t yet conducted any sessions on them. The beauty of machine learning and Python is that you don’t need to know the details of the implementation to use an algorithm. Think of it like using a programming language. You learn a programming language like Java by learning how to write if-else statements, then how to write functions and classes and create objects, and so on. You might be an expert Java programmer and still not know how Java was created, or how your machine understands your Java code. But that’s because it does not matter to you, at that moment. You might not know the specifics of programming language design or compilers, but that does not matter either. In the same way, you need to know what the models are doing, but it’s alright if you don’t know how they’re doing it. Of course, as you go further into ML, you can and should learn the specifics of each model, how they differ from each other, and which model to choose over the others for a specific type of dataset. For now, however, let us treat them as black boxes. Python has very extensive libraries with which we can implement these algorithms in only a few lines of code. Let us look into the Logistic Regression model now and see how it performs on our test data after training. Logistic Regression Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. (Reference: Wikipedia) Let us look at the Python code for logistic regression. We first split our data into X_train, X_test, y_train and y_test using train_test_split.
from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(train_df.drop('Survived',axis=1), train_df['Survived'], test_size=#your code here, random_state=101) We now import the Logistic Regression model from the scikit-learn library and create an instance of it. from sklearn.linear_model import LogisticRegression log_model = #your code here Remember that we had two data files, test and train. We’ve also previously explained why we need two files and cannot train the model on the whole dataset. The model that we have just imported from the scikit-learn library is untrained, so we need to train it on the training data. Search for how that is done. log_model.fit(#your code here) Now that our model has been trained on our training dataset, let us test it on the held-out test split. This will help us understand how accurate our model is on new samples, ie, samples it hasn’t seen before. predict = log_model.#your code here(X_test) Now that we have a list of predictions on X_test and we have the “correct” answers for that data, ie, y_test, we can compute an accuracy score for our model. There are various metrics we can use to measure performance, including precision, recall and F1 score. We can calculate them all at the same time using classification_report. from sklearn.metrics import classification_report print(classification_report(y_test,predict)) This will print out the values of precision, recall and F1 score for 1 (survived) and 0 (did not survive). The accuracy row will give you your model’s accuracy (eg: 0.79 == 79%). Conclusion Now that you know how to go about fitting one model, why not try looking up how to fit other models? Better yet, take another dataset (there are hundreds available on Kaggle) and see if you can reach the same accuracy level. Additionally, look at other people’s solutions, videos and tutorials to see how to approach a dataset in different ways.
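Putting the training-and-evaluation steps together, here is a hedged end-to-end sketch. Since we cannot bundle the Titanic files here, it runs on a small synthetic dataset with a binary target; with your real cleaned train_df, X and y would simply be train_df.drop('Survived', axis=1) and train_df['Survived'].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic stand-in for the cleaned features: 200 samples, 3 columns,
# with a 0/1 target driven by the first column
rng = np.random.RandomState(101)
X = rng.randn(200, 3)
y = (X[:, 0] > 0).astype(int)

# Hold out 30% for testing (0.2 or 0.3 are common choices for test_size)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

log_model = LogisticRegression()
log_model.fit(X_train, y_train)        # train on the training split only
predict = log_model.predict(X_test)    # predict on samples the model never saw

print(classification_report(y_test, predict))
```

The report's accuracy row on the real Titanic data will of course differ; the point is the shape of the pipeline: split, fit, predict, evaluate.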
Soon you will grow comfortable with tackling datasets and will be able to take on anything thrown at you. Happy Learning! Drop us an email at: [email protected] to tell us how you liked this tutorial on approaching the Titanic dataset, or to tell us about any improvements that would make it easier for other beginners to tackle this dataset. Written by Michelle Davies Thalakottur. For our very first session, we asked an AICVS veteran, Ravina More, to talk about the basics of machine learning and how to go about it, taking a top-down approach. A systems engineer at Tata Research Development and Design Centre (TRDDC), she spoke about getting started with ML, shared some "cheat sheets" and stressed the importance of research for undergraduates. She started off with the analogy of a human throwing balled-up paper at the trash and how the accuracy of the throw improves after a few tries. This served as a good introduction to the definition of machine learning: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” --Tom Mitchell, 1997 Basically, the machine learns from the experience of performing a task, thereby improving its overall performance at that task. In supervised learning, we split the data (usually 80:20) and train on the first set while leaving the second for testing. The trained model then works its charms on the test data and we gauge its performance accordingly. Classification is used when we have categorical values (ie, non-continuous, non-numeric predictions). For example, we use classification in predicting the presence of cancerous cells, or in predicting whether a loan application should be approved or not. Regression, on the other hand, is for continuous (non-discrete) data. So, we use regression to predict house or airline prices.
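The classification-versus-regression distinction from the talk can be seen directly in scikit-learn, where the two tasks use parallel estimators. The numbers below are made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Made-up single feature (eg, a size measurement)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Classification: categorical target (eg, 0 = rejected, 1 = approved)
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[1.5], [5.5]]))   # discrete class labels

# Regression: continuous target (eg, a price)
y_reg = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[3.5]]))          # a continuous value
```

Same library, same fit/predict interface; only the type of the target, and hence of the prediction, changes.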
In unsupervised learning, no labelled outputs are given to the algorithm, so it must find its own structure in the input data. Thus, in unsupervised learning, the algorithm must find the patterns in the data by itself. We looked at clustering in unsupervised learning, which is when the machine finds similarities in the data and groups the data points into "clusters". Next, she spoke about overfitting and underfitting a model, both of which hurt accuracy. In overfitting, the model fits the given data points too closely and hence does not generalize, making it inaccurate on newer data. Underfitting occurs when the model hasn't been trained enough and can't make adequately accurate predictions on the test data. Two students, Kinjal Parikh and Elice Priyadarshini, then spoke about their experience with machine learning projects. You can find their blog here. Check out Ravina's presentation below:
Written by Ruchi Pendse