For those of you who are not familiar with Kaggle.com, it is a competition website for data science and machine learning problems. Commercial and non-profit companies and organizations open their problems to the public in the hope of finding solutions to their data problems or improving the performance of their existing ones.
No matter what your level of expertise in machine learning is, there are many beginner-level problems to learn from and solve. There are also several real-world competitive problems that, if solved efficiently, grant the winning teams up to $100,000 as well as reputation points on the leaderboard. It is also a great source for learning machine learning approaches to problem solving. For more information on Kaggle, I suggest browsing their homepage.
The “Titanic: Machine Learning from Disaster” problem is considered a getting-started problem for beginners to familiarize themselves with the basic concepts and techniques of machine learning. The problem is as follows: we are given two datasets, a training dataset (891 records) and a testing dataset (418 records), which together contain information on the 1,309 passengers who were onboard the Titanic (Name, Sex, Ticket, Fare, Class, etc.). The training dataset contains each passenger’s information along with a binary column “Survived” indicating whether that particular passenger survived (1) or not (0). The testing dataset, however, only contains the passenger information without the “Survived” column. Now, using machine learning algorithms and techniques, can you predict who survived in the testing dataset given what you have learned from the data in the training set?
In this post I will explain how you would go about loading, preprocessing, cleaning and scaling your data. Then, how to decide which features are most critical for your machine learning algorithm, which in our case is the Random Forest Classifier. We will also look into how to optimize our classifier’s hyperparameters, such as the number of trees in the forest, the number of features to consider during splits, and so on. Finally, once we have clean, preprocessed data, the most important features to use and a fine-tuned classifier, we will fit the data to the classifier and come up with a prediction. The complete Python code is available on GitHub.
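To give a flavor of what that tuning step looks like, here is a minimal sketch using Scikit-Learn’s GridSearchCV. The parameter values and the variable names X_train and y_train are purely illustrative, not the exact code from the repository.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# illustrative grid over the hyperparameters mentioned above
param_grid = {
    "n_estimators": [100, 300, 500],   # number of trees in the forest
    "max_features": ["sqrt", "log2"],  # features considered at each split
}

grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
# grid.fit(X_train, y_train)  # X_train / y_train come from the preprocessed training data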
This solution requires you to be familiar with the Python programming language. We will use the famous data manipulation library Pandas to easily manipulate and massage our datasets. Moreover, we will use Matplotlib to plot some figures that give us insight into our data and classifier. Finally, and most importantly, we will rely on the power of Scikit-Learn for the machine learning part of this solution.
The starting execution point of the program is main.py. The following code snippet from main.py shows how we broke down the problem into 6 major sub-problems.
# load data
print("Loading Data...")
DIR = "./data"
train_df, test_df = load_data(DIR)

# preprocess, massage, scale, merge and clean data
print("Preprocessing Data...")
train_df, test_df = preprocess_data(train_df, test_df)

# use only most important features
print("Extracting Most Important Features...")
train_df, test_df = use_most_important_features(train_df, test_df)

# optimize hyperparameters
print("Optimizing Hyperparameters...")
optimize_hyperparameters(train_df)

# plot learning curves
print("Plot Learning Curves...")
plot_learning_curves(train_df)
plot_ROC_curve(train_df)

# predict survival
print("Predict Survival...")
predict_survival(train_df, test_df)
The sub-problems are:
1. Loading the datasets.
2. Preprocessing the data.
3. Extracting most important features.
4. Optimizing the classifier’s hyperparameters.
5. Plotting learning curves.
6. Making a prediction.
1. Loading the Datasets
The training and testing CSV files are located in the data folder. I obtained them originally from Kaggle’s Titanic page. Our aim in this section is to read both the training and testing datasets properly from CSV and load them into memory. Because the datasets are fairly small, we can load them directly into memory. The script that defines the function load_data(directory) is located in utils/load.py.
from os.path import join

import pandas as pd


# utility function to load the data files into two pandas dataframes
def load_data(directory):
    train_df = pd.read_csv(join(directory, 'train.csv'), header=0)
    test_df = pd.read_csv(join(directory, 'test.csv'), header=0)
    return train_df, test_df
Note that both train_df and test_df are pandas DataFrames. A dataframe is a cool and very flexible data structure that allows us to manipulate tabular, heterogeneous data easily. If you are running the code in a Python shell (hopefully IPython), you can do something like train_df.info() to get information on the training dataframe, or train_df.head(10) to print the first 10 records in the dataframe. Learning about Pandas can serve you a long way if you are serious about your data endeavors.
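For instance, assuming you have already called load_data as above, a quick interactive inspection might look like this (illustrative only; the output is summarized in the next paragraphs):

train_df, test_df = load_data("./data")

train_df.info()      # column names, dtypes and non-null counts
train_df.head(10)    # first 10 records of the training set
train_df.describe()  # summary statistics for the numeric columns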
Once we load the data, running train_df.info() and test_df.info() gives us a bunch of interesting information on each dataframe. The training dataset contains 891 entries indexed from 0 to 890. This index is not to be confused with PassengerId; it is an iterator that pandas assigns automatically and is not part of the dataset itself. The columns are (PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked). If you look closely you will notice that the Age, Cabin and Embarked columns have missing values. Moreover, you can see that the datatypes of the columns are heterogeneous, meaning they are not all of the same type, which is a cool property of pandas dataframes.
test_df, on the other hand, has 418 entries and one column less, namely the Survived column. In fact, we have to come up with the values of the Survived column and submit them as our result to the Kaggle website. You can also notice here that there are missing values as well (Age, Fare and Cabin).
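A quick way to confirm which columns have missing values, and how many, is to count the nulls per column. This small snippet is illustrative and not part of the original scripts:

# number of missing values per column in each dataframe
print(train_df.isnull().sum())
print(test_df.isnull().sum())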
The title of each column is pretty much self-explanatory, except for a few that might look confusing:
Pclass = Passenger Class
SibSp = Siblings and/or Spouse
Parch = Parents and/or Children
To be continued…