Data mining project: Titanic survival prediction

Keywords: Python

Data set comes from kaggle classic competition data set

I. purpose
According to the information in the data set, we use python machine learning to predict the survival of Titanic passengers.
2, Dataset
My dataset has three formats: test, train and genderclassmodel, all in csv format

Fields in test and train data sets:

From left to right, the number of passengers, whether they are alive or not, bin, name, gender, age, number of relatives of the same generation on board, number of passengers with parents or children, ticket number, travel expenses, cabin number, port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
The genderclassmodel is the actual survival result of the test set:

3, Training ideas
Through the observation and experience of the data set, in the data set, such as passenger name, ticket number and number, these attributes have nothing to do with the result obviously and are useless information, which can be removed first. There are many null values in the data. Missing values can also be processed first. For the sample of cabin number, it is too small to delete this advanced feature.
We used KNN and random forest for training.

4, Data cleaning and feature processing`
Introduce training set and test set, delete columns of name, bed ticket number and cabin number

import numpy as np
import pandas as pd 
#1 \ data import
df_train=pd.read_csv('data/titanic/train.csv')
df_train=df_train.drop(['Name','Ticket','Cabin'],axis=1)
df_test=pd.read_csv('data/titanic/test.csv')
df_test=df_test.drop(['Name','Ticket','Cabin'],axis=1)

When handling the missing values, age and fare have more missing values. I use the average value to fill in, but there are two missing values at the port of embarkation, so they are deleted directly.

#2. Missing value handling
    df_train['Age']=df_train['Age'].fillna(df_train['Age'].mean())
    df_train['Fare']=df_train['Fare'].fillna(df_train['Fare'].mean())
    df_train=df_train.dropna()
    df_test['Age']=df_test['Age'].fillna(df_test['Age'].mean())
    df_test['Fare']=df_test['Fare'].fillna(df_test['Fare'].mean())

Normalization and standardization of eigenvalues
Since the gender and port of embarkation are string types, we need to convert them into values, replacing them with 0, 1, 2. I use normalization to convert other values into numbers between 0-1.

 #3. Feature processing
    from sklearn.preprocessing import MinMaxScaler,StandardScaler
    from sklearn.preprocessing import LabelEncoder,OneHotEncoder
    from sklearn.preprocessing import LabelEncoder,OneHotEncoder
    scaler_list=[Age,Fare,Pclass,SibSp,Parch]
    column_list=['Age','Fare','Pclass','SibSp','Parch']
    for i in range(len(scaler_list)):
        if not scaler_list[i]:
            df_train[column_list[i]]=MinMaxScaler().fit_transform(df_train[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
            df_test[column_list[i]]=MinMaxScaler().fit_transform(df_test[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
        else:
            df_train[column_list[i]]=StandardScaler().fit_transform(df_train[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
            df_test[column_list[i]]=StandardScaler().fit_transform(df_test[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
    scaler_list=[Sex,Embarked]
    column_list=['Sex','Embarked']
    for i in range(len(scaler_list)):
        if scaler_list[i]:
            df_train[column_list[i]]=LabelEncoder().fit_transform(df_train[column_list[i]])
            df_train[column_list[i]]=MinMaxScaler().fit_transform(df_train[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
            df_test[column_list[i]]=LabelEncoder().fit_transform(df_test[column_list[i]]) 
            df_test[column_list[i]]=MinMaxScaler().fit_transform(df_test[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]                                                                                                                     
    return df_train,test
def main():
    print(titanic_processing())
if __name__=='__main__':
    main()

Display result

After the features are processed, you can start modeling

Here, I choose three models for training, KNN, SVM and random forest. This model is the judgment of survival, so we choose classification model for prediction.
What's worth mentioning here is that in the data I get, the training set contains the annotation Survived. First, we need to remove it. The first step is to separate the feature part and the annotation part of the training set and the test set.

y_train=df_train['Survived'].values
    x_train=df_train.drop('Survived',axis=1).values
    y_class=pd.read_csv('data/titanic/genderclassmodel.csv')
    y_test=y_class['Survived'].values
    x_test=df_test.values

Then we start modeling. We need to select three models and compare the scores of each model, so I created a list of models.

 from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import NearestNeighbors,KNeighborsClassifier
    from sklearn.metrics import accuracy_score,recall_score,f1_score
    models=[]
    models.append(('KNN',KNeighborsClassifier(n_neighbors=3)))
    models.append(('SVM',SVC(C=100)))
    models.append(('Ramdomforest',RandomForestClassifier()))

Then model fitting and prediction are carried out.

  for clf_name,clf in models:
        clf.fit(x_train,y_train)
        xy_list=[(x_train,y_train),(x_test,y_test)]
        for i in range(len(xy_list)):
            x_part=xy_list[i][0]
            y_part=xy_list[i][1]
            y_pred=clf.predict(x_part)
            print(i)
            print(clf_name,'ACC:',accuracy_score(y_part,y_pred))
            print(clf_name,'REC:',recall_score(y_part,y_pred))
            print(clf_name,'f1:',f1_score(y_part,y_pred))

The complete code is as follows:

import numpy as np
import pandas as pd 
#1 \ data import
def titanic_processing(Age=False,Fare=False,Pclass=False,SibSp=False,Parch=False,Sex=True,Embarked=True):
    df_train=pd.read_csv('data/titanic/train.csv')
    df_train=df_train.drop(['Name','Ticket','Cabin','PassengerId'],axis=1)
    df_test=pd.read_csv('data/titanic/test.csv')
    df_test=df_test.drop(['Name','Ticket','Cabin','PassengerId'],axis=1)
#2. Missing value handling
    df_train['Age']=df_train['Age'].fillna(df_train['Age'].mean())
    df_train['Fare']=df_train['Fare'].fillna(df_train['Fare'].mean())
    df_train=df_train.dropna()
    
    df_test['Age']=df_test['Age'].fillna(df_test['Age'].mean())
    df_test['Fare']=df_test['Fare'].fillna(df_test['Fare'].mean())
   
    #3. Feature processing
    from sklearn.preprocessing import MinMaxScaler,StandardScaler
    from sklearn.preprocessing import LabelEncoder,OneHotEncoder
    from sklearn.preprocessing import LabelEncoder,OneHotEncoder
    scaler_list=[Age,Fare,Pclass,SibSp,Parch]
    column_list=['Age','Fare','Pclass','SibSp','Parch']
    for i in range(len(scaler_list)):
        if not scaler_list[i]:
            df_train[column_list[i]]=MinMaxScaler().fit_transform(df_train[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
            df_test[column_list[i]]=MinMaxScaler().fit_transform(df_test[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
        else:
            df_train[column_list[i]]=StandardScaler().fit_transform(df_train[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
            df_test[column_list[i]]=StandardScaler().fit_transform(df_test[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
    scaler_list=[Sex,Embarked]
    column_list=['Sex','Embarked']
    for i in range(len(scaler_list)):
        if scaler_list[i]:
            df_train[column_list[i]]=LabelEncoder().fit_transform(df_train[column_list[i]])
            df_train[column_list[i]]=MinMaxScaler().fit_transform(df_train[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
            df_test[column_list[i]]=LabelEncoder().fit_transform(df_test[column_list[i]]) 
            df_test[column_list[i]]=MinMaxScaler().fit_transform(df_test[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]                                                                                                                     
    return df_train,df_test
def titanic_model(df_train,df_test):
    y_train=df_train['Survived'].values
    x_train=df_train.drop('Survived',axis=1).values
    y_class=pd.read_csv('data/titanic/genderclassmodel.csv')
    y_test=y_class['Survived'].values
    x_test=df_test.values
    from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import NearestNeighbors,KNeighborsClassifier
    from sklearn.metrics import accuracy_score,recall_score,f1_score
    models=[]
    models.append(('KNN',KNeighborsClassifier(n_neighbors=3)))
    models.append(('SVM',SVC(C=100)))
    models.append(('Ramdomforest',RandomForestClassifier()))
    for clf_name,clf in models:
        clf.fit(x_train,y_train)
        xy_list=[(x_train,y_train),(x_test,y_test)]
        for i in range(len(xy_list)):
            x_part=xy_list[i][0]
            y_part=xy_list[i][1]
            y_pred=clf.predict(x_part)
            print(i)
            print(clf_name,'ACC:',accuracy_score(y_part,y_pred))
            print(clf_name,'REC:',recall_score(y_part,y_pred))
            print(clf_name,'f1:',f1_score(y_part,y_pred))
def main():
    df_train,df_test=titanic_processing()
    titanic_model(df_train,df_test)
if __name__=='__main__':
    main()

Let's look at the evaluation results:

0
KNN ACC: 0.875140607424072
KNN REC: 0.8
KNN f1: 0.8305343511450382
1
KNN ACC: 0.8181818181818182
KNN REC: 0.8156028368794326
KNN f1: 0.7516339869281047
0
SVM ACC: 0.84251968503937
SVM REC: 0.6764705882352942
SVM f1: 0.7666666666666666
1
SVM ACC: 0.8779904306220095
SVM REC: 0.8014184397163121
SVM f1: 0.8158844765342961
0
Ramdomforest ACC: 0.9820022497187851
Ramdomforest REC: 0.9647058823529412
Ramdomforest f1: 0.9761904761904762
1
Ramdomforest ACC: 0.7990430622009569
Ramdomforest REC: 0.6950354609929078
Ramdomforest f1: 0.7

0 for training set, 1 for test set.
In general, the results of KNN are better than those of the other two. The fitting effect of random forest training set is very good, but the test set is not satisfactory.

Summary:
We are familiar with the basic process of data mining and machine learning through the actual combat of the survival prediction project of Titanic. Of course, machine learning is a very deep topic, and we need to continue to study hard to understand the algorithm principle of various models, so as to select the appropriate model for analysis in the specific project.

Published 1 original article, praised 0 and visited 3
Private letter follow

Posted by jplock on Sat, 22 Feb 2020 22:20:40 -0800