Titanic survivor prediction

Keywords: Machine Learning sklearn Decision Tree

Preface

With the continuous development of artificial intelligence, machine learning technology is becoming more and more important, and many people have started learning it. This article introduces the basics of machine learning through a worked example: predicting Titanic survivors with a decision tree classifier.

The following is the main content of this article; the case below can be used for reference.

Steps

1. Import libraries

import pandas as pd
from sklearn.tree import DecisionTreeClassifier  # classification tree
from sklearn.model_selection import train_test_split  # split data into training and test sets
from sklearn.model_selection import GridSearchCV  # grid search for parameter tuning
from sklearn.model_selection import cross_val_score  # cross validation
import numpy as np
import matplotlib.pyplot as plt

2. Read the csv file with pandas

data = pd.read_csv('data.csv')

3. Explore the data in the csv file

Basic information about the data:

data.info()

Look at the first five rows of data to see what it looks like:

data.head()

4. Data preprocessing

4.1 Delete features irrelevant to the model

From the output above, Name, Ticket, and Cabin are essentially useless for training the model, so these unnecessary features are deleted.

data.drop(['Name','Ticket','Cabin'],inplace=True,axis=1)
'''
axis=1 : drop along the column axis, i.e. delete these columns
inplace=True : overwrite the original data in place
'''
data.head()
data.info()


4.2 Convert all non-numeric features to numeric features

Note: the model can only process data of numeric type

Convert the type of Sex (gender is only male or female, so this is a binary feature, and binary features have a simple trick):

data.loc[:,'Sex'] = (data.loc[:,'Sex'] == 'male').astype('int')
'''
loc[:,'Sex']: take all rows of data, column Sex
(data.loc[:,'Sex'] == 'male') tests each value and returns bool; casting bool to int turns True into 1 and False into 0
'''
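For a binary feature like this, an equivalent alternative is an explicit mapping (a small sketch, not part of the original steps), which makes the 0/1 meaning visible at a glance:

data.loc[:,'Sex'] = data.loc[:,'Sex'].map({'male': 1, 'female': 0})
'''
map: replaces each value according to the dictionary, so 'male' becomes 1 and 'female' becomes 0
'''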

Convert the type of Embarked:

label = data.loc[:,'Embarked'].unique().tolist()
#Map each value of the original Embarked column to its index in label (an int)
data.loc[:,'Embarked'] = data.loc[:,'Embarked'].apply(lambda x: label.index(x))
>label: ['S', 'C', 'Q']
'''
unique: take out all the values of Embarked with duplicates removed
tolist: convert the resulting array to a list
apply: apply the function in parentheses to each element
'''
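Alternatively, pandas' pd.factorize does the same unique-then-index mapping in one call (a sketch; note the different handling of missing values):

codes, uniques = pd.factorize(data.loc[:,'Embarked'])
data.loc[:,'Embarked'] = codes
'''
factorize: returns an integer code for every value plus the array of unique values
note: missing values are encoded as -1 instead of being mapped like the values above
'''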

4.3 Unify the amount of data (handle missing values)

For Age, use fillna to fill the missing ages with the average age:

data.loc[:,'Age'] = data.loc[:,'Age'].fillna(data.loc[:,'Age'].mean())

For Embarked, only two values are missing; dropping those two rows entirely has little impact on the whole dataset:

data = data.dropna(axis=0)
'''
dropna: filter out missing data
data.dropna(how = 'all')    # drop only the rows whose values are all missing
data.dropna(axis = 1)       # drop columns with missing values (rarely done, since it deletes a feature)
data.dropna(axis=1,how="all")   # drop the columns whose values are all missing
data.dropna(axis=0,subset = ["Age", "Sex"])   # drop rows with missing values in the Age or Sex columns
'''
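After the cleanup, it is worth confirming that no missing values remain (a small sketch):

data.info()
print(data.isnull().sum())   # every column should now report 0 missing values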

5. Separate the features and labels (split the Survived results from the rest of the data)

x = data.loc[:,data.columns != 'Survived']
y = data.loc[:,data.columns == 'Survived']

6. Divide the data set into training and test sets

Xtrain, Xtest, Ytrain, Ytest = train_test_split(x,y,test_size=0.3)

7. Reset the indices of the split training and test sets (form a good habit)

for i in [Xtrain, Xtest, Ytrain, Ytest]:
    i.index = range(i.shape[0])  # renumber the index from 0, since train_test_split keeps the shuffled original indices

8. Train the model

Training a model is a matter of trying more options and finding the best one.

Train the model directly, the normal way:

clf = DecisionTreeClassifier(random_state=20)
clf = clf.fit(Xtrain,Ytrain)
score_ = clf.score(Xtest,Ytest)

Evaluate the model with cross validation:

score = cross_val_score(clf,x,y,cv=10).mean()
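To compare the two approaches side by side (a small sketch):

print('holdout score:', score_)
print('cross-validation mean score:', score)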

Plot the cross-validation results to compare the scores at different tree depths:

tr = []
te = []
for i in range(10):
    clf = DecisionTreeClassifier(random_state=20
                                ,max_depth = i+1
                                ,criterion = 'entropy'
                                )
    clf = clf.fit(Xtrain,Ytrain)
    score_tr = clf.score(Xtrain,Ytrain)
    score_te = cross_val_score(clf,x,y,cv=10).mean()
    tr.append(score_tr)
    te.append(score_te)

plt.figure()
plt.plot(range(1,11),tr,color='red',label='train')
plt.plot(range(1,11),te,color='blue',label='test')
plt.xticks(range(1,11))
plt.legend()
plt.show()
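From the curves you can read off where the cross-validation score peaks; a small sketch to print it directly:

best_depth = te.index(max(te)) + 1   # +1 because depths start at 1
print('best max_depth:', best_depth, 'cv score:', max(te))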

9. Tune the optimal parameters through grid search

The parameters in parameters are combined as required. Grid search tries every combination, so if your computer cannot handle too many parameters at once, it is best to combine them in pairs.

parameters = {'splitter': ('best','random')
             ,'criterion': ('gini','entropy')
             ,'max_depth': [*range(1,10)] # [*range(1,10)] unpacks the range into the list [1, 2, ..., 9]
             ,'min_samples_leaf': [*range(1,50,5)]
             ,'min_impurity_decrease': [*np.linspace(0,0.5,20)]
}

clf = DecisionTreeClassifier(random_state=20)
GS = GridSearchCV(clf,parameters,cv=10) #Grid search bundles steps such as cross validation
GS.fit(Xtrain,Ytrain)

Optimal parameters

GS.best_params_

The evaluation score under the current optimal parameters

GS.best_score_
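Because GridSearchCV refits the best combination on the full training data by default (refit=True), the winning model can be scored directly on the held-out test set (a small sketch):

GS.best_estimator_.score(Xtest,Ytest)   # generalization check on data the search never saw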

Note

The optimum that grid search returns is the best combination within the grid you supply: every parameter you add is guaranteed to appear in the result, because grid search never drops a parameter from the search. It is therefore possible that removing some parameters yields a higher evaluation score, so you need to keep experimenting and testing to find the best evaluation value, as the sketch below illustrates.
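For example, a search over a reduced grid (a hypothetical sketch; the chosen subset is only illustrative) may well score higher than the full grid above:

small_parameters = {'criterion': ('gini','entropy')
                   ,'max_depth': [*range(1,10)]
}
GS_small = GridSearchCV(DecisionTreeClassifier(random_state=20),small_parameters,cv=10)
GS_small.fit(Xtrain,Ytrain)
GS_small.best_score_   # compare with GS.best_score_ from the full grid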

For learning about decision trees, see the following link:
Introduction to decision trees, which contains further decision-tree learning links
