Supervised classification with breast cancer data

Article directory

Data exploration
Data preprocessing
Model building
Display of forecast results
Conclusion

Data exploration

The original data can be downloaded from: Portal

According to the description on the website, the original data contains 699 samples, each with 11 columns: one ID column used for retrieval, nine columns of tumor-related medical features, and a final column giving the tumor type. Each of the nine medical features is quantified as a number between 1 and 10, and the tumor type is coded as 2 for benign and 4 for malignant. The description also states that the data contains missing values. Missing values are common in real-world data and are a problem that machine learning tasks cannot avoid.

Data preprocessing

The following code is used to preprocess the original tumor data:

# Import the pandas and numpy toolkits.
import pandas as pd
import numpy as np
# Create the list of column names.
column_names = ['Sample code number', 'Clump Thickness',
                'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                'Marginal Adhesion', 'Single Epithelial CellSize',
                'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli',
                'Mitoses', 'Class']
# Use pandas.read_csv to read the data (here from a local copy of the downloaded file).
data = pd.read_csv('breast-cancer-wisconsin.data', names=column_names)
# Replace '?' with the standard missing-value marker.
data = data.replace(to_replace='?', value=np.nan)
# Drop every row that has a missing value in any column.
data = data.dropna(how='any')
# Output the number of rows and columns.
data.shape

(683, 11)

After cleaning, 683 samples without missing values remain. Each sample has nine feature dimensions, such as clump thickness and uniformity of cell size and shape, and each feature is quantified as a value between 1 and 10.

print(data.head())

   Sample code number  Clump Thickness  Uniformity of Cell Size  \
0             1000025                5                        1   
1             1002945                5                        4   
2             1015425                3                        1   
3             1016277                6                        8   
4             1017023                4                        1   

   Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial CellSize  \
0                         1                  1                           2   
1                         4                  5                           7   
2                         1                  1                           2   
3                         8                  1                           3   
4                         1                  3                           2   

  Bare Nuclei  Bland Chromatin  Normal Nucleoli  Mitoses  Class  
0           1                3                1        1      2  
1          10                3                2        1      2  
2           2                3                1        1      2  
3           4                3                7        1      2  
4           1                3                1        1      2
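
One detail is worth checking after the cleaning step. Because the raw file marks missing entries with '?', pandas reads a column that contains it as strings, and dropping the affected rows does not change that. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from the data set) of converting such a column to numbers with pd.to_numeric:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw file: '?' marks a missing value.
toy = pd.DataFrame({
    'Clump Thickness': [5, 3, 8, 1],
    'Bare Nuclei': ['1', '?', '10', '2'],
    'Class': [2, 2, 4, 2],
})

# Same cleaning steps as in the preprocessing code.
toy = toy.replace(to_replace='?', value=np.nan).dropna(how='any')

# The column still holds strings, so convert it explicitly.
toy['Bare Nuclei'] = pd.to_numeric(toy['Bare Nuclei'])

n_rows = toy.shape[0]
bare_nuclei_values = toy['Bare Nuclei'].tolist()
print(n_rows, bare_nuclei_values)
```

On the real data the same call could be applied to data['Bare Nuclei'] before splitting, although scikit-learn's estimators generally convert such values to floats on their own.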

Since the original data does not provide separate test samples for evaluating model performance, the data must be split: 25% of the data is used as the test set, and the remaining 75% is used for training.

# Use train_test_split to split the data. It lives in sklearn.model_selection
# (scikit-learn 0.18 and later); the old sklearn.cross_validation module was removed in 0.20.
from sklearn.model_selection import train_test_split
# Randomly sample 25% of the data for testing; the remaining 75% forms the training set.
x_train, x_test, y_train, y_test = train_test_split(data[column_names[1:10]],
                                                    data[column_names[10]],
                                                    test_size=0.25, random_state=33)

#Check the number and category distribution of training samples.
y_train.value_counts()

2    344
4    168
Name: Class, dtype: int64

#Check the number and category distribution of test samples.
y_test.value_counts()

2    100
4     71
Name: Class, dtype: int64

To sum up, we use 512 training samples (344 benign, 168 malignant) and 171 test samples (100 benign, 71 malignant).
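
These counts can be checked with a little arithmetic. The sketch below uses plain Python and assumes that a fractional test_size is rounded up to a whole number of test samples, with the rest going to training, which is how train_test_split behaves; the variable names are ours:

```python
import math

n_samples = 683
test_size = 0.25

# Assumption: the test share is rounded up and the remainder is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Class counts reported above: (benign, malignant).
train_counts = (344, 168)
test_counts = (100, 71)
malignant_share_train = train_counts[1] / sum(train_counts)
malignant_share_test = test_counts[1] / sum(test_counts)

print(n_train, n_test)
print(round(malignant_share_train, 3), round(malignant_share_test, 3))
```

The malignant share differs between the two sets (roughly 33% versus 42%) because the split is purely random. If equal proportions matter, train_test_split accepts a stratify argument that can be set to the label column.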

Model building

Next, we use logistic regression and stochastic gradient descent parameter estimation to learn from the training data processed above, and then make predictions from the features of the test samples.
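
To make the idea of stochastic gradient parameter estimation concrete, here is a minimal sketch in plain Python of logistic regression fitted one sample at a time. It is not scikit-learn's implementation; the function names, learning rate, number of passes and toy data are all assumptions chosen for illustration:

```python
import math

def sigmoid(z):
    # Logistic function: maps a real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(samples, labels, lr=0.1, epochs=200):
    # samples: list of feature lists; labels: 0 or 1.
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the score
            # Update the parameters using this single sample only.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(score) >= 0.5 else 0

# Hypothetical toy data: one standardized feature, label 1 for larger values.
toy_x = [[-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5]]
toy_y = [0, 0, 0, 1, 1, 1]
w, b = sgd_logistic(toy_x, toy_y)
toy_pred = [predict(w, b, x) for x in toy_x]
print(toy_pred)
```

Each update looks at a single sample, which keeps every step cheap but makes the estimate noisier than a solver that works on the whole training set at once.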

# Import StandardScaler from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler
# Import LogisticRegression and SGDClassifier from sklearn.linear_model.
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
# Standardize the data so that each feature has mean 0 and variance 1, so that features with large values do not dominate the predictions.
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)
# Initialize LogisticRegression and SGDClassifier.
# Note: with its default loss='hinge', SGDClassifier fits a linear SVM; a logistic loss
# has to be requested via the loss parameter ('log' in older releases, 'log_loss' from scikit-learn 1.1).
lr = LogisticRegression()
sgdc = SGDClassifier()
# Call the fit function of LogisticRegression to train the model parameters.
lr.fit(x_train, y_train)
# Use the trained model lr to predict on x_test; the results are stored in the variable lr_y_predict.
lr_y_predict = lr.predict(x_test)
# Call the fit function of SGDClassifier to train the model parameters.
sgdc.fit(x_train, y_train)
# Use the trained model sgdc to predict on x_test; the results are stored in the variable sgdc_y_predict.
sgdc_y_predict = sgdc.predict(x_test)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
  "and default tol will be 1e-3." % type(self), FutureWarning)

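The standardization step in the code above fits the scaler on the training set and reuses the same statistics for the test set. A minimal sketch of that idea for a single feature, in plain Python with made-up values (the helper names are hypothetical, and the population standard deviation is used, as StandardScaler does):

```python
import statistics

def fit_scaler(column):
    # Mean and population standard deviation of the training values.
    return statistics.mean(column), statistics.pstdev(column)

def apply_scaler(column, mean, std):
    # Shift and rescale using statistics computed elsewhere.
    return [(v - mean) / std for v in column]

# Hypothetical values of one feature on the 1-10 scale.
train_col = [1, 2, 3, 4, 10]
test_col = [4, 7]

mean, std = fit_scaler(train_col)
train_scaled = apply_scaler(train_col, mean, std)
test_scaled = apply_scaler(test_col, mean, std)  # reuse the training statistics

print(statistics.mean(train_scaled), statistics.pstdev(train_scaled))
print(test_scaled)
```

Only the training column ends up with mean 0 and variance 1 exactly; the test column is simply mapped onto the same scale, so no information from the test set leaks into the fit.
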
Display of forecast results

Logistic regression and SGDClassifier each predict the 171 test samples. Because the correct labels of these 171 samples are recorded in the variable y_test, it is straightforward to compare the predictions with the true labels and compute the percentage of test samples predicted correctly, that is, the accuracy.
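
As a small illustration of what the accuracy measures, here is a sketch in plain Python on made-up labels (the lists below are hypothetical and only reuse the data set's coding of 2 for benign and 4 for malignant):

```python
def accuracy(y_true, y_pred):
    # Fraction of samples whose predicted label equals the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels: 2 = benign, 4 = malignant.
y_true = [2, 2, 4, 4, 2, 4, 2, 2]
y_pred = [2, 2, 4, 2, 2, 4, 2, 4]

acc = accuracy(y_true, y_pred)
print(acc)  # 6 of the 8 predictions match
```

The score function used in the code below computes the same quantity for a fitted classifier.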

# Import classification_report from sklearn.metrics.
from sklearn.metrics import classification_report
# Use the score function of the logistic regression model to get its accuracy on the test set.
print('Accuracy of LR Classifier:', lr.score(x_test, y_test))
# Use classification_report to get the other three metrics (precision, recall, F1) for LogisticRegression.
print(classification_report(y_test, lr_y_predict, target_names=['Benign', 'Malignant']))

Accuracy of LR Classifier: 0.9883040935672515
             precision    recall  f1-score   support

     Benign       0.99      0.99      0.99       100
  Malignant       0.99      0.99      0.99        71

avg / total       0.99      0.99      0.99       171

# Use the score function of the stochastic gradient descent model to get its accuracy on the test set.
print('Accuracy of SGD Classifier:', sgdc.score(x_test, y_test))
# Use classification_report to get the other three metrics (precision, recall, F1) for SGDClassifier.
print(classification_report(y_test, sgdc_y_predict, target_names=['Benign', 'Malignant']))

Accuracy of SGD Classifier: 0.9824561403508771
             precision    recall  f1-score   support

     Benign       0.98      0.99      0.99       100
  Malignant       0.99      0.97      0.98        71

avg / total       0.98      0.98      0.98       171
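
The three metrics in the report can be derived from counts of correct and incorrect predictions per class. The sketch below uses plain Python; the counts are an assumption, chosen only because they are consistent with the rounded SGDClassifier figures above, and are not taken from the original output:

```python
def prf(tp, fp, fn):
    # Precision, recall and F1 for one class from its confusion counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts, consistent with the rounded report (100 benign, 71 malignant).
benign_correct, benign_as_malignant = 99, 1
malignant_correct, malignant_as_benign = 69, 2

benign = prf(tp=benign_correct, fp=malignant_as_benign, fn=benign_as_malignant)
malignant = prf(tp=malignant_correct, fp=benign_as_malignant, fn=malignant_as_benign)
overall_accuracy = (benign_correct + malignant_correct) / 171

print([round(v, 2) for v in benign])
print([round(v, 2) for v in malignant])
print(round(overall_accuracy, 4))
```

For a medical task like this one, recall on the malignant class is usually the figure to watch, since a missed malignant tumor is the costlier error.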

Conclusion

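Comparing the two output reports above, we find that logistic regression is slightly more accurate on the test set than SGDClassifier (about 0.988 versus 0.982, a difference of a single test sample out of 171). A likely reason is that scikit-learn fits LogisticRegression with an iterative solver run to convergence on the full training set, whereas SGDClassifier estimates its parameters by stochastic gradient descent, updating on one sample at a time for a limited number of passes (max_iter=5 by default in this version, as the warning above shows).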

Juvenile Ji

Posted by phpPete on Mon, 10 Feb 2020 23:36:08 -0800
