A summary of basic operations for linear regression, logistic regression, and SVM with sklearn


Some basic operations of machine-learning algorithms based on sklearn

Related modules in sklearn

Import the modules for each of the related algorithms:

import pandas as pd #Import pandas for reading CSV data
from sklearn.model_selection import train_test_split #Module for splitting the dataset
from sklearn.model_selection import GridSearchCV #Module for cross-validated grid search
from sklearn.neighbors import KNeighborsClassifier #KNN algorithm module
from sklearn.linear_model import LinearRegression #Linear regression algorithm module
from sklearn.linear_model import LogisticRegression #Logistic regression algorithm module
from sklearn.svm import SVC #SVC algorithm module
import matplotlib.pyplot as plt #Plotting module
import warnings #Module for ignoring warnings; the statement warnings.filterwarnings("ignore") suppresses them
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler #Feature preprocessing module: one-hot encoding and min-max normalization

Basic idea:

Define the feature and label columns -> read the whole dataset -> extract the feature matrix X and the label vector y -> split the dataset (training set, test set) -> declare the algorithm model -> train, then test and compute accuracy
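Condensed into one runnable sketch (it uses the fruit dataset and names introduced in the sections below, together with the imports above; file_path is a placeholder for the CSV path):

fruit_label = {'apple': 0, 'mandarin': 1, 'orange': 2, 'lemon': 3}
feature_data = ['mass', 'width', 'height', 'color_score']

data_fruit = pd.read_csv(file_path)  #file_path is a placeholder
data_fruit['Label'] = data_fruit['fruit_name'].map(fruit_label)

X = data_fruit[feature_data].values
y = data_fruit['Label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=10)

knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
print(knn_model.score(X_test, y_test))  #accuracy on the test set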

Defining the feature and label columns

fruit_label = { 'apple': 0,'mandarin': 1,'orange': 2,'lemon': 3}  #A dictionary mapping fruit names to numeric labels; needed for supervised classification.

feature_data=['mass','width','height','color_score'] #A list naming the feature columns

Read the entire data set

 data_fruit = pd.read_csv(file_path)  #Read the whole CSV file; file_path is a placeholder for the path to your data

Generate the label column by mapping (the original result label is text; this produces numbers, which supervised classification with KNN needs)

  data_fruit['Label']=data_fruit['fruit_name'].map(fruit_label) 
  #Generates a new column named 'Label' whose values are the numbers 0-3 mapped from the fruit_name column (the 0-3 mapping is defined above).
  #This is needed for supervised classification; it is not necessary for supervised regression.

Read the features and labels in the data set separately

 X=data_fruit[feature_data].values #Read the previously defined feature columns as the X array
 y=data_fruit['Label'].values #Read the label column as the y array
 
 #data[key] is a pandas Series; data[key].values is a numpy ndarray
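A quick way to confirm the type difference:

print(type(data_fruit['Label']))         # <class 'pandas.core.series.Series'>
print(type(data_fruit['Label'].values))  # <class 'numpy.ndarray'>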

Splitting the dataset (training set, test set)

#train_test_split divides the whole dataset into four parts: training and test sets of the features, and training and test sets of the result labels.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 / 3, random_state=10)
 
 #The statement above splits X and y, with the test set taking one third of the data, partitioned randomly (this prevents some label values from never being trained on when the original data is ordered by class)
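If the classes are unevenly represented, train_test_split also accepts a stratify argument that keeps the class proportions the same in both splits; a variant of the call above:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 / 3, random_state=10, stratify=y)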

Declaring the Algorithm Model, Training, and Computing Accuracy

Declaration, Training, and Accuracy Computation for the KNN Algorithm

knn_model=KNeighborsClassifier()
knn_model.fit(X_train,y_train) #Fit the model on the training set
accuracy=knn_model.score(X_test,y_test) #Test on the test set and compute the accuracy
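score() predicts on X_test internally and compares against y_test; to inspect individual predictions, call predict directly:

y_pred = knn_model.predict(X_test)  #predicted label (0-3) for each test sample
print(y_pred[:5])                   #look at the first few predictions
print(y_test[:5])                   #compare with the true labels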

Declaration, Training, and Accuracy Computation for the Linear Regression Algorithm

  linear_reg_model = LinearRegression()
  linear_reg_model.fit(X_train, y_train)
  r2_score = linear_reg_model.score(X_test, y_test) #For regressors, score() returns the R^2 coefficient of determination
  
  #Value of a single sample: X_test[i,:] is all the data in row i of the test set
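  
  For example, predicting a single sample (sklearn expects a 2-D array, so the row has to be reshaped; i is any row index):
  
  i = 0                                  #index of the sample to inspect
  sample = X_test[i, :].reshape(1, -1)   #reshape the 1-D row into shape (1, n_features)
  print(linear_reg_model.predict(sample)[0], y_test[i])  #predicted value vs. true value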
  

Declaration, Training, and Accuracy Computation for the Logistic Regression Algorithm

  LogisticRegression_model=LogisticRegression()
  LogisticRegression_model.fit(X_train, y_train)
  accuracy=LogisticRegression_model.score(X_test, y_test) #For classifiers, score() returns the mean accuracy, not R^2
 
 

Declaration, Training, and Accuracy Computation for the SVM Algorithm

  SVM_model=SVC()
  SVM_model.fit(X_train, y_train)
  accuracy= SVM_model.score(X_test, y_test)

Finding the Optimal Hyperparameters (K in KNN, C in logistic regression, C in SVM)

Idea 1 for determining K in KNN:

 Define a list of K values holding the candidates you want to test, then loop over them and repeat the basic operations above (read the features and labels from the dataset, split the dataset, train and test). Note the function must be defined before it is called:
 
 def round_function(fruit_data,k_set):
    X=fruit_data[feature_data].values 
    y=fruit_data['Label'].values
    
    X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=1 / 3, random_state=10)
    
    knn_model=KNeighborsClassifier(n_neighbors=k_set) #Pass in the K value to test; the default is 5
    knn_model.fit(X_train,y_train)
    accuracy=knn_model.score(X_test,y_test)

    print('When K is {}, the accuracy is {:.2f}%'.format(k_set,accuracy*100))

 k_sets=[3,5,8]
 
 for k_set in k_sets:
    round_function(fruit_data,k_set) #fruit_data is needed inside, so it is passed in as a parameter

Cross-validation (for example, grid search)

Cross-validation: in hyperparameter tuning (the tests that determine the optimal hyperparameters), the training set is divided into N folds; each fold in turn serves as the validation set while the others are used for training, the accuracy of each round is computed, and the mean of the N accuracies is taken. N is the number of folds.
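For a single model with a fixed hyperparameter, sklearn can run this directly with cross_val_score (a short sketch, reusing the X_train/y_train arrays from above):

from sklearn.model_selection import cross_val_score

knn_model = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn_model, X_train, y_train, cv=5)  #5-fold cross-validation accuracies
print('fold accuracies: {}, mean: {:.2f}%'.format(scores, scores.mean() * 100))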

For a grid search over several hyperparameters at once, sklearn provides GridSearchCV(). For example, in the case below, the two KNN parameters n_neighbors (k) and p need to be tuned.

Steps: define a dictionary of the models and parameter grids to validate -> loop over it, train, and print the accuracy
model_dict={
    'KNN': (KNeighborsClassifier(), {'n_neighbors':[3,5,7], 'p':[1,2]}),
    'Logistic': (LogisticRegression(), {'C':[1e-2,1,1e2]}),
    'SVM': (SVC(), {'C':[1e-2,1,1e2]})
   }
 
#Loop over each model and compare
for model_name,(model,model_param) in model_dict.items():

    #Training model to select the best parameters
    clf=GridSearchCV(estimator=model, param_grid=model_param, cv=5) #Pass in the algorithm model and the candidate parameters; cv=5 sets 5-fold cross-validation
    clf.fit(X_train,y_train)
    best_model=clf.best_estimator_
    
    #Accuracy of calculation
    acc=best_model.score(X_test,y_test)
    
    #Print comparison
    print('{}: the best model found is {}, with accuracy {:.2f}%'.format(model_name,best_model,acc*100))
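
    #Inside the loop, besides best_estimator_, the fitted GridSearchCV object also exposes
    #the winning parameter combination and its mean cross-validated score:
    print(clf.best_params_)  #the best parameter combination from the grid, as a dict
    print(clf.best_score_)   #mean cross-validated accuracy achieved with those parameters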

Finding the Optimal Algorithm

Define an algorithm dictionary and traverse it

   #Define an algorithm dictionary and match the algorithm name with the model.
   model_dict={'KNN': KNeighborsClassifier(n_neighbors=7), 'Logistic':LogisticRegression(C=1), 'SVM':SVC(C=1) }
   
   #Loop over each model and compare
   for model_name,model in model_dict.items():
    #Training model
    model.fit(X_train,y_train)
    #Accuracy of calculation
    acc=model.score(X_test,y_test)
    #Print comparison
    print('{}: the accuracy is {:.2f}%'.format(model_name,acc*100))

Note

items() is used when traversing a dictionary to extract each key and its corresponding value
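A minimal illustration:

for key, value in {'KNN': 'classifier', 'SVM': 'classifier'}.items():
    print(key, value)  #each iteration yields one (key, value) pair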

Visualization

Call the matplotlib plotting module and define a drawing method. The basic plotting flow is: create a figure instance (plt.figure()) -> draw -> display (plt.show())

 def plot_fitting_line(linear_reg_model, X, y, feat):
    """
    Draw the fitted linear regression line
    """
    w = linear_reg_model.coef_  #coef_ gets the weight
    b = linear_reg_model.intercept_  #intercept_ gets the bias term

    plt.figure() #Create a figure instance, equivalent to a canvas
    
    # Scatter plot of the real values
    plt.scatter(X, y, alpha=0.5) #Plot the real values against the feature on the x-axis, at 50% transparency

    # Fitted straight line
    plt.plot(X, w * X + b, c='red') #Plot the predicted values against the feature as a red line
    plt.title(feat) #Title of the image: the feature name passed in
    plt.show() #Display the image

When calling the plotting function, pass in X, y, and the other required parameters.

Note

Any value the function needs is passed in as a parameter; the arguments passed in above are exactly the ones used inside the function.
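Since the line w * X + b only makes sense for a single feature, a plausible way to call the function is to train one single-feature model per column (a sketch, reusing the fruit X, y, and feature_data defined above):

for i, feat in enumerate(feature_data):
    X_feat = X[:, i].reshape(-1, 1)  #one feature column as a 2-D array
    single_feat_model = LinearRegression()
    single_feat_model.fit(X_feat, y)
    plot_fitting_line(single_feat_model, X_feat, y, feat)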

Feature preprocessing

Features can be divided into numeric features, ordinal features, and categorical features (such as gender). Numeric features can be normalized; categorical features can be one-hot encoded.

Steps: declare the feature types -> preprocess -> use the processed features for training and testing in the main function.

# Feature columns used
NUM_FEAT_COLS = ['AGE','BMI', 'BP', 'S1', 'S2','S3','S4','S5','S6'] #Numeric features
CAT_FEAT_COLS = ['SEX'] #Categorical features

#Define the preprocessing method; pass in the training-set and test-set features
def process_features(X_train, X_test):

    # 1. One-hot encode the categorical features
    encoder = OneHotEncoder(sparse=False)  #in newer sklearn versions this parameter is named sparse_output
    encoded_tr_feat = encoder.fit_transform(X_train[CAT_FEAT_COLS])
    encoded_te_feat = encoder.transform(X_test[CAT_FEAT_COLS])
    #Because the feature columns are selected by name, X_train and X_test must stay DataFrames: don't call .values when splitting X and y in the main function. Note that the training set uses fit_transform while the test set only uses transform.

    # 2. Min-max normalization of the numeric features
    scaler = MinMaxScaler()
    scaled_tr_feat = scaler.fit_transform(X_train[NUM_FEAT_COLS])
    scaled_te_feat = scaler.transform(X_test[NUM_FEAT_COLS])

    # 3. Merge the features
    X_train_proc = np.hstack((encoded_tr_feat, scaled_tr_feat))
    X_test_proc = np.hstack((encoded_te_feat, scaled_te_feat))

    return X_train_proc, X_test_proc
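
A hedged sketch of how the main function might call this (the DataFrame variable data and the target column name 'Y' are placeholders; the point is that X stays a DataFrame so process_features can select columns by name):

all_feat_cols = CAT_FEAT_COLS + NUM_FEAT_COLS
X = data[all_feat_cols]  #keep X as a DataFrame; no .values here
y = data['Y'].values     #'Y' is a placeholder target column name

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=10)
X_train_proc, X_test_proc = process_features(X_train, X_test)

model = LinearRegression()
model.fit(X_train_proc, y_train)
print(model.score(X_test_proc, y_test))  #R^2 on the processed test features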
