Some basic operations of AI algorithms based on sklearn
Some related libraries in sklearn
Import the libraries for each of these algorithms separately
import pandas as pd  # pandas, used to read csv data
from sklearn.model_selection import train_test_split  # module for splitting the data set
from sklearn.model_selection import GridSearchCV  # module for cross-validated grid search
from sklearn.neighbors import KNeighborsClassifier  # kNN algorithm
from sklearn.linear_model import LinearRegression  # linear regression algorithm
from sklearn.linear_model import LogisticRegression  # logistic regression algorithm
from sklearn.svm import SVC  # SVC algorithm (support vector classifier)
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler  # feature preprocessing: one-hot encoding and min-max normalization
import matplotlib.pyplot as plt  # plotting module
import numpy as np
import warnings  # module for ignoring warnings; warnings.filterwarnings("ignore") suppresses them
Basic idea:
Define the feature and target columns -> read the whole data set -> read the feature data X and label data y -> split the data set (training set, test set) -> declare the algorithm model -> train, then test and compute accuracy.
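A compact sketch of the whole flow, assuming a hypothetical file fruit_data.csv with the columns used below (each step is broken down in the sections that follow):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

fruit_label = {'apple': 0, 'mandarin': 1, 'orange': 2, 'lemon': 3}
feature_data = ['mass', 'width', 'height', 'color_score']

data_fruit = pd.read_csv('fruit_data.csv')  # hypothetical file path
data_fruit['Label'] = data_fruit['fruit_name'].map(fruit_label)  # text labels -> numbers
X = data_fruit[feature_data].values
y = data_fruit['Label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=10)
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)  # train
print(knn_model.score(X_test, y_test))  # test accuracy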
Define the feature and target columns
fruit_label = {'apple': 0, 'mandarin': 1, 'orange': 2, 'lemon': 3}  # a dictionary mapping each fruit name to a numeric label; needed for supervised classification
feature_data = ['mass', 'width', 'height', 'color_score']  # a list naming the feature columns
Read the entire data set
data_fruit = pd.read_csv(file_path)  # read the whole CSV file into a DataFrame (file_path is the path to the data file)
Map the text labels to numbers to generate a label column (the original labels are text; supervised classification with kNN needs numeric labels)
data_fruit['Label'] = data_fruit['fruit_name'].map(fruit_label)  # generate a new column named 'Label' whose values are the 0-3 codes defined above, mapped from the fruit_name column. Needed for supervised classification; regression does not require this mapping.
Read the features and labels in the data set separately
X = data_fruit[feature_data].values  # read the columns listed in feature_data as the X array
y = data_fruit['Label'].values  # read the label column as the y array
# data[key] is a pandas Series; data[key].values is a numpy ndarray
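A quick check of the difference mentioned in the comment (assuming data_fruit has been loaded as above):

print(type(data_fruit['Label']))         # <class 'pandas.core.series.Series'>
print(type(data_fruit['Label'].values))  # <class 'numpy.ndarray'>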
Split the data set (training set, test set)
# Split the whole data set into four parts: training and test features, plus training and test labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=10)
# X and y are split so that the test set is one third of the data, chosen at random
# (otherwise, with the original data sorted by label, some labels might never appear in training)
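When the classes are unbalanced, train_test_split also accepts a stratify argument that keeps the class proportions equal in both parts; a minimal sketch:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=10,
    stratify=y)  # each class appears in the same proportion in the training and test sets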
Declare the algorithm model, train, and compute accuracy
Declaration, training, and accuracy computation for the kNN algorithm
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)  # train on the training set
accuracy = knn_model.score(X_test, y_test)  # test on the test set and compute accuracy
Declaration, training, and R² computation for the linear regression algorithm
linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train, y_train)
r2_score = linear_reg_model.score(X_test, y_test)  # for a regressor, score() returns the R² value, not accuracy
# X_test[i, :] is all the data of row i in the test set, i.e. a single sample
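To use a single sample as the comment describes, the row has to be reshaped into a 2-D array, because predict() expects a matrix of samples; a small sketch (i is an arbitrary row index):

i = 0  # index of the test-set row to predict
single_sample = X_test[i, :].reshape(1, -1)  # one row, all feature columns
y_pred = linear_reg_model.predict(single_sample)
print('Predicted value for test sample {}: {}'.format(i, y_pred[0]))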
Declaration, training, and accuracy computation for the logistic regression algorithm
LogisticRegression_model = LogisticRegression()
LogisticRegression_model.fit(X_train, y_train)
accuracy = LogisticRegression_model.score(X_test, y_test)  # for a classifier, score() returns accuracy
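Besides the accuracy from score(), a trained logistic regression can also output per-class probabilities via predict_proba(); a brief sketch:

proba = LogisticRegression_model.predict_proba(X_test)  # one row per sample, one column per class
print(proba[0])  # predicted probability of each class for the first test sample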
Declaration, training, and accuracy computation for the SVM algorithm
SVM_model = SVC()
SVM_model.fit(X_train, y_train)
accuracy = SVM_model.score(X_test, y_test)
Finding the Optimal Hyperparameters (k in kNN, C in logistic regression, C in SVM)
Idea 1 for determining k in kNN:
Define a list of candidate k values, then loop over it, repeating the basic operations above (read the features and labels, split the data set, train and test) for each k:

def round_function(fruit_data, k_set):
    X = fruit_data[feature_data].values
    y = fruit_data['Label'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=10)
    knn_model = KNeighborsClassifier(n_neighbors=k_set)  # pass in the k value to test; the default is 5
    knn_model.fit(X_train, y_train)
    accuracy = knn_model.score(X_test, y_test)
    print('Accuracy for k={} is {:.2f}%'.format(k_set, accuracy * 100))

k_sets = [3, 5, 8]
for k_set in k_sets:
    round_function(fruit_data, k_set)  # fruit_data is needed inside, so the function takes it as a parameter
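To compare the candidate k values visually, the accuracies can also be collected and plotted with the matplotlib module imported above; a sketch reusing the train/test split from earlier:

accuracies = []
for k_set in k_sets:
    knn_model = KNeighborsClassifier(n_neighbors=k_set)
    knn_model.fit(X_train, y_train)
    accuracies.append(knn_model.score(X_test, y_test))

plt.figure()
plt.plot(k_sets, accuracies, marker='o')  # accuracy as a function of k
plt.xlabel('k')
plt.ylabel('accuracy')
plt.show()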
Cross-validation (for example, grid search)
Cross-validation: during tuning (the tests that determine the optimal hyperparameters), the training set is divided into N parts; each part in turn serves as the validation set, an accuracy is computed for each, and the mean of the N accuracies is taken. N is the number of folds.
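sklearn implements exactly this procedure in cross_val_score; a minimal sketch of 5-fold cross-validation for kNN:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5)  # N = 5 folds
print(scores)         # the accuracy on each fold
print(scores.mean())  # the mean of the N accuracies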
For a grid search over several hyperparameters at once, sklearn provides GridSearchCV(). For example, in the case below, the two kNN parameters n_neighbors (k) and p need to be tuned.
Steps: define a dictionary of models and the parameter grids to validate -> loop over it, train, and print the accuracy.

model_dict = {
    'KNN': (KNeighborsClassifier(), {'n_neighbors': [3, 5, 7], 'p': [1, 2]}),
    'Logistic': (LogisticRegression(), {'C': [1e-2, 1, 1e2]}),
    'SVM': (SVC(), {'C': [1e-2, 1, 1e2]})
}

# loop and compare
for model_name, (model, model_param) in model_dict.items():
    # train the model and select the best parameters
    clf = GridSearchCV(estimator=model, param_grid=model_param, cv=5)  # pass in the model, the parameters to tune, and 5-fold cross-validation
    clf.fit(X_train, y_train)
    best_model = clf.best_estimator_
    # compute accuracy
    acc = best_model.score(X_test, y_test)
    # print for comparison
    print('{}: the best model is {}, accuracy is {:.2f}%'.format(model_name, best_model, acc * 100))
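After fitting, GridSearchCV also exposes the chosen parameter values and the mean cross-validation score directly, which is often more readable than printing the whole estimator:

print(clf.best_params_)  # e.g. {'n_neighbors': 5, 'p': 2} for the KNN entry (actual values depend on the data)
print(clf.best_score_)   # mean cross-validation accuracy achieved by the best parameters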
Finding the Optimal Algorithm
Define an algorithm dictionary and traverse it
# Define an algorithm dictionary matching each algorithm name with its model.
model_dict = {
    'KNN': KNeighborsClassifier(n_neighbors=7),
    'Logistic': LogisticRegression(C=1),
    'SVM': SVC(C=1)
}

# loop and compare
for model_name, model in model_dict.items():
    # train the model
    model.fit(X_train, y_train)
    # compute accuracy
    acc = model.score(X_test, y_test)
    # print for comparison
    print('{}: accuracy is {:.2f}%'.format(model_name, acc * 100))
Note
items() is used when traversing a dictionary to extract each key and its corresponding value.
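A minimal illustration:

d = {'a': 1, 'b': 2}
for key, value in d.items():
    print(key, value)  # prints "a 1", then "b 2"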
Visualization
Use the matplotlib plotting module and define a plotting function. The basic process is: create a figure instance (plt.figure()) -> draw -> display (plt.show()).
def plot_fitting_line(linear_reg_model, X, y, feat):
    """Plot the fitted linear regression line."""
    w = linear_reg_model.coef_  # coef_ gives the weight
    b = linear_reg_model.intercept_  # intercept_ gives the bias term
    plt.figure()  # create a figure instance, i.e. the canvas
    # scatter plot of the real values
    plt.scatter(X, y, alpha=0.5)  # plot the real values against the feature on the x-axis, with 50% transparency
    # fitted straight line
    plt.plot(X, w * X + b, c='red')  # plot the predicted values against the feature as a red line
    plt.title(feat)  # title of the figure
    plt.show()  # display the figure
The plotting function takes X, y, and the other required parameters as arguments.
Note
Any parameter used inside the function must be passed in; all of the parameters passed in above (the model, X, y, feat) are used in the function body.
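A possible usage sketch: since the line w * X + b is drawn in two dimensions, fit a separate one-feature regression per column (assuming data_fruit and feature_data from the fruit example above):

for feat in feature_data:
    X_feat = data_fruit[feat].values.reshape(-1, 1)  # a single feature as a column vector
    y_feat = data_fruit['Label'].values
    model = LinearRegression()
    model.fit(X_feat, y_feat)
    plot_fitting_line(model, X_feat, y_feat, feat)  # one figure per feature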
Feature preprocessing
Features can be divided into numerical features, ordinal features, and categorical features (such as gender). Numerical features can be normalized; categorical features can be one-hot encoded.
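A tiny demo of what one-hot encoding produces for a categorical column (the 'M'/'F' values are made up for illustration):

enc = OneHotEncoder(sparse=False)  # in sklearn >= 1.2 this argument is named sparse_output
print(enc.fit_transform([['M'], ['F'], ['M']]))
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]   -> categories are sorted, so 'F' maps to column 0 and 'M' to column 1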
Steps: declare the feature types -> preprocess -> use the processed features for training and testing in the main function.
# feature columns used
NUM_FEAT_COLS = ['AGE', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6']  # numerical features
CAT_FEAT_COLS = ['SEX']  # categorical features

# define the preprocessing function, which takes the training and test feature sets
def process_features(X_train, X_test):
    # 1. one-hot encoding for the categorical features
    encoder = OneHotEncoder(sparse=False)  # in sklearn >= 1.2 this argument is named sparse_output
    encoded_tr_feat = encoder.fit_transform(X_train[CAT_FEAT_COLS])
    encoded_te_feat = encoder.transform(X_test[CAT_FEAT_COLS])
    # Because features are selected by column name here, do not call .values when splitting X and y
    # in the main function (X must stay a DataFrame, not a numpy array). Note that the encoder is
    # fit on the training set only and then reused to transform the test set.
    # 2. min-max normalization of the numerical features
    scaler = MinMaxScaler()
    scaled_tr_feat = scaler.fit_transform(X_train[NUM_FEAT_COLS])
    scaled_te_feat = scaler.transform(X_test[NUM_FEAT_COLS])
    # 3. merge the features
    X_train_proc = np.hstack((encoded_tr_feat, scaled_tr_feat))
    X_test_proc = np.hstack((encoded_te_feat, scaled_te_feat))
    return X_train_proc, X_test_proc
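A hedged usage sketch of this function in the main flow (assuming a DataFrame named data containing the columns above plus a target column, here called 'Y' as an assumption):

FEAT_COLS = CAT_FEAT_COLS + NUM_FEAT_COLS
X = data[FEAT_COLS]  # keep X as a DataFrame (no .values) so process_features can select columns by name
y = data['Y'].values  # 'Y' is an assumed target column name
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=10)
X_train_proc, X_test_proc = process_features(X_train, X_test)
model = LinearRegression()
model.fit(X_train_proc, y_train)
print(model.score(X_test_proc, y_test))  # R² on the processed test features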