1, Introduction to LDA and SVM
- Linear Discriminant Analysis (LDA) is a classical supervised dimensionality reduction method. Its core idea is to project data from a high-dimensional space into a lower-dimensional space such that, after projection, the variance within each class is small and the distance between class means is large. In other words, once high-dimensional samples are projected into the low-dimensional space, samples of the same class cluster together while samples of different classes stay far apart.
- In machine learning, a support vector machine (SVM) is a supervised learning model, with associated learning algorithms, used for classification and regression analysis. Given a set of training examples, each labeled as belonging to one of two categories, the SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. The SVM model represents the examples as points in space, mapped so that the examples of the two categories are separated by as wide a gap as possible. New examples are then mapped into the same space and assigned to a category depending on which side of the gap they fall on. A short code sketch following this introduction illustrates both ideas.
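To make both ideas concrete, here is a minimal sketch (my own toy example with scikit-learn, not part of the original post): LDA projects two-class 2-D data onto a single discriminant axis, while a linear SVC fitted on the same data exposes the support vectors that fix its maximum-margin boundary.

```python
# Minimal sketch (assumed toy data): LDA projection vs. SVM support vectors
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)

# LDA: project the 2-D points onto one discriminant axis;
# same-class points cluster together, class means are pushed apart
lda_model = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda_model.fit_transform(X, y)
print(X_1d.shape)  # (200, 1)

# Linear SVM: the separating hyperplane is determined by a few support vectors near the margin
svm_model = SVC(kernel='linear', C=1000)  # large C: nearly hard margin
svm_model.fit(X, y)
print(svm_model.support_vectors_)             # the margin-defining points
print(svm_model.coef_, svm_model.intercept_)  # hyperplane parameters w·x + b = 0
```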
2, LDA implementation code
- Import the packages to be used
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as lda  # Import the LDA algorithm
from sklearn.datasets import make_classification  # Import the classification data generator
import matplotlib.pyplot as plt  # Import plotting tools
import numpy as np
import pandas as pd
```
- Generate the dataset with the make_classification function imported above
```python
x, y = make_classification(n_samples=500, n_features=2, n_redundant=0, n_classes=2,
                           n_informative=1, n_clusters_per_class=1,
                           class_sep=0.5, random_state=100)
"""
n_features: total number of features = n_informative + n_redundant + n_repeated
n_informative: number of informative features
n_redundant: redundant features, random linear combinations of the informative features
n_repeated: repeated features, drawn at random from the informative and redundant features
n_classes: number of classes
n_clusters_per_class: number of clusters each class is made of
"""
plt.scatter(x[:, 0], x[:, 1], marker='o', c=y)
plt.show()
```
- The dataset is split into a training set and a test set at a ratio of 6:4. After training, the accuracy is computed on the test set
```python
# Split into training and test sets, then train and evaluate the model
x_train = x[:300]
y_train = y[:300]
x_test = x[300:]
y_test = y[300:]

lda_test = lda()
lda_test.fit(x_train, y_train)
predict_y = lda_test.predict(x_test)  # Get the predicted labels

count = 0
for i in range(len(predict_y)):
    if predict_y[i] == y_test[i]:
        count += 1
print("The number of accurate predictions is " + str(count))
print("The accuracy is " + str(count / len(predict_y)))
```
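The same accuracy can also be read directly from the estimator's score method, a standard scikit-learn shortcut that replaces the manual counting loop above:

```python
# Equivalent accuracy using the built-in scorer
print("The accuracy is " + str(lda_test.score(x_test, y_test)))
```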
3, Visualizing SVM classification on a dataset
1. Linear kernel
- Import package
```python
# Import the moons dataset and the SVM method (this is the linear SVM)
from sklearn import datasets                     # Import datasets
from sklearn.svm import LinearSVC                # Import linear SVM
from matplotlib.colors import ListedColormap
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
```
- Obtain data
```python
data_x, data_y = datasets.make_moons(noise=0.15, random_state=777)  # Generate the moons dataset
# random_state is the random seed; noise is the standard deviation of the Gaussian noise added to the data
plt.scatter(data_x[data_y==0, 0], data_x[data_y==0, 1])
plt.scatter(data_x[data_y==1, 0], data_x[data_y==1, 1])
data_x = data_x[data_y<2, :2]  # Keep only samples with data_y less than 2, and only the first two features
plt.show()
```
- Standardize the data and train the classifier
```python
scaler = StandardScaler()          # Standardization
scaler.fit(data_x)                 # Compute the mean and variance of the training data
data_x = scaler.transform(data_x)  # Use the mean and variance stored in scaler to standardize X
liner_svc = LinearSVC(C=1e9, max_iter=100000)
# Linear SVM classifier: max_iter is the number of iterations; C controls the fault tolerance.
# The larger C is, the smaller the tolerance for misclassified points.
liner_svc.fit(data_x, data_y)
```
As the figure shows, warnings are printed, but results are still produced; the warnings can be ignored.
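If the warning is distracting, it is most likely a ConvergenceWarning from LinearSVC with such a large C; assuming that is the case, it can be silenced explicitly (purely optional):

```python
# Optional: silence the warning, assuming it is LinearSVC's ConvergenceWarning
import warnings
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings("ignore", category=ConvergenceWarning)
```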
- Write a decision-boundary plotting function to prepare for the visual classification below
```python
# Decision-boundary plotting function
def plot_decision_boundary(model, axis):
    x0, x1 = np.meshgrid(
        np.linspace(axis[0], axis[1], int((axis[1]-axis[0])*100)).reshape(-1, 1),
        np.linspace(axis[2], axis[3], int((axis[3]-axis[2])*100)).reshape(-1, 1))
    # meshgrid returns coordinate matrices built from the coordinate vectors
    x_new = np.c_[x0.ravel(), x1.ravel()]
    y_predict = model.predict(x_new)  # Predicted class for every grid point
    zz = y_predict.reshape(x0.shape)
    custom_cmap = ListedColormap(['#EF9A9A', '#FFF59D', '#90CAF9'])
    plt.contourf(x0, x1, zz, cmap=custom_cmap)
```
- Plot the decision boundary and print the parameter weights and model intercept
```python
# Plot the boundary and display the weights and intercept
plot_decision_boundary(liner_svc, axis=[-3, 3, -3, 3])
plt.scatter(data_x[data_y==0, 0], data_x[data_y==0, 1], color='red')
plt.scatter(data_x[data_y==1, 0], data_x[data_y==1, 1], color='blue')
plt.show()
print('Parameter weights')
print(liner_svc.coef_)
print('Model intercept')
print(liner_svc.intercept_)
```
2. Polynomial kernel
- Import packages, including the pipeline and polynomial features
```python
# This is the polynomial SVM
from sklearn import datasets              # Import datasets
from sklearn.svm import LinearSVC         # Import linear SVM
from sklearn.pipeline import Pipeline     # Import pipelines
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, PolynomialFeatures  # Import standardization and polynomial features
```
- Generate the data, again the moons dataset, the same as in the linear SVM example
```python
data_x, data_y = datasets.make_moons(noise=0.15, random_state=777)  # Generate the moons dataset
# random_state is the random seed; noise is the standard deviation of the Gaussian noise added to the data
plt.scatter(data_x[data_y==0, 0], data_x[data_y==0, 1])
plt.scatter(data_x[data_y==1, 0], data_x[data_y==1, 1])
data_x = data_x[data_y<2, :2]  # Keep only samples with data_y less than 2, and only the first two features
plt.show()
```
- A Pipeline is used to chain the preprocessing and classification steps together; for convenience, it is wrapped in a function
```python
def PolynomialSVC(degree, c=10):  # Polynomial SVM
    return Pipeline([
        # Map the source data to polynomial features of the given degree
        ("poly_features", PolynomialFeatures(degree=degree)),
        # Standardization
        ("scaler", StandardScaler()),
        # Linear SVC classifier
        ("svm_clf", LinearSVC(C=c, loss="hinge", random_state=42, max_iter=10000))
    ])
```
- Model training and drawing
```python
# Model training and plotting
poly_svc = PolynomialSVC(degree=3)
poly_svc.fit(data_x, data_y)
plot_decision_boundary(poly_svc, axis=[-1.5, 2.5, -1.0, 1.5])  # Draw the decision boundary
plt.scatter(data_x[data_y==0, 0], data_x[data_y==0, 1], color='red')   # Draw the points
plt.scatter(data_x[data_y==1, 0], data_x[data_y==1, 1], color='blue')
plt.show()
print('Parameter weights')
print(poly_svc.named_steps['svm_clf'].coef_)
print('Model intercept')
print(poly_svc.named_steps['svm_clf'].intercept_)
```
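For reference, a similar curved boundary can also be obtained without expanding the features explicitly, by letting SVC apply a polynomial kernel internally (the kernel trick). This is a hedged sketch, not part of the original post; the helper name PolynomialKernelSVC is my own, and it reuses the plot_decision_boundary function and the moons data from above.

```python
# Alternative sketch: polynomial kernel handled inside SVC instead of PolynomialFeatures + LinearSVC
from sklearn.svm import SVC

def PolynomialKernelSVC(degree, c=1.0):
    return Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="poly", degree=degree, coef0=1, C=c))
    ])

poly_kernel_svc = PolynomialKernelSVC(degree=3)
poly_kernel_svc.fit(data_x, data_y)
plot_decision_boundary(poly_kernel_svc, axis=[-1.5, 2.5, -1.0, 1.5])
plt.scatter(data_x[data_y==0, 0], data_x[data_y==0, 1], color='red')
plt.scatter(data_x[data_y==1, 0], data_x[data_y==1, 1], color='blue')
plt.show()
```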
3. Gaussian kernel
- Import package
```python
# Import packages
from sklearn import datasets              # Import datasets
from sklearn.svm import SVC               # Import SVC
from sklearn.pipeline import Pipeline     # Import pipelines
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler  # Import standardization
```
- Define the Gaussian (RBF) kernel SVM model
```python
def RBFKernelSVC(gamma=1.0):
    return Pipeline([
        ('std_scaler', StandardScaler()),
        ('svc', SVC(kernel='rbf', gamma=gamma))
    ])
```
- Train the model and plot the result. The gamma parameter is also very important: the larger gamma is, the narrower the influence of each support vector becomes, which, like C, controls how tightly the model fits the training data. Changing the gamma value changes the shape of the decision region, as the comparison sketch after the training code below shows
```python
svc = RBFKernelSVC(gamma=100)  # The larger gamma is, the narrower each support vector's influence
svc.fit(data_x, data_y)
plot_decision_boundary(svc, axis=[-1.5, 2.5, -1.0, 1.5])
plt.scatter(data_x[data_y==0, 0], data_x[data_y==0, 1], color='red')   # Draw the points
plt.scatter(data_x[data_y==1, 0], data_x[data_y==1, 1], color='blue')
plt.show()
```
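To see how gamma reshapes the decision region, it helps to refit the same pipeline with a few different values; a small experiment sketch, not from the original post, reusing RBFKernelSVC and plot_decision_boundary defined above:

```python
# Sketch: refit with several gamma values to watch the decision region tighten as gamma grows
for g in [0.1, 1, 10, 100]:
    svc = RBFKernelSVC(gamma=g)
    svc.fit(data_x, data_y)
    plot_decision_boundary(svc, axis=[-1.5, 2.5, -1.0, 1.5])
    plt.scatter(data_x[data_y==0, 0], data_x[data_y==0, 1], color='red')
    plt.scatter(data_x[data_y==1, 0], data_x[data_y==1, 1], color='blue')
    plt.title("gamma = " + str(g))
    plt.show()
```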
4, Summary
We now have a preliminary understanding of, and some hands-on practice with, LDA and SVM, and our grasp of what they do is a little deeper than before. LDA projects all the points onto a line and then separates the classes along that line; SVM finds the support vectors near the boundary and uses them to determine the decision boundary. Our understanding of the underlying principles is still quite shallow, and we need to study them more deeply.