Linear models make predictions using a linear function of the input features. The various linear model algorithms differ from one another in two ways:
(1) How a particular combination of coefficients and intercept is judged to fit the training data. Different algorithms measure this fit to the training set differently; the measure is called the loss function.
(2) Whether to use regularization and which regularization method to use
The main parameter of linear models is the regularization parameter. If you expect only a few features to be truly important, use L1 regularization; otherwise, use L2 regularization by default.
When working with large datasets, it is worth studying the solver='sag' option of LogisticRegression and Ridge, which can be faster than the default solver.
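A minimal sketch of switching to the 'sag' solver, using the extended Boston dataset that appears later in these notes; the speed-up mainly shows on much larger datasets than this:

from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# 'sag' (stochastic average gradient descent) can be faster than the default solver on large data
ridge_sag = Ridge(solver='sag', max_iter=10000).fit(X_train, y_train)
print('Test set score:{:.2f}'.format(ridge_sag.score(X_test, y_test)))
# The same option exists for classification, e.g. LogisticRegression(solver='sag')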
Linear Model for Regression
y = w_i * x_i + b
x_i is a single feature of a data point, w_i is the slope along the i-th feature axis (the weight of that input feature), w_i and b are the parameters learned by the model, and y is the model's prediction.
The learned parameters w_0 and b:
import mglearn
mglearn.plots.plot_linear_regression_wave()
Linear Regression (Ordinary Least Squares)
Linear regression finds the parameters w and b that minimize the mean squared error between the predictions on the training set and the true regression targets y.
Mean squared error: the sum of the squared differences between the predicted values and the true values, divided by the number of samples.
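In symbols, with n samples and prediction ŷ_i for sample i:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2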
# Prediction on the wave dataset with linear regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
# scikit-learn always stores values derived from the training data in attributes
# that end with a trailing underscore, to distinguish them from parameters set by the user.
print('lr.coef_:{}'.format(lr.coef_))
print('lr.intercept_:{}'.format(lr.intercept_))
# If the scores on the training set and the test set are very close, the model may be underfitting.
print('Training set score:{:.2f}'.format(lr.score(X_train, y_train)))
print('Test set score:{:.2f}'.format(lr.score(X_test, y_test)))
# Linear regression on a high-dimensional dataset: the extended Boston housing dataset
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lr = LinearRegression().fit(X_train, y_train)
# A large performance gap between the training set and the test set is a clear sign of overfitting.
print('Training set score:{:.2f}'.format(lr.score(X_train, y_train)))
print('Test set score:{:.2f}'.format(lr.score(X_test, y_test)))
Ridge Regression
The prediction formula of ridge regression is the same as for ordinary least squares, but ridge regression adds an L2 regularization constraint so that the influence of each feature on the output is as small as possible. A larger alpha means a more constrained model; we expect the coef_ entries for a larger alpha to be smaller in magnitude than those for a smaller alpha.
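For reference, a sketch of the ridge objective in the notation above (this is the form scikit-learn's Ridge uses):

\min_{w, b} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j} w_j^2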
# Ridge on a high-dimensional dataset: the extended Boston housing dataset
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
ridge = Ridge().fit(X_train, y_train)
# Ridge scores lower than LinearRegression on the training set but higher on the test set.
# The linear model overfits the data; Ridge is a more constrained model and is less prone to overfitting.
print('Training set score:{:.2f}'.format(ridge.score(X_train, y_train)))
print('Test set score:{:.2f}'.format(ridge.score(X_test, y_test)))
# Adjusting alpha: increasing alpha pushes the coefficients toward zero,
# which lowers training-set performance but may (!) improve generalization.
# Ridge on the extended Boston housing dataset
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default alpha = 1.0
ridge = Ridge().fit(X_train, y_train)
print('Training set score:{:.2f}'.format(ridge.score(X_train, y_train)))
print('Test set score:{:.2f}'.format(ridge.score(X_test, y_test)))

# alpha = 10
ridge10 = Ridge(alpha=10).fit(X_train, y_train)
print('Training set score:{:.2f}'.format(ridge10.score(X_train, y_train)))
print('Test set score:{:.2f}'.format(ridge10.score(X_test, y_test)))

# alpha = 0.1
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
print('Training set score:{:.2f}'.format(ridge01.score(X_train, y_train)))
print('Test set score:{:.2f}'.format(ridge01.score(X_test, y_test)))

# Plot the coefficients of the three models
plt.plot(ridge.coef_, 's', label="Ridge alpha = 1")
plt.plot(ridge10.coef_, '^', label="Ridge alpha = 10")
plt.plot(ridge01.coef_, 'v', label="Ridge alpha = 0.1")
plt.xlabel("Coefficient index")  # x = i corresponds to the coefficient of the i-th feature
plt.ylabel("Coefficient magnitude")  # the y-axis shows the value of each coefficient
plt.hlines(0, 0, len(ridge.coef_))  # horizontal reference line at zero
plt.ylim(-25, 25)  # set the y-axis limits
plt.legend(loc='best')
import mglearn
# Fix the alpha value and vary the amount of training data.
# LinearRegression and Ridge(alpha=1) are evaluated on subsamples of increasing size
# drawn from the Boston housing data (a learning curve).
mglearn.plots.plot_ridge_n_samples()
# The training performance of linear regression declines as more data is added.
# With enough data, regularization becomes less important.
Lasso
Lasso also constrains the coefficients toward 0, but with a different method, L1 regularization. The consequence of L1 regularization is that with Lasso some coefficients are exactly 0, which can be viewed as automatic feature selection.
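Roughly, following scikit-learn's formulation, Lasso minimizes:

\min_{w, b} \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j} |w_j|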
# Applying Lasso to the extended Boston housing dataset
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
import mglearn
import numpy as np
import matplotlib.pyplot as plt

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lasso = Lasso().fit(X_train, y_train)
# Poor performance on both the training set and the test set indicates underfitting.
print('Training set score:{:.2f}'.format(lasso.score(X_train, y_train)))
print('Test set score:{:.2f}'.format(lasso.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(lasso.coef_ != 0)))  # number of non-zero coefficients

# Lasso also has a regularization parameter alpha (default 1.0) that controls how strongly the
# coefficients are pushed toward zero. To reduce underfitting, decrease alpha and increase
# max_iter (the maximum number of iterations), fitting a more complex model.
lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print('Training set score:{:.2f}'.format(lasso001.score(X_train, y_train)))
print('Test set score:{:.2f}'.format(lasso001.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(lasso001.coef_ != 0)))

# Setting alpha too small removes the effect of regularization and leads to overfitting.
lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
print('Training set score:{:.2f}'.format(lasso00001.score(X_train, y_train)))
print('Test set score:{:.2f}'.format(lasso00001.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(lasso00001.coef_ != 0)))

plt.plot(lasso.coef_, 's', label='Lasso alpha = 1')
plt.plot(lasso001.coef_, '^', label='Lasso alpha = 0.01')
plt.plot(lasso00001.coef_, 'v', label='Lasso alpha = 0.0001')
plt.xlabel('Coefficient index')
plt.ylabel('Coefficient magnitude')
plt.legend(ncol=2, loc=(0, 1.05))  # legend with 2 columns
plt.ylim(-25, 25)
scikit-learn also provides the ElasticNet class, which combines the Lasso and Ridge penalties; it requires adjusting two parameters, one for L1 regularization and one for L2 regularization.
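A minimal sketch (not from the original notes) of ElasticNet on the same extended Boston data; the alpha and l1_ratio values here are illustrative, not tuned:

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
import mglearn
import numpy as np

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# l1_ratio mixes the two penalties: 1.0 is pure Lasso (L1), 0.0 is pure Ridge (L2)
enet = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=100000).fit(X_train, y_train)
print('Training set score:{:.2f}'.format(enet.score(X_train, y_train)))
print('Test set score:{:.2f}'.format(enet.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(enet.coef_ != 0)))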
Linear Model for Classification
y = w_i * x_i + b > 0
Instead of returning the weighted sum of the features directly, the prediction thresholds it at 0: if y < 0, the predicted class is -1; if y > 0, the predicted class is +1. For linear models used for classification, the decision boundary is a linear function of the input; that is, a linear classifier separates two classes with a line, a plane, or a hyperplane.
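A minimal sketch (not from the original notes) showing that thresholding the decision function at 0 reproduces the classifier's predictions on the forge dataset:

from sklearn.linear_model import LogisticRegression
import mglearn
import numpy as np

X, y = mglearn.datasets.make_forge()
clf = LogisticRegression().fit(X, y)
scores = clf.decision_function(X)  # the value of w·x + b for each sample
manual = np.where(scores > 0, clf.classes_[1], clf.classes_[0])  # threshold at 0
print(np.all(manual == clf.predict(X)))  # True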
# Apply two linear classification models to the forge dataset and visualize the decision boundaries.
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.svm import LinearSVC  # linear support vector machine
import matplotlib.pyplot as plt
import mglearn

X, y = mglearn.datasets.make_forge()
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
for model, ax in zip([LinearSVC(), LogisticRegression()], axes):
    clf = model.fit(X, y)
    # The alpha parameter controls the transparency of the plotted boundary
    mglearn.plots.plot_2d_separator(clf, X, fill=False, eps=0.5, ax=ax, alpha=0.7)  # visualize the decision boundary
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)  # plot the data points
    ax.set_title("{}".format(clf.__class__.__name__))
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
ax.legend(loc="best")
LogisticRegression and LinearSVC use L2 regularization by default, and the trade-off parameter that determines the strength of regularization is called C. The larger the value of C, the weaker the regularization.
# Decision boundaries of a linear SVM on the forge dataset for different values of C
import mglearn
mglearn.plots.plot_linear_svc_regularization()
Linear models for classification are very powerful in high-dimensional spaces. As the number of features grows, avoiding overfitting becomes increasingly important.
# A closer look at logistic regression on the breast cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# C = 1.0
logreg = LogisticRegression().fit(X_train, y_train)
print("Training set score:{:.2f}".format(logreg.score(X_train, y_train)))
print("Test set score:{:.3f}".format(logreg.score(X_test, y_test)))

# C = 100
logreg100 = LogisticRegression(C=100).fit(X_train, y_train)
print("Training set score:{:.2f}".format(logreg100.score(X_train, y_train)))
print("Test set score:{:.3f}".format(logreg100.score(X_test, y_test)))

# C = 0.01
logreg001 = LogisticRegression(C=0.01).fit(X_train, y_train)
print("Training set score:{:.2f}".format(logreg001.score(X_train, y_train)))
print("Test set score:{:.3f}".format(logreg001.score(X_test, y_test)))

plt.plot(logreg.coef_.T, 'o', label="C = 1")
plt.plot(logreg100.coef_.T, '^', label="C = 100")
plt.plot(logreg001.coef_.T, 'v', label="C = 0.01")
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.hlines(0, 0, cancer.data.shape[1])
plt.ylim(-5, 5)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.legend()
# The sign of a coefficient tells us which class that feature is associated with.

# Logistic regression with L1 regularization
# (newer scikit-learn versions require a solver that supports L1, e.g. solver='liblinear')
for C, marker in zip([0.001, 1, 100], ['o', '^', 'v']):
    lr_l1 = LogisticRegression(C=C, penalty="l1", solver="liblinear").fit(X_train, y_train)
    print("Training accuracy of l1 logreg with C ={:.3f}:{:.2f}".format(C, lr_l1.score(X_train, y_train)))
    print("Test accuracy of l1 logreg with C ={:.3f}:{:.2f}".format(C, lr_l1.score(X_test, y_test)))
    plt.plot(lr_l1.coef_.T, marker, label="C={:.3f}".format(C))
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.hlines(0, 0, cancer.data.shape[1])
plt.ylim(-5, 5)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.legend(loc=3)
The penalty parameter of the model influences regularization, that is, whether the model uses all available features or only selects a subset of features.
Linear Model for Multi-Classification
Many linear classification models are designed only for binary classification and do not extend naturally to multiclass problems. A common technique for extending a binary classification algorithm to multiclass is one-vs-rest: a binary classification model is learned for each class, trying to separate that class from all the other classes. Each class then has its own binary classifier, so each class has a coefficient vector w and an intercept b; the class whose classifier gives the largest value is the predicted class label.
# A two-dimensional toy dataset with 3 classes
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
import mglearn
import numpy as np

X, y = make_blobs(random_state=42)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)

# Train a LinearSVC classifier
linear_svm = LinearSVC().fit(X, y)
print("Coefficient shape:", linear_svm.coef_.shape)  # three rows (one per class), two features
print("Intercept shape:", linear_svm.intercept_.shape)

line = np.linspace(-15, 15)
for coef, intercept, color in zip(linear_svm.coef_, linear_svm.intercept_, ['b', 'r', 'g']):
    plt.plot(line, -(line * coef[0] + intercept) / coef[1], c=color)
plt.ylim(-10, 15)
plt.xlim(-10, 8)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.legend(["Class 0", "Class 1", "Class 2", "Line class 0", "Line class 1", "Line class 2"], loc=(1.01, 0.3))
mglearn.plots.plot_2d_classification(linear_svm, X, fill=True, alpha=.7)  # visualize the decision regions
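A minimal sketch (using the same make_blobs data as above) confirming the one-vs-rest rule described earlier: the predicted class is the one whose binary classifier returns the largest score.

from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC
import numpy as np

X, y = make_blobs(random_state=42)
linear_svm = LinearSVC().fit(X, y)
decision = linear_svm.decision_function(X)  # shape (n_samples, 3): one score per class
# The class with the highest score matches predict() for every sample
print(np.all(linear_svm.classes_[np.argmax(decision, axis=1)] == linear_svm.predict(X)))  # True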
Questions about the Code and Methods
train_test_split(X, y, stratify=y)
https://blog.csdn.net/weixin_37226516/article/details/62042550
Ordinary Least Squares (OLS)
https://blog.csdn.net/enjoy524/article/details/53556038
Details of Python's plt.plot() function
https://blog.csdn.net/cjcrxzz/article/details/79627483
Setting axis tick intervals in Matplotlib
https://blog.csdn.net/ccy950903/article/details/50688449
Matrix theory: vector norm and matrix norm
https://blog.csdn.net/pipisorry/article/details/51030563
Regularization and Understanding of Regularization Terms
https://blog.csdn.net/gshgsh1228/article/details/52199870
Deep Learning-L0, L1 and L2 Norms
https://blog.csdn.net/zchang81/article/details/70208061
Machine learning - sklearn.Lasso
https://www.jianshu.com/p/1177a0bcb306