- Linear regression model
The univariate linear regression model uses a single feature to predict the response, and the best-fit line is obtained by minimizing the error between the predicted values and the observed values.
The multiple regression model uses several independent variables to estimate the dependent variable, so it can both explain and predict the dependent variable's value.
Advantages: simple model; easy to deploy; fast to train; the regression weights can be used to interpret the results.
Disadvantages: relatively low accuracy; sensitive to collinearity among features.
Tips: normalize the features, and use feature selection to avoid keeping highly correlated features at the same time.
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

clf = LinearRegression()
clf.fit(train_data, train_target)
test_pred = clf.predict(test_data)
# mean_squared_error expects (y_true, y_pred)
score = mean_squared_error(test_target, test_pred)
score
```
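To make the normalization tip above concrete, here is a minimal sketch that standardizes the features inside a pipeline before fitting. The data set and split are invented for this example; the original `train_data`/`test_data` are assumed to come from elsewhere.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

# Standardize the features before fitting, as recommended in the tips
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
score = mean_squared_error(y_test, model.predict(X_test))
```

A pipeline keeps the scaler's statistics learned on the training split only, which avoids leaking test-set information into the normalization step.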
- K-nearest neighbor regression model
The K-nearest neighbor regression model finds the k nearest neighbors of a sample and predicts the sample's target value as the average of those neighbors' target values.
Advantages: simple and easy to understand; convenient and fast on small data sets; easy to visualize.
Disadvantages: computationally expensive, so not suitable for large data sets; the number of neighbors must be tuned.
Tips: normalize the features, and consider giving important features a proportionally larger weight.
```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

clf = KNeighborsRegressor(n_neighbors=3)
clf.fit(train_data, train_target)
test_pred = clf.predict(test_data)
score = mean_squared_error(test_target, test_pred)
score
```
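The tips above (normalize, weight features) can be sketched as follows: scaling is done in a pipeline, and `weights='distance'` lets closer neighbors count more. The synthetic 1-D data set is invented for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 1-D regression data, invented for this sketch
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.05, size=300)
X_train, X_test, y_train, y_test = X[:250], X[250:], y[:250], y[250:]

# Scale the feature; weights='distance' makes nearer neighbors more influential
knn = make_pipeline(StandardScaler(),
                    KNeighborsRegressor(n_neighbors=3, weights='distance'))
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
```

Because KNN relies on distances, an unscaled feature with a large range would otherwise dominate the neighbor search.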
- Decision tree regression model
Decision tree regression can be understood as dividing the feature space into several subregions according to learned rules, and then representing each subregion by the target values of the training points it contains. A test sample is routed into a subregion according to its features, and the output value of that subregion is returned.
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

clf = DecisionTreeRegressor()
clf.fit(train_data, train_target)
test_pred = clf.predict(test_data)
score = mean_squared_error(test_target, test_pred)
score
```
- Random forest regression model
Random forest is an ensemble learning algorithm that combines multiple trees; its basic unit is the decision tree. For regression, a random forest outputs the average of the outputs of all its decision trees. Its main advantages are excellent accuracy, the ability to run on large data sets, and the ability to handle high-dimensional input samples without dimensionality reduction. It can also evaluate the importance of each feature, obtain an unbiased estimate of the generalization error during training (the out-of-bag estimate), and still give good results when some values are missing.
Advantages: easy to use; features need little transformation; high accuracy; the trees can be trained in parallel efficiently.
Disadvantages: the results are hard to interpret.
Tips: tune the hyperparameters to improve accuracy.
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

clf = RandomForestRegressor(n_estimators=100)
clf.fit(train_data, train_target)
test_pred = clf.predict(test_data)
score = mean_squared_error(test_target, test_pred)
score
```
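The feature-importance evaluation mentioned above is exposed through the fitted model's `feature_importances_` attribute. A minimal sketch on synthetic data (invented for this example), where only the first feature drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for this sketch: only feature 0 is informative
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)
# Impurity-based importances sum to 1; the informative feature should dominate
importances = rf.feature_importances_
```

On this data the importance of feature 0 is close to 1, while the noise features receive values near 0.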
- LightGBM regression model
LightGBM supports efficient parallel training, offering faster training, lower memory consumption, and better accuracy; it also supports distributed training and can quickly process massive data.
Advantages: high accuracy.
Disadvantages: long training time; complex model.
Tips: use a proper validation set, guard against overfitting, and search over the hyperparameters.
```python
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

clf = lgb.LGBMRegressor(
    learning_rate=0.01,
    max_depth=-1,
    n_estimators=5000,
    boosting_type='gbdt',
    random_state=2019,
    objective='regression',
)
# fit() returns the estimator itself, not predictions; an eval_set is
# needed for eval_metric to be evaluated during training
clf.fit(train_data, train_target,
        eval_set=[(test_data, test_target)],
        eval_metric='mse',
        callbacks=[lgb.log_evaluation(50)])
test_pred = clf.predict(test_data)
score = mean_squared_error(test_target, test_pred)
score
```
- Ridge regression
- LASSO regression
- Gradient boosting tree regression
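The three models listed above are also available in scikit-learn. A minimal sketch on synthetic data (the data set, split, and hyperparameter values are invented for this example):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data, invented for this sketch
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

scores = {}
for name, model in [('ridge', Ridge(alpha=1.0)),          # L2-penalized linear model
                    ('lasso', Lasso(alpha=0.01)),         # L1-penalized, sparse weights
                    ('gbdt', GradientBoostingRegressor(n_estimators=200))]:
    model.fit(X_train, y_train)
    scores[name] = mean_squared_error(y_test, model.predict(X_test))
```

Ridge shrinks all weights toward zero, LASSO can drive uninformative weights exactly to zero, and gradient boosting fits trees sequentially to the residuals of the previous ones.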