Regression and related models

Keywords: Machine Learning, sklearn

  1. Linear regression model
    The univariate linear regression model uses a single feature to predict the response, and the best-fit line is obtained by minimizing the error between the predicted values and the true values.

The multiple regression model uses several independent variables to estimate the dependent variable, in order to explain and predict its value.

Advantages: simple model, easy to deploy, fast to train; the regression weights can be used to interpret the results
Disadvantages: low accuracy; sensitive to collinear features
Tips: normalize the features, and use feature selection to avoid keeping highly correlated features at the same time (see the scaling sketch after the code below)

from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

# Fit an ordinary least squares model on the training set
clf = LinearRegression()
clf.fit(train_data, train_target)

# Evaluate with mean squared error on the held-out set
test_pred = clf.predict(test_data)
score = mean_squared_error(test_target, test_pred)
score
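
The normalization tip above can be implemented with sklearn's StandardScaler. A minimal sketch, reusing the train/test variables and imports from the code above; note that the scaler must be fit on the training set only:

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply it to both sets
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_data)
test_scaled = scaler.transform(test_data)

clf = LinearRegression()
clf.fit(train_scaled, train_target)
score = mean_squared_error(test_target, clf.predict(test_scaled))
score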
  2. K-nearest neighbor regression model
    The K-nearest neighbor regression model finds the k nearest neighbors of a sample and assigns the average of an attribute of those neighbors to the sample, giving the value of the corresponding attribute for that sample.
    Advantages: the model is simple, easy to understand, fast on small data sets, and convenient to visualize
    Disadvantages: computationally expensive, so it is not suitable for large data sets, and its parameters need tuning
    Tips: features must be normalized, and important features can be weighted more heavily (see the pipeline sketch after the code below)
from sklearn.neighbors import KNeighborsRegressor

# Predict each sample as the average of its 3 nearest neighbors
clf = KNeighborsRegressor(n_neighbors=3)
clf.fit(train_data, train_target)

test_pred = clf.predict(test_data)
score = mean_squared_error(test_target, test_pred)
score
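
Because KNN relies on distances between samples, the normalization tip is especially important here. A minimal sketch, reusing the variables above, that chains scaling and regression with sklearn's Pipeline so the scaler is always fit on the training data:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, then run KNN in the scaled space
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
model.fit(train_data, train_target)
score = mean_squared_error(test_target, model.predict(test_data))
score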
  3. Decision tree regression model
    Decision tree regression can be understood as dividing the feature space into several subspaces according to certain rules, and then using the information of all points in a subspace to represent that subspace. A test sample only needs to be routed to a subspace according to its features, and the output value of that subspace is its prediction.
from sklearn.tree import DecisionTreeRegressor

# A single regression tree; an unconstrained tree tends to overfit,
# so limiting max_depth is a common first tuning step
clf = DecisionTreeRegressor()
clf.fit(train_data, train_target)

test_pred = clf.predict(test_data)
score = mean_squared_error(test_target, test_pred)
score
  4. Random forest regression model
    Random forest is an ensemble learning algorithm that combines multiple trees, with the decision tree as its basic unit. For regression problems, the random forest outputs the average of all the decision trees' outputs. Its main advantages: it achieves excellent accuracy among common algorithms; it can run on large data sets and handle input samples with high-dimensional features without dimensionality reduction; it can evaluate the importance of each feature (see the sketch after the code below); it yields an unbiased estimate of the generalization error during training (the out-of-bag error); and it still gives good results when some values are missing.
    Advantages: easy to use; features do not need much transformation; high accuracy; the trees can be trained in parallel
    Disadvantages: the results are not easy to interpret
    Tips: tune the parameters to improve accuracy
from sklearn.ensemble import RandomForestRegressor

# Ensemble of 100 trees; the prediction is the average over the trees
clf = RandomForestRegressor(n_estimators=100)
clf.fit(train_data, train_target)

test_pred = clf.predict(test_data)
score = mean_squared_error(test_target, test_pred)
score
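
The paragraph above mentions evaluating feature importance. A minimal sketch using the fitted clf from the code above; feature_names is a hypothetical list of column names, not defined earlier:

import numpy as np

# Impurity-based importances: one value per input feature, summing to 1
importances = clf.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(feature_names[idx], importances[idx])  # feature_names is assumed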
  5. LightGBM regression model
    LightGBM supports efficient parallel and distributed training, with faster training speed, lower memory consumption and better accuracy, so it can quickly process massive data.
    Advantages: high accuracy
    Disadvantages: long training time and a complex model
    Tips: use an effective validation set to prevent overfitting, and search the parameter space (see the early-stopping sketch after the code below)
import lightgbm as lgb

clf = lgb.LGBMRegressor(
    learning_rate=0.01,
    max_depth=-1,
    n_estimators=5000,
    boosting_type='gbdt',
    random_state=2019,
    objective='regression'
)

# fit() returns the fitted model, not predictions, so predict separately
clf.fit(train_data, train_target)
test_pred = clf.predict(test_data)
score = mean_squared_error(test_target, test_pred)
score
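
The tip about an effective validation set can be implemented with early stopping. A minimal sketch, assuming hypothetical validation arrays val_data/val_target split off from the training set; recent LightGBM versions pass early stopping through callbacks:

clf = lgb.LGBMRegressor(
    learning_rate=0.01,
    n_estimators=5000,
    objective='regression',
    random_state=2019
)

# Stop adding trees once the validation MSE ('l2') has not improved for 100 rounds
clf.fit(
    train_data, train_target,
    eval_set=[(val_data, val_target)],  # val_data/val_target are assumed
    eval_metric='l2',
    callbacks=[lgb.early_stopping(stopping_rounds=100)]
)
score = mean_squared_error(test_target, clf.predict(test_data))
score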

Ridge regression

LASSO regression

Gradient boosting tree regression
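
These three models are not covered above. A minimal sketch of each with sklearn, reusing the train/test variables from earlier; the alpha values are illustrative, not tuned:

from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import GradientBoostingRegressor

models = {
    'Ridge': Ridge(alpha=1.0),           # L2-penalized linear regression
    'LASSO': Lasso(alpha=0.1),           # L1 penalty drives some weights to zero
    'GBDT': GradientBoostingRegressor(), # trees fit sequentially on residuals
}
for name, model in models.items():
    model.fit(train_data, train_target)
    print(name, mean_squared_error(test_target, model.predict(test_data)))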
