A case study of machine learning on the Boston housing dataset


The goal of a regression problem is to predict a continuous variable.

Data description

# Import Boston house price data reader from sklearn.datasets
from sklearn.datasets import load_boston

# Load the house price data and store it in the variable boston
# (note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
# this example assumes an older version)
boston = load_boston()

# Output the data description
print(boston.DESCR)

Number of Instances: 506
Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
Missing Attribute Values: None
As shown above, the dataset contains 506 samples of Boston-area housing prices, each described by 13 numerical features together with a target price (the median value). There are no missing attribute/feature values in the data.
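
As a quick sanity check (a minimal sketch using only the boston object loaded above), the shape of the feature matrix and the 13 feature names can be inspected directly:

# Inspect the dimensions of the feature matrix and the names of the 13 features
print(boston.data.shape)       # (506, 13)
print(boston.feature_names)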

Data processing

from sklearn.model_selection import train_test_split
import numpy as np

X = boston.data
y = boston.target

# Randomly sample 25% of the data as the test set; the rest is used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# Inspect the spread of the regression target values
print("The max target value is", np.max(boston.target))
print("The min target value is", np.min(boston.target))
print("The average target value is", np.mean(boston.target))


The data exploration above shows a large spread in the target house prices, so we standardize both the features and the target values.

# Import data standardization module from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler

# Initialize the standardizer for feature and target values respectively
ss_X = StandardScaler()
ss_y = StandardScaler()

# Standardize the characteristics and target values of training and test data respectively
X_train = ss_X.fit_transform(X_train)
X_test = ss_X.transform(X_test)
y_train = ss_y.fit_transform(y_train.reshape(-1,1))
y_test = ss_y.transform(y_test.reshape(-1,1))

After standardization, the training and test target sets should have roughly zero mean and unit variance.
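
This can be verified directly (a small check using the numpy import and the arrays produced above):

# The training targets should have mean ~0 and std ~1 exactly; the test targets
# only approximately, since the scaler was fitted on the training set alone
print(np.mean(y_train), np.std(y_train))
print(np.mean(y_test), np.std(y_test))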

Try a linear model

Here we try two linear models: LinearRegression and SGDRegressor.

# Import LinearRegression from sklearn.linear_model
from sklearn.linear_model import LinearRegression

# Initializing the linear regression with the default configuration
lr = LinearRegression()

# Parameter estimation using training data
lr.fit(X_train, y_train)

# Regression prediction of test data
lr_y_predict = lr.predict(X_test)

# Import SGDRegressor from sklearn.linear_model
from sklearn.linear_model import SGDRegressor

# Initialize SGDRegressor with the default configuration
sgdr = SGDRegressor()

# Parameter estimation using training data (SGDRegressor expects a 1-D target, hence ravel)
sgdr.fit(X_train, y_train.ravel())

# Regression prediction of test data
sgdr_y_predict = sgdr.predict(X_test)

Linear model evaluation

We evaluate the models by mean absolute error (MAE), mean squared error (MSE) and R-squared.
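
For reference, with $y_i$ the true values, $\hat{y}_i$ the predictions and $\bar{y}$ the mean of the true values over $n$ test samples:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,\qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,\qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$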
Evaluation of LinearRegression

# Use the evaluation module of LinearRegression model to output the evaluation results
print('The value of default measurement of LinearRegression is', lr.score(X_test, y_test))

# Import r2_score, mean_squared_error and mean_absolute_error from sklearn.metrics for regression performance evaluation
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Use the r2_score module and output the evaluation results
print('The value of R-squared of LinearRegression is', r2_score(y_test, lr_y_predict))

# Use the mean squared error module and output the evaluation results
print('The mean squared error of LinearRegression is', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(lr_y_predict)))

# Use the mean absolute error module and output the evaluation results
print('The mean absolute error of LinearRegression is', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(lr_y_predict)))


SGDRegressor evaluation

# Use the evaluation module of the SGDRegressor model to output the evaluation results
print('The value of default measurement of SGDRegressor is', sgdr.score(X_test, y_test))

# Import r2_score, mean_squared_error and mean_absolute_error from sklearn.metrics for regression performance evaluation
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Use the r2_score module and output the evaluation results
print('The value of R-squared of SGDRegressor is', r2_score(y_test, sgdr_y_predict))

# Use the mean squared error module and output the evaluation results
print('The mean squared error of SGDRegressor is', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(sgdr_y_predict)))

# Use the mean absolute error module and output the evaluation results
print('The mean absolute error of SGDRegressor is', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(sgdr_y_predict)))


Note that the default scoring function of these models (the score method) is the R-squared metric.
SGDRegressor can save a great deal of computation time without losing much performance when the training set is very large. Following the advice on the scikit-learn website, when the number of samples exceeds roughly 100,000, the stochastic gradient method is recommended for estimating the model parameters.
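
The claim that score returns R-squared can be confirmed directly (a small check using the models and metrics already loaded above):

# The default score() of a regressor and r2_score should agree
print(np.isclose(lr.score(X_test, y_test), r2_score(y_test, lr_y_predict)))
print(np.isclose(sgdr.score(X_test, y_test), r2_score(y_test, sgdr_y_predict)))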

Try support vector machine model

We continue to use the split and standardized training and test data,
and try support vector machine regression with three different kernel functions.

# Importing support vector machine (regression) model from sklearn.svm
from sklearn.svm import SVR

# Train an SVR with a linear kernel and predict on the test samples (SVR expects a 1-D target, hence ravel)
linear_svr = SVR(kernel='linear')
linear_svr.fit(X_train, y_train.ravel())
linear_svr_y_predict = linear_svr.predict(X_test)

# Train an SVR with a polynomial kernel and predict on the test samples
poly_svr = SVR(kernel='poly')
poly_svr.fit(X_train, y_train.ravel())
poly_svr_y_predict = poly_svr.predict(X_test)

# Train an SVR with a radial basis function (RBF) kernel and predict on the test samples
rbf_svr = SVR(kernel='rbf')
rbf_svr.fit(X_train, y_train.ravel())
rbf_svr_y_predict = rbf_svr.predict(X_test)

Model assessment

Linear kernel support vector machine

# Import R-squared, MSE and MAE from sklearn.metrics for regression performance evaluation
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Output evaluation results
print('R-squared value of linear SVR is', linear_svr.score(X_test, y_test))

# Use the mean squared error module and output the evaluation results
print('The mean squared error of linear SVR is', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(linear_svr_y_predict)))

# Use the mean absolute error module and output the evaluation results
print('The mean absolute error of linear SVR is', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(linear_svr_y_predict)))


Polynomial kernel support vector machine

# Output evaluation results
print('R-squared value of poly SVR is', poly_svr.score(X_test, y_test))

# Use the mean squared error module and output the evaluation results
print('The mean squared error of poly SVR is', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(poly_svr_y_predict)))

# Use the mean absolute error module and output the evaluation results
print('The mean absolute error of poly SVR is', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(poly_svr_y_predict)))


RBF kernel support vector machine

# Output evaluation results
print('R-squared value of rbf SVR is', rbf_svr.score(X_test, y_test))

# Use the mean squared error module and output the evaluation results
print('The mean squared error of rbf SVR is', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(rbf_svr_y_predict)))

# Use the mean absolute error module and output the evaluation results
print('The mean absolute error of rbf SVR is', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(rbf_svr_y_predict)))

Try a non-parametric model

We try the K-nearest-neighbor regression model in two different configurations.

# Import KNeighborsRegressor from sklearn.neighbors
from sklearn.neighbors import KNeighborsRegressor

# Initialize K-nearest-neighbor regression and configure it so that predictions are the unweighted average of the neighbors: weights='uniform'
uni_knr = KNeighborsRegressor(weights='uniform')
uni_knr.fit(X_train, y_train)
uni_knr_y_predict = uni_knr.predict(X_test)

# Initialize K-nearest-neighbor regression and configure it so that predictions are the distance-weighted average of the neighbors: weights='distance'
dis_knr = KNeighborsRegressor(weights='distance')
dis_knr.fit(X_train, y_train)
dis_knr_y_predict = dis_knr.predict(X_test)

Nonparametric model evaluation

Uniform-weighted K-nearest neighbor model

# Use R-squared, MSE and MAE to evaluate the performance of the uniform-weighted K-nearest-neighbor model on the test set
print('R-squared value of uniform-weighted KNeighborsRegressor:', uni_knr.score(X_test, y_test))
print('The mean squared error of uniform-weighted KNeighborsRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(uni_knr_y_predict)))
print('The mean absolute error of uniform-weighted KNeighborsRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(uni_knr_y_predict)))

Distance-weighted K-nearest neighbor model

# Use R-squared, MSE and MAE to evaluate the performance of the distance-weighted K-nearest-neighbor model on the test set
print('R-squared value of distance-weighted KNeighborsRegressor:', dis_knr.score(X_test, y_test))
print('The mean squared error of distance-weighted KNeighborsRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(dis_knr_y_predict)))
print('The mean absolute error of distance-weighted KNeighborsRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(dis_knr_y_predict)))

Try regression tree model

A leaf node of a regression tree returns the mean target value of the group of training samples that falls into it, so the tree predicts a finite set of values rather than an arbitrary continuous output.

# Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor

# Initializing DecisionTreeRegressor with default configuration
dtr = DecisionTreeRegressor()

# Build a regression tree with the Boston house price training data
dtr.fit(X_train, y_train)

# Predict the test data with the single regression tree in its default configuration, and store the predictions in the variable dtr_y_predict
dtr_y_predict = dtr.predict(X_test)

Regression tree model evaluation

# Using R-squared, MSE, and MAE metrics to evaluate the performance of the regression tree of the default configuration on the test set
print('R-squared value of DecisionTreeRegressor:', dtr.score(X_test, y_test))
print('The mean squared error of DecisionTreeRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(dtr_y_predict)))
print('The mean absolute error of DecisionTreeRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(dtr_y_predict)))

Characteristics of tree models

  • Tree models can handle nonlinear relationships between features and the target
  • Tree models do not require feature standardization or a unified numeric representation; both numerical and categorical features can be used directly for building the tree and making predictions
  • Tree models can also output the decision process intuitively, which makes predictions interpretable (see the sketch after this list)
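
As an illustration of the interpretability point, the learned decision rules can be printed directly. A minimal sketch, assuming the dtr model fitted above and a scikit-learn version that provides export_text (0.21 or later):

from sklearn.tree import export_text

# Print the decision rules of the fitted tree, truncated to depth 3 for readability
print(export_text(dtr, feature_names=list(boston.feature_names), max_depth=3))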

At the same time, tree models have some notable defects:

  • Because tree models can fit complex nonlinear relationships, an overly complex model easily loses accuracy when predicting new data (overfitting)
  • The top-down prediction process of a tree can change greatly with slight changes in the data, so prediction stability is poor
  • Building the optimal tree from training data is an NP-hard problem; that is, the best solution cannot be found in limited time, so greedy-style algorithms settle for suboptimal solutions. This is why ensemble models, which combine multiple suboptimal trees, often achieve higher performance

Try ensemble models

Extremely randomized trees (the extreme random forest), unlike an ordinary random forest, do not exhaustively search for the best split when constructing a tree's nodes; instead, each split first draws a random subset of candidate features and random split values, and then a criterion such as information entropy or Gini impurity is used to pick the best of these random candidates.
We try three models: RandomForestRegressor, ExtraTreesRegressor and GradientBoostingRegressor.

# Import RandomForestRegressor, ExtraTreesRegressor and GradientBoostingRegressor from sklearn.ensemble
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
# Train a RandomForestRegressor and predict the test data; store the results in the variable rfr_y_predict
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train.ravel())
rfr_y_predict = rfr.predict(X_test)

# Train an ExtraTreesRegressor and predict the test data; store the results in the variable etr_y_predict
etr = ExtraTreesRegressor()
etr.fit(X_train, y_train.ravel())
etr_y_predict = etr.predict(X_test)

# Train a GradientBoostingRegressor and predict the test data; store the results in the variable gbr_y_predict
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train.ravel())
gbr_y_predict = gbr.predict(X_test)

Ensemble model evaluation

Random regression forest

# Use R-squared, MSE and MAE to evaluate the performance of the default-configuration random forest on the test set
print('R-squared value of RandomForestRegressor:', rfr.score(X_test, y_test))
print('The mean squared error of RandomForestRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(rfr_y_predict)))
print('The mean absolute error of RandomForestRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(rfr_y_predict)))


Extreme regression forest (extremely randomized trees)

# Use R-squared, MSE and MAE to evaluate the performance of the extreme regression forest on the test set
print('R-squared value of ExtraTreesRegressor:', etr.score(X_test, y_test))
print('The mean squared error of ExtraTreesRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(etr_y_predict)))
print('The mean absolute error of ExtraTreesRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(etr_y_predict)))

# Using the trained extreme regression forest, output each feature's contribution to the prediction target, sorted by importance
# (note: np.sort(..., axis=0) would sort the importances and the names independently, breaking the pairing)
print(sorted(zip(etr.feature_importances_, boston.feature_names), reverse=True))


Gradient boosting regressor

# Use R-squared, MSE and MAE to evaluate the performance of the default-configuration GradientBoostingRegressor on the test set
print('R-squared value of GradientBoostingRegressor:', gbr.score(X_test, y_test))
print('The mean squared error of GradientBoostingRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(gbr_y_predict)))
print('The mean absolute error of GradientBoostingRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(gbr_y_predict)))


It is not hard to see that ensemble models often provide better performance and stability.
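
To back up the stability claim, here is a minimal sketch (an addition, assuming the estimator classes imported above and the standardized X_train / y_train) that compares a single regression tree with the ensemble models under 5-fold cross-validation:

from sklearn.model_selection import cross_val_score

# Compare a single regression tree against ensemble models with 5-fold cross-validation;
# the ensembles typically show a higher mean R-squared and a smaller spread across folds
for name, model in [('DecisionTree', DecisionTreeRegressor()),
                    ('RandomForest', RandomForestRegressor()),
                    ('GradientBoosting', GradientBoostingRegressor())]:
    scores = cross_val_score(model, X_train, y_train.ravel(), cv=5, scoring='r2')
    print('%s: mean R-squared %.3f (std %.3f)' % (name, scores.mean(), scores.std()))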
