[Machine learning notes, day 22] 4.2. Linear regression case analysis + Boston house price forecast


4.2. Linear regression case analysis

Boston house price forecast

This section uses regression models built with scikit-learn to predict house prices on the "Boston house price" dataset. Similar competition datasets can be obtained from the Kaggle website: https://www.kaggle.com/datasets

1. Data description of housing prices in Boston

from sklearn.datasets import load_boston

boston = load_boston()
print(boston.DESCR)   # print the dataset description
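
Note: in recent versions of scikit-learn, load_boston was deprecated (1.0) and removed (1.2). On a new installation, a minimal sketch that loads the same data directly from the original StatLib source (this URL and loading pattern follow scikit-learn's deprecation notice) would be:

import numpy as np
import pandas as pd

# Original Boston housing data hosted at CMU StatLib; each sample spans two rows
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])   # 13 features per sample
target = raw_df.values[1::2, 2]                                      # house prices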

2. Boston area housing price data segmentation

# In scikit-learn 0.20+ train_test_split lives in model_selection (cross_validation was removed)
from sklearn.model_selection import train_test_split
import numpy as np
X = boston.data
y = boston.target

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=33,test_size = 0.25)
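
As a quick sanity check (a small optional sketch, not in the original article), a 25% test split of the 506 Boston samples should yield roughly 379 training and 127 test samples, each with 13 features:

print(X_train.shape, X_test.shape)   # expected: (379, 13) (127, 13)
print(y_train.shape, y_test.shape)   # expected: (379,) (127,)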

3. Standardized processing of training and test data

from sklearn.preprocessing import StandardScaler
ss_X = StandardScaler()
ss_y = StandardScaler()

# Scale features and target with separate scalers (StandardScaler requires 2-D input for the target)
X_train = ss_X.fit_transform(X_train)
X_test = ss_X.transform(X_test)
y_train = ss_y.fit_transform(y_train.reshape(-1, 1))
y_test = ss_y.transform(y_test.reshape(-1, 1))
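
After scaling, every feature column of the training set should have mean close to 0 and standard deviation close to 1; a quick optional check (not part of the original code) is:

import numpy as np

print(np.round(X_train.mean(axis=0), 6))   # all values close to 0
print(np.round(X_train.std(axis=0), 6))    # all values close to 1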

4. Using the simplest linear regression model, LinearRegression, and the gradient descent estimator, SGDRegressor, to predict house prices

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train,y_train)
lr_y_predict = lr.predict(X_test)

from sklearn.linear_model import SGDRegressor
sgdr = SGDRegressor()
sgdr.fit(X_train, y_train.ravel())   # SGDRegressor expects a 1-D target array
sgdr_y_predict = sgdr.predict(X_test)
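
Both estimators expose the learned weights, and their score method returns the coefficient of determination R² on held-out data; a small optional check (not part of the original tutorial) might look like this:

# Learned weights and intercept of the ordinary least-squares model
print(lr.coef_, lr.intercept_)

# R^2 of each model on the standardized test set
print(lr.score(X_test, y_test))
print(sgdr.score(X_test, y_test.ravel()))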

5. Performance evaluation

For regression tasks we cannot demand that the predicted values match the true values exactly; what we generally want is to measure how far the predictions deviate from the true values. Evaluation functions serve this purpose, and the most intuitive metric, the mean squared error (MSE), is also the objective that the linear regression model minimizes.

The calculation method of MSE is as follows:

$$\mathrm{MSE}=\frac{1}{m}\sum_{i=1}^{m}\left(y_{i}-\hat{y}_{i}\right)^{2}$$

where $y_{i}$ is the true value and $\hat{y}_{i}$ is the predicted value.
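
As a sanity check on the formula, a minimal NumPy sketch (the numbers below are made-up toy values) computes the same quantity that sklearn.metrics.mean_squared_error returns:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 2.5, 4.0, 5.1])   # toy true values
y_pred = np.array([2.8, 2.7, 4.4, 5.0])   # toy predictions

mse_manual = np.mean((y_true - y_pred) ** 2)   # direct implementation of the formula above
print(mse_manual)                              # 0.0625
print(mean_squared_error(y_true, y_pred))      # same value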

Using the MSE metric to evaluate the regression performance of the two models:

from sklearn.metrics import mean_squared_error

print('The mean squared error of the linear regression model is:',
      mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(lr_y_predict)))
print('The mean squared error of the gradient descent model is:',
      mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(sgdr_y_predict.reshape(-1, 1))))

This comparison shows that the gradient descent estimator performs somewhat worse than the analytical linear regression solution. However, when the training set is very large, the gradient method is highly efficient for both classification and regression, saving a great deal of computation time without giving up much accuracy. Following the recommendation on the scikit-learn official website, the stochastic gradient method is suggested for estimating model parameters when the dataset exceeds 100,000 samples.
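
For such large-scale settings, SGDRegressor exposes hyperparameters that control the optimization; the configuration below is only illustrative (the specific values are not from the original article):

from sklearn.linear_model import SGDRegressor

# Illustrative settings: epoch cap, stopping tolerance, and an
# inverse-scaling learning-rate schedule with initial rate eta0
sgdr_large = SGDRegressor(max_iter=1000, tol=1e-3,
                          learning_rate="invscaling", eta0=0.01)
sgdr_large.fit(X_train, y_train.ravel())

# For data that does not fit in memory, partial_fit supports mini-batch updates
# (X_batch and y_batch are placeholders for one chunk of the data):
# sgdr_large.partial_fit(X_batch, y_batch)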

Note: linear regression is the simplest and easiest-to-use regression model. Because it assumes a linear relationship between the features and the regression target, its range of application is somewhat limited; in particular, most real-world data cannot guarantee a strictly linear relationship between every feature and the target. Nevertheless, when the relationships among the features are unknown, a linear regression model can still serve as a baseline system for most data analysis tasks.

The complete code is as follows:

from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


def linearmodel():
    """
    Linear regression, SGD regression and ridge regression on the Boston dataset
    :return: None
    """

    # 1. Load dataset

    ld = load_boston()

    x_train,x_test,y_train,y_test = train_test_split(ld.data,ld.target,test_size=0.25)

    # 2. Standardized treatment

    # Eigenvalue processing
    std_x = StandardScaler()
    x_train = std_x.fit_transform(x_train)
    x_test = std_x.transform(x_test)


    # Target value processing (StandardScaler requires a 2-D array)

    std_y = StandardScaler()
    y_train = std_y.fit_transform(y_train.reshape(-1, 1))
    y_test = std_y.transform(y_test.reshape(-1, 1))

    # 3. Estimator process

    # LinearRegression
    lr = LinearRegression()

    lr.fit(x_train,y_train)

    # print(lr.coef_)

    y_lr_predict = lr.predict(x_test)

    y_lr_predict = std_y.inverse_transform(y_lr_predict)

    print("Lr Predicted value:",y_lr_predict)


    # SGDRegressor
    sgd = SGDRegressor()

    sgd.fit(x_train, y_train.ravel())   # SGDRegressor expects a 1-D target

    # print(sgd.coef_)

    y_sgd_predict = sgd.predict(x_test)

    y_sgd_predict = std_y.inverse_transform(y_sgd_predict.reshape(-1, 1))

    print("SGD predicted values:", y_sgd_predict)

    # Ridge regression with regularization

    rd = Ridge(alpha=0.01)

    rd.fit(x_train,y_train)

    y_rd_predict = rd.predict(x_test)

    y_rd_predict = std_y.inverse_transform(y_rd_predict)

    print(rd.coef_)

    # Evaluation results of the three models

    print("The mean squared error of LinearRegression is:",
          mean_squared_error(std_y.inverse_transform(y_test), y_lr_predict))

    print("The mean squared error of SGDRegressor is:",
          mean_squared_error(std_y.inverse_transform(y_test), y_sgd_predict))

    print("The mean squared error of Ridge is:",
          mean_squared_error(std_y.inverse_transform(y_test), y_rd_predict))

    return None
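
To run the whole pipeline as a script, a standard entry point can be added (this guard is not part of the original listing):

if __name__ == "__main__":
    linearmodel()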