Mathematical derivation + pure Python implementation of machine learning algorithm 1: linear regression

Keywords: Python Machine Learning

     When learning machine learning, many students jump straight into programming after only a rough look at the theory, which is commendable in spirit. But what they write is usually not the algorithm by hand; it is a direct call to a packaged library such as sklearn, which is not quite the same thing. This is not to say that calling packages is bad: in practical work and research, well-encapsulated, easy-to-use libraries bring great convenience and greatly improve the efficiency with which we implement machine learning models and algorithms. But that applies to the stage of using them, not learning them.

     I believe many ambitious students are not satisfied with using these packages without understanding the details of the models and algorithms behind them. So if you are a learner of machine learning algorithms, it is best not to reach for these encapsulated packages right away. Instead, work through the mathematics of the model and algorithm by hand according to your own understanding, and then implement the algorithm relying only on basic packages such as numpy and pandas. Only after this process should you learn to call sklearn and other machine learning libraries; by then you will better appreciate the convenience (and the fun) of calling packages. After that, move on to real data and competitions, and I believe you will become an excellent machine learning algorithm engineer.

     The theme of this machine learning series is mathematical derivation + pure numpy implementation. Let's start with the most basic model: linear regression. You are probably already quite familiar with regression, especially if you have a statistics background, so let's go straight to the mathematical derivation.

Mathematical derivation of regression analysis

     Originally I wanted to use my handwritten draft, but my handwriting is too messy, and typing formulas in Word or Markdown is too time-consuming, so here I borrow the derivation from Zhou Zhihua's machine learning textbook.

     For a single feature, least squares estimates the parameters by minimizing the squared error

$$E_{(w,b)} = \sum_{i=1}^{m} (y_i - w x_i - b)^2.$$

Setting the partial derivatives with respect to $w$ and $b$ to zero,

$$\frac{\partial E_{(w,b)}}{\partial w} = 2\Big(w \sum_{i=1}^{m} x_i^2 - \sum_{i=1}^{m} (y_i - b)\, x_i\Big) = 0, \qquad \frac{\partial E_{(w,b)}}{\partial b} = 2\Big(m b - \sum_{i=1}^{m} (y_i - w x_i)\Big) = 0,$$

yields the closed-form solutions

$$w = \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\big(\sum_{i=1}^{m} x_i\big)^2}, \qquad b = \frac{1}{m} \sum_{i=1}^{m} (y_i - w x_i),$$

where $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$.

     Extended to matrix form: absorb $b$ into the weight vector as $\hat{w} = (w; b)$ and append a column of ones to the data matrix $\mathbf{X}$. The objective becomes

$$E_{\hat{w}} = (\mathbf{y} - \mathbf{X}\hat{w})^{T} (\mathbf{y} - \mathbf{X}\hat{w}),$$

and setting its derivative to zero,

$$\frac{\partial E_{\hat{w}}}{\partial \hat{w}} = 2\,\mathbf{X}^{T} (\mathbf{X}\hat{w} - \mathbf{y}) = 0 \;\Longrightarrow\; \hat{w}^{*} = (\mathbf{X}^{T}\mathbf{X})^{-1} \mathbf{X}^{T} \mathbf{y},$$

provided $\mathbf{X}^{T}\mathbf{X}$ is full rank (otherwise regularization is needed to pick among the multiple solutions).

     The above is the derivation of parameter estimation in the linear regression model.
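     Before moving to gradient descent, the matrix-form result can be checked numerically. Here is a minimal sketch (the synthetic data and variable names are illustrative additions, not from the original article) that computes $\hat{w}^{*} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$ with numpy and compares it against np.linalg.lstsq:

import numpy as np

# Illustrative synthetic data (not from the article)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([[1.5], [-2.0], [0.7]])
y = X @ true_w + 0.1 * rng.normal(size=(100, 1))

# Absorb the bias by appending a column of ones
X1 = np.hstack([X, np.ones((X.shape[0], 1))])

# Closed-form solution; np.linalg.solve avoids forming an explicit inverse
w_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)

# Cross-check against numpy's built-in least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(np.allclose(w_hat, w_lstsq))  # True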

numpy implementation of regression analysis

     As usual, let's sort out our ideas before writing the algorithm. The main body of the regression model is relatively simple; the key is how to update the parameters by gradient descent once the MSE loss function is given. First we write the model body, the loss function, and the parameter gradients derived from that loss; then we initialize the parameters; finally we write the parameter update process based on gradient descent. We could also add cross-validation for more robust parameter estimates. The update rule we will implement is spelled out just below, and then we go straight to the code.
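     Written out (assuming, as is conventional, a 1/2 factor in the MSE loss so the gradients come out without a stray factor of 2), each iteration computes, for learning rate $\alpha$:

$$L = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2, \qquad \frac{\partial L}{\partial w} = \frac{1}{m}\mathbf{X}^{T}(\hat{\mathbf{y}} - \mathbf{y}), \qquad \frac{\partial L}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i),$$

$$w \leftarrow w - \alpha\,\frac{\partial L}{\partial w}, \qquad b \leftarrow b - \alpha\,\frac{\partial L}{\partial b}.$$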

     Regression model body:

import numpy as np

def linear_loss(X, y, w, b):
    num_train = X.shape[0]
    # Model formula
    y_hat = np.dot(X, w) + b
    # MSE loss (the 1/2 factor makes the gradients below exact)
    loss = np.sum((y_hat - y) ** 2) / (2 * num_train)
    # Partial derivatives of the loss with respect to the parameters
    dw = np.dot(X.T, (y_hat - y)) / num_train
    db = np.sum((y_hat - y)) / num_train
    return y_hat, loss, dw, db
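     As a quick sanity check on the derivatives (this snippet is an illustrative addition, not part of the original post), the analytic gradients returned by linear_loss can be compared against central finite differences on a small random problem:

# Compare analytic dw against a numerical estimate (central differences)
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.normal(size=(20, 1))
w, b = rng.normal(size=(3, 1)), 0.5

_, _, dw, db = linear_loss(X, y, w, b)

eps = 1e-6
dw_num = np.zeros_like(w)
for j in range(w.shape[0]):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    dw_num[j] = (linear_loss(X, y, w_plus, b)[1] - linear_loss(X, y, w_minus, b)[1]) / (2 * eps)

print(np.allclose(dw, dw_num, atol=1e-4))  # should print True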

 

     Parameter initialization:

def initialize_params(dims):    
    w = np.zeros((dims, 1))   
    b = 0    
    return w, b

     Model training process based on gradient descent:

def linear_train(X, y, learning_rate, epochs):
    w, b = initialize_params(X.shape[1])
    loss_list = []
    for i in range(1, epochs + 1):
        # Compute the current predictions, loss, and parameter gradients
        y_hat, loss, dw, db = linear_loss(X, y, w, b)
        loss_list.append(loss)
        # Parameter update step of gradient descent
        w += -learning_rate * dw
        b += -learning_rate * db
        # Periodically print the iteration count and loss
        if i % 10000 == 0:
            print('epoch %d loss %f' % (i, loss))

        # Save parameters
        params = {
            'w': w,
            'b': b
        }

        # Save gradients
        grads = {
            'dw': dw,
            'db': db
        }

    return loss_list, loss, params, grads

     That completes the basic implementation of the linear regression model. Let's take the diabetes dataset from sklearn as an example and run a simple training session.

     Data preparation:

from sklearn.datasets import load_diabetes
from sklearn.utils import shuffle
 
diabetes = load_diabetes() 
data = diabetes.data 
target = diabetes.target 

# Shuffle the data
X, y = shuffle(data, target, random_state=13) 
X = X.astype(np.float32)

# Simple train/test split
offset = int(X.shape[0] * 0.9) 
X_train, y_train = X[:offset], y[:offset] 
X_test, y_test = X[offset:], y[offset:] 
y_train = y_train.reshape((-1,1)) 
y_test = y_test.reshape((-1,1))
 
print('X_train=', X_train.shape) 
print('X_test=', X_test.shape) 
print('y_train=', y_train.shape) 
print('y_test=', y_test.shape)
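     With the 442-sample, 10-feature diabetes data and this 90/10 split, the printed shapes should come out as X_train= (397, 10), X_test= (45, 10), y_train= (397, 1) and y_test= (45, 1).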

     Perform training:

loss_list, loss, params, grads = linear_train(X_train, y_train, 0.001, 100000)

     View the regression model parameter values obtained from training:

print(params)

     The following defines a prediction function to predict the test set results:

def predict(X, params):   
    w = params['w']    
    b = params['b']    

    y_pred = np.dot(X, w) + b    
    return y_pred 

y_pred = predict(X_test, params) 
y_pred[:5]
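     Beyond eyeballing the first few predictions, it is easy to quantify the fit. The helpers below (mse_score and r2_score_manual are ad-hoc names introduced here, not functions from the post) compute the test-set MSE and the coefficient of determination R²:

# Ad-hoc evaluation helpers (illustrative additions)
def mse_score(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def r2_score_manual(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

print('test MSE:', mse_score(y_test, y_pred))
print('test R^2:', r2_score_manual(y_test, y_pred))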

     Use matplotlib to display the prediction results and true values:

import matplotlib.pyplot as plt 
f = X_test.dot(params['w']) + params['b'] 

plt.scatter(range(X_test.shape[0]), y_test) 
plt.plot(f, color = 'darkorange') 
plt.xlabel('X') 
plt.ylabel('y') 
plt.show()

     As the plot shows, the linear fit over all variables is not great. That is partly down to the distribution of the data itself, and partly because a simple linear model has limited fitting capacity for this data. Of course, here we only want to demonstrate the basic process of the linear regression model and are not concerned with the quality of the fit.

     Loss reduction during training:

plt.plot(loss_list, color = 'blue') 
plt.xlabel('epochs') 
plt.ylabel('loss') 
plt.show()

Encapsulate a linear regression class

     The author wraps the above process in a simple class, adding a hand-rolled cross-validation procedure for training:

import numpy as np
from sklearn.utils import shuffle
from sklearn.datasets import load_diabetes

class lr_model():    
    def __init__(self):        
        pass    

    def prepare_data(self):        
        data = load_diabetes().data        
        target = load_diabetes().target        
        X, y = shuffle(data, target, random_state=42)        
        X = X.astype(np.float32)            
        y = y.reshape((-1, 1))        
        data = np.concatenate((X, y), axis=1)        
        return data  
        
    def initialize_params(self, dims):        
        w = np.zeros((dims, 1))        
        b = 0        
        return w, b    
   
    def linear_loss(self, X, y, w, b):
        num_train = X.shape[0]
        # Model formula
        y_hat = np.dot(X, w) + b
        # MSE loss (1/2 factor so the gradients are exact)
        loss = np.sum((y_hat - y) ** 2) / (2 * num_train)
        dw = np.dot(X.T, (y_hat - y)) / num_train
        db = np.sum((y_hat - y)) / num_train
        return y_hat, loss, dw, db
   
    def linear_train(self, X, y, learning_rate, epochs):
        w, b = self.initialize_params(X.shape[1])
        for i in range(1, epochs + 1):
            y_hat, loss, dw, db = self.linear_loss(X, y, w, b)
            # Gradient descent update
            w += -learning_rate * dw
            b += -learning_rate * db
            if i % 10000 == 0:
                print('epoch %d loss %f' % (i, loss))
            params = {
                'w': w,
                'b': b
            }
            grads = {
                'dw': dw,
                'db': db
            }
        return loss, params, grads
   
    def predict(self, X, params):        
        w = params['w']        
        b = params['b']        
        y_pred = np.dot(X, w) + b        
        return y_pred    
       
    def linear_cross_validation(self, data, k, randomize=True):
        if randomize:
            # sklearn's shuffle returns a shuffled copy rather than
            # shuffling in place, so the result must be reassigned
            data = shuffle(list(data))

        slices = [data[i::k] for i in range(k)]
        for i in range(k):
            validation = slices[i]
            # All rows not in the validation slice form the training set
            train = [row for s in slices if s is not validation for row in s]
            train = np.array(train)
            validation = np.array(validation)
            yield train, validation
           
           
if __name__ == '__main__':    
    lr = lr_model()    
    data = lr.prepare_data()
  
    loss5 = []
    for train, validation in lr.linear_cross_validation(data, 5):
        X_train = train[:, :10]
        y_train = train[:, -1].reshape((-1, 1))
        X_valid = validation[:, :10]
        y_valid = validation[:, -1].reshape((-1, 1))

        loss, params, grads = lr.linear_train(X_train, y_train, 0.001, 100000)
        loss5.append(loss)
        y_pred = lr.predict(X_valid, params)
        valid_score = np.sum(((y_pred - y_valid) ** 2)) / len(X_valid)
        print('valid score is', valid_score)

    # Average final training loss over the five folds
    score = np.mean(loss5)
    print('five fold cross validation score is', score)
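     Finally, to connect back to the earlier advice about learning the internals before reaching for libraries, here is a minimal cross-check sketch (assuming the X_train/X_test split from the training example above) that fits the same data with sklearn's LinearRegression:

from sklearn.linear_model import LinearRegression

# sklearn solves the same least-squares problem in closed form
reg = LinearRegression()
reg.fit(X_train, y_train)
print('sklearn coefficients:', reg.coef_)
print('sklearn intercept:', reg.intercept_)
print('sklearn test MSE:', np.mean((reg.predict(X_test) - y_test) ** 2))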
 

That's all for this section: a simple linear regression model implemented by hand on top of numpy.
