Fundamentals of machine learning and neural networks, pytoch framework

Keywords: Machine Learning neural networks Pytorch

Fundamentals of machine learning

The essence of machine learning: using data to solve problems

Data preprocessing (important in deep learning) - > training phase - > model generation - > prediction phase

We usually choose some data as the test set, such as about 20%.

Sometimes there is an additional verification set of about 20%

  • Training phase: create a prediction model through data training and fine tune it.
  • Model generation: the prediction model can find the answer behind these data and help us solve a problem.
  • Prediction stage: complete the model evaluation through the test set, so as to understand the effectiveness of the model in the test set.

In this process, the prediction model will be continuously improved and used

Further, the steps of machine learning are as follows:

Data collection - > data preprocessing - > model selection - > Training - > evaluation - > superparameter adjustment - > prediction

loss function: an index to evaluate machine learning

Example: forecast house prices

For features such as "region", it is necessary to discretize them by one hot coding, label coding, etc.

Tag encoding is often used to convert data elements of non numeric type [string, etc.)

For example, Haidian, Tongzhou and Chaoyang correspond to values 1, 2 and 3 respectively

Machine learning often has many features. Based on these features, we need to train the weight w in the Model. The features composed of these eigenvalues are called weight matrix Weights, and there are biases.

Using linear regression to predict house prices

Violent exhaustion:
import numpy as np
import matplotlib.pyplot as plt

#MAE loss function
def MAE_loss(y,y_hat):
    return np.mean(np.abs(y_hat - y))
#MSE loss function
def MSE_loss(y,y_hat):
    return np.mean(np.square(y_hat - y))
#linear regression
def linear(x,k,b):
    y = k * x + b
    return y

if __name__ == '__main__':
    x = np.array([50,80,100,200])
    y = np.array([82,119,172,302])
    #Initialize minimum loss function value
    min_loss = float('inf')
    #Violent exhaustive method
    for k in np.arange(-2,2,0.1):
        for b in np.arange(-10,10,0.1):
            #Calculate predicted value
            y_hat = [linear(xi,k,b) for xi in list(x)]
            #Calculation loss function
            current_mae_loss = MAE_loss(y,y_hat)
            #Record the minimum k and b of the loss function
            if current_mae_loss < min_loss:
                min_loss = current_mae_loss
                best_k,best_b = k,b
    print("best k:{},best b:{}".format(best_k,best_b))

    y_hat = best_k * x + best_b


Gradient descent optimization method:
  1. Batch gradient descent
    • Use all samples at each update
    • Stable, slow convergence
  2. Random gradient descent
    • One sample is used for each update, and one sample is used to approximate all samples
    • Faster convergence, and the final solution is near the global optimal solution
  3. Mini batch gradient decrease
    • b samples are used for each update, which is a compromise method
    • Faster speed
#Batch gradient descent method
import numpy as np
import matplotlib.pyplot as plt
#Mean absolute error
def MAE_loss(y,y_hat):
    return np.mean(np.abs(y_hat - y))
#Mean square error
def MSE_loss(y,y_hat):
    return np.mean(np.square(y_hat - y))

def linear(x,k,b):
    return k * x + b

#gradient of k
def gradient_k(x,y,y_hat):
    n = len(y)
    gradient = 0
    for xi,yi,yi_hat in zip(list(x),list(y),list(y_hat)):
        gradient += (yi_hat - yi) * xi
    return gradient / n

#gradient of b
def gradient_b(y,y_hat):
    n = len(y)
    gradient = 0
    for yi,yi_hat in zip(list(y),list(y_hat)):
        gradient += (yi_hat - yi)
    return gradient / n

if __name__ == '__main__':
    #Initial data
    x = np.array([50,80,100,200])
    y = np.array([82,119,172,302])
    #Data preprocessing
    max_x = np.max(x)
    max_y = 1
    x = x / max_x
    y = y / max_y
    #Initialization parameters
    times = 1000 #Number of iterations
    min_loss = float('inf') #Minimum loss function value
    current_k = 10
    current_b = 10
    learn_rate = 0.1
    #Start iteration
    for i in range(times):
        y_hat = [linear(xi,current_k,current_b) for xi in list(x)]
        current_loss = MSE_loss(y,y_hat)
        if current_loss < min_loss:
            min_loss = current_loss
            best_k,best_b = current_k,current_b
            print('best_k :{},best_b :{}'.format(best_k,best_b))
        #Calculated gradient
        k_gradient = gradient_k(x,y,y_hat)
        b_gradient = gradient_b(y,y_hat)
        #Update weight
        current_k = current_k - learn_rate * k_gradient
        current_b = current_b - learn_rate * b_gradient

    best_k = best_k / max_x * max_y
    best_b = best_b / max_x * max_y

    x = x * max_x
    y = y * max_y


Training process:

The process of machine learning is to search w and b in the search space, so that the accuracy of the model can reach a certain standard. A training, called an iteration, aims to update the weights and variables. After continuous iteration, the parameters in the model are continuously updated. When the training is completed, the model can be used to predict the house price.

What is the regression problem? What is a classification problem?

The regression problem is actually the solution of real numbers, and the classification problem is a discrete problem.

Judgment method: output data type, discrete or continuous

What is linear regression and what is logistic regression?

Linear regression solves the regression problem. Logistic regression solves the classification problem and is a classification algorithm [using sigmod function]

Error, variance and deviation

loss = bias + variance

Expected value of error = variance of noise + variance of model predicted value + square of deviation of predicted value from real value

  • Red dot: true value of test sample
  • Orange dot: true value of test sample + noise
  • Purple point: each specific model predicts the test sample once, which corresponds to a point
  • Blue dot: average of all predicted values [expectation of predicted values]
  • Light blue circle: the dispersion of the predicted value, that is, the variance of the predicted value


The deviation reflects the error between the output of the model and the real value.

How to reduce?

  • Increase the complexity of the algorithm
    For example, linear regression increases the number of parameters to improve the complexity of the model; Neural network increases the complexity of the model by increasing the number of hidden units.
  • Optimize input features - > add more features
  • Weakening or removing existing regularization constraints = > may increase variance
  • Adjust model structure


The larger the variance, the easier it is to over fit the model.

In case of large variance and over fitting:

  • Increasing the amount of data in the training set will not affect the deviation.
  • Regularization, the model is too biased towards the regular term, which may affect the deviation
  • A variety of model training data are used, and the final model is selected by means of cross validation


In order to reduce the complexity of the model.

Challenges of machine learning

  • Training error refers to the error on the training set = > under fitting, and the model cannot obtain sufficient error on the training set
  • Test error refers to the error on the test set = > over fitting, and the gap between training error and test error is too large

Fundamentals of neural network

Using numpy to simulate forward propagation

Activation function



  • The input is a continuous real value, and the output result is between 0 and 1
  • The output result of negative infinity is 0, and the output result of positive infinity is 1

  • If the weight of the initial neural network is a random value between [0,1], it can be seen from the back-propagation algorithm that when the gradient propagates from back to front, the gradient value of each layer will be reduced to 0.25 times of the original. If the neural network has many layers, the gradient will become very small (close to 0) after multi-layer propagation, that is, the gradient disappears

  • When the network weight is initialized to the value in the (1, + ∞) interval, a gradient explosion will occur


  • y=tanh(x) is an odd function, that is, the strictly monotonic increasing curve of the image passing through the origin and passing through the first and third quadrants
  • Also known as hyperbolic tangent function, the function curve is similar to Sigmoid function and can be transformed by scaling and translation

  • Compared with sigmoid function, the mean of tanh is 0
  • tanh is easier to train than Sigmoid function and has advantages
  • It is used for normalized regression problem, and its output is in the range of [- 1,1]. Usually used with L2 loss


  • Unilateral suppression, the sparse model realized by ReLU can better mine relevant features and fit training data

  • For linear functions, ReLU has stronger expression ability, especially in depth networks

  • For the nonlinear function, because the gradient of the non negative interval of ReLU is constant, there is no gradient disappearance problem, so that the convergence rate of the model is maintained in a stable state

To sum up:

  • The output values of sigmoid and tanh functions are between (0,1) and (- 1,1) = > suitable for processing probability values; Gradient vanishing = > not suitable for deep network training
  • The effective derivative of relu is constant 1, which solves the gradient disappearance problem in deep network = > it is more suitable for deep network training

Gradient disappearance

  • When the gradient is less than 1, the error between the predicted value and the real value will decay once per propagation layer. This phenomenon is particularly obvious if sigmoid is used as the activation function in the deep model, which will lead to the stagnation of model convergence
  • When the gradient disappears, the weight update of the hidden layer close to the output layer is relatively normal because its gradient is relatively normal. However, when it is closer to the input layer, the weight update of the hidden layer close to the input layer will be slow or stagnant due to the disappearance of the gradient = > this will only be equivalent to the learning of the shallow network of the following layers during training

Why do we need to use nonlinear activation functions in neural networks?

  • If the activation function is not used, it is equivalent to the excitation function f(x) = x. at this time, the input of each layer node is the linear function of the output of the upper layer. Then, no matter how many layers the neural network has, the output is the linear combination of inputs = > which is equivalent to the effect of no hidden layer
  • The nonlinear function is introduced as the excitation function, so the expression ability of neural network will be more powerful = > it is no longer a linear combination of inputs, but can approximate almost any function

Complete neural network learning process

  1. Define the network structure (specify the size of output layer, hidden layer and output layer)

  2. Initialize model parameters

  3. Cyclic operation:

    3.1 implementation of forward communication

    3.2 calculation of loss function

    3.3 post implementation communication

    3.4 weight update

Back propagation

Function of weight matrix

  • Forward propagation: transmitting input signals
  • Back propagation: error in transmitting output

Using numpy to implement a neural network

import numpy as np
import matplotlib.pyplot as plt

if __name__ == '__main__':
    #n: Sample size; d_in: enter a dimension; h: Hidden layer dimension; d_out: output dimension
    n,d_in,h,d_out = 64,1000,100,10
    #Randomly generated data
    x = np.random.randn(n,d_in)  # n line d_in column with standard normal distribution
    y = np.random.randn(n,d_out) # n line d_out column with standard normal distribution
    #Random initialization weight
    w1  = np.random.randn(d_in,h) #Enter layer to hidden layer weights
    w2  = np.random.randn(h,d_out) #Hide layer to output layer weights
    #Set learning rate
    learn_rate = 1e-6
    #Number of iterations
    times = 200
    for i in range(times):
        #Forward propagation calculation predicted value y
        temp =
        temp_relu = np.maximum(temp,0) #Compare the size of two array s element by element. If it is less than 0, it is 0
        y_pred =
        #Calculation loss function
        loss = np.square(y_pred - y).sum() / n
        #Back propagation
        grad_y_pred = 2.0 * (y_pred - y)
        #Calculate w2 gradient
        grad_w2 =
        grad_temp_relu =
        grad_temp = grad_temp_relu.copy()
        #relu activation function, less than 0 is 0
        grad_temp[temp < 0] = 0
        #Calculate w1 gradient
        grad_w1 =
        #Update weight
        w1 -= learn_rate * grad_w1
        w2 -= learn_rate * grad_w2

Using numpy to complete boston house price prediction based on Neural Network

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.utils import shuffle, resample

# relu function
def Relu(x):
    res = np.where(x < 0,0,x)
    return res
# Define loss function
def MSE_loss(y, y_hat):
    return np.mean(np.square(y_hat - y))
# Define linear regression function
def Linear(X, W1, b1):
    y = + b1
    return y

if __name__ == '__main__':
    #Data loading
    data = load_boston()
    X_ = data['data']
    y = data['target']
    #Convert y to matrix form
    y = y.reshape(y.shape[0],1) #shape[0] read the length of dimension 0 reshape: modify to the specified dimension size
    #Data normalization axis = 0: compress rows, calculate the mean value of each column, and return a matrix of 1*n
    X_ = (X_ - np.mean(X_,axis=0)) / np.std(X_,axis=0)
        Initialize network parameters
        Define hidden layer dimensions n_hidden,w1,b1,w2,b2
    n_features = X_.shape[1] #Gets the dimension size of the column, that is, the number of features
    n_hidden = 10
    w1 = np.random.randn(n_features, n_hidden)
    b1 = np.zeros(n_hidden)
    w2 = np.random.randn(n_hidden, 1)
    b2 = np.zeros(1)
    # Set learning rate
    learning_rate = 1e-6
    # 5000 iterations
    for t in range(5000):
        # Forward propagation, calculate the predicted value Y (linear - > relu - > linear)
        l1 = Linear(X_,w1,b1)
        s1 = Relu(l1)
        y_pred = Linear(s1,w2,b2)
        # Calculate the loss function and output the loss of each epoch
        loss = MSE_loss(y,y_pred)
        # Back propagation, calculate the gradient of w1 and w2 based on loss
        grad_y_pred = 2 *(y_pred - y)
        grad_w2 =
        grad_temp_relu =
        grad_temp_relu[l1 < 0] = 0
        grad_w1 =
        # Update the weight and update W1, W2, B1 and B2
        w1 = w1 - learning_rate * grad_w1
        w2 = w2 - learning_rate * grad_w2
    # Get the final w1, w2
    print('w1={} \n w2={}'.format(w1, w2))

Binary classification using PyTorch custom neural network

from sklearn.datasets import make_moons
import torch
import numpy as np
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import accuracy_score

#Define the network model and inherit nn.Module
class Net(nn.Module):
    def __init__(self):
        #Define the first FC layer and use Linear transformation
        self.fc1 = nn.Linear(2,100)
        #Define the second FC layer and use Linear transformation
        self.fc2 = nn.Linear(100,2)
    #Forward propagation
    def forward(self,x):
        #Layer 1 output
        x = self.fc1(x)
        #Active layer
        x = torch.tanh(x)
        #Output layer
        x = self.fc2(x)
        return x
    #Prediction results
    def predict(self,x):
        pred = self.forward(x)
        ans = []
        for t in pred:
            if t[0] > t[1]:
        return torch.tensor(ans)

# Predict and convert to numpy type
def predict(x):
    x = torch.from_numpy(x).type(torch.FloatTensor)
    ans = model.predict(x)
    return ans.numpy()

# Draw two classification decision surface
def plot_decision_boundary(pred_func, X, y):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Calculation decision surface
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # np.c_ Connecting two matrices by row is to add the two matrices left and right
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Draw classification decision surface
    plt.contourf(xx, yy, Z,
    # Draw sample points
    plt.scatter(X[:, 0], X[:, 1], c=y,

if __name__ == '__main__':
    # Using make_moon built-in generation model randomly generates binary data and 200 samples
    X,y = make_moons(200,noise=0.2)
    cm ='RdYlBu')
    #X[:,0]: take the 0th data from all arrays X[:,1]: take the 1st data from all arrays
    #Arrays are converted to tensors, and they share memory
    X = torch.from_numpy(X).type(torch.FloatTensor)
    y = torch.from_numpy(y).type(torch.LongTensor)
    #Initialization model
    model = Net()
    #Define evaluation criteria
    criterion = nn.CrossEntropyLoss()
    #Define optimizer
    print(f'\nParameters: {np.sum([param.numel() for param in model.parameters()])}')
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    #Number of iterations
    times = 1000
    #Store loss per iteration
    losses = []
    for i in range(times):
        #Predict the input x
        y_pred = model.forward(X)
        #The loss function is obtained
        loss = criterion(y_pred,y)
        #Gradient before emptying
        #Calculated gradient
        #Adjust weight
    print(accuracy_score(model.predict(X), y))
    plot_decision_boundary(lambda x: predict(x), X.numpy(), y.numpy())

Output results:

Posted by hyngvesson on Wed, 08 Sep 2021 00:38:35 -0700