A series of artificial intelligence experiments (VI) -- Python implementation of gradient checking

Keywords: Python, Machine Learning, AI, neural networks, Deep Learning

When building a neural network in practice, forward propagation is easy to implement and usually correct; back propagation is harder to implement and is where bugs tend to appear. For projects that require high accuracy, gradient checking is therefore particularly important.

Principle of gradient checking

The definition of derivative (gradient) in mathematics is
$$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2\varepsilon}$$
We want to verify the gradient $\frac{\partial J}{\partial \theta}$ computed by back propagation. It can be approximated in another way that uses only forward propagation: evaluate $J(\theta + \varepsilon)$ and $J(\theta - \varepsilon)$, plug them into the formula above to obtain an approximation of $\frac{\partial J}{\partial \theta}$, and check whether it agrees with the value computed by back propagation.
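For a single scalar parameter, the two-sided difference can be computed directly. A minimal sketch, using the made-up example $J(\theta) = \theta^2$ so the exact derivative $2\theta$ is known:

def numerical_gradient(J, theta, epsilon=1e-7):
    # Two-sided difference approximation of dJ/dtheta
    return (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)

J = lambda theta: theta ** 2
print(numerical_gradient(J, 3.0))  # approximately 6.0, matching the exact derivative 2 * theta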

Python implementation of gradient checking

We build a simple 3-layer neural network: the input has 4 features, the two hidden layers have 5 and 3 units (both with ReLU activations), and the output layer is a single sigmoid unit. The test case below generates random inputs, labels, and parameters for this network.

import numpy as np


def gradient_check_n_test_case():
    # Random inputs, labels, and parameters for the 3-layer test network
    np.random.seed(1)
    x = np.random.randn(4,3)
    y = np.array([1, 1, 0])
    W1 = np.random.randn(5,4) 
    b1 = np.random.randn(5,1) 
    W2 = np.random.randn(3,5) 
    b2 = np.random.randn(3,1) 
    W3 = np.random.randn(1,3) 
    b3 = np.random.randn(1,1) 
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2,
                  "W3": W3,
                  "b3": b3}

    
    return x, y, parameters

Next, implement its forward propagation and back propagation (two errors are deliberately introduced in the back propagation so that the check has something to catch):

def forward_propagation_n(X, Y, parameters):
    m = X.shape[1]
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # RELU -> RELU -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    # Binary cross-entropy cost
    logprobs = np.multiply(-np.log(A3), Y) + np.multiply(-np.log(1 - A3), 1 - Y)
    cost = 1. / m * np.sum(logprobs)

    cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)

    return cost, cache


def backward_propagation_n(X, Y, cache):
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1. / m * np.dot(dZ2, A1.T) * 2  # Error 1: deliberate extra factor of 2
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T)
    db1 = 4. / m * np.sum(dZ1, axis=1, keepdims=True)  # Error 2: deliberate 4. / m instead of 1. / m

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                 "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                 "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients
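The activation helpers relu and sigmoid used above are assumed to come from the repository's utility code; a minimal sketch of what they compute:

def sigmoid(z):
    # Element-wise logistic function
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    # Element-wise rectified linear unit
    return np.maximum(0, z)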

We store the gradients approximated via forward propagation in a one-dimensional column vector gradapprox, where each element corresponds to the gradient of one parameter. It is compared against grad, the back-propagation gradients flattened into the same shape, to judge whether the error is too large.
The formula for calculating the comparison is
$$difference = \frac{\left\| grad - gradapprox \right\|_2}{\left\| grad \right\|_2 + \left\| gradapprox \right\|_2}$$
NumPy's np.linalg.norm function is used to compute the norms.

def gradient_check_n(parameters, gradients, X, Y, epsilon=1e-7):
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox with the two-sided difference, one parameter at a time
    for i in range(num_parameters):
        thetaplus = np.copy(parameters_values)
        thetaplus[i][0] = thetaplus[i][0] + epsilon
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))

        thetaminus = np.copy(parameters_values)
        thetaminus[i][0] = thetaminus[i][0] - epsilon
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))

        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)

    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    difference = numerator / denominator

    if difference < 2e-7:
        print("backward propagation is right! difference = " + str(difference))
    else:
        print("backward propagation is wrong! difference = " + str(difference))

    return difference
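gradient_check_n relies on three conversion helpers, dictionary_to_vector, gradients_to_vector, and vector_to_dictionary, which are defined in the repository's utility code. A possible sketch of their behavior, assuming the key order and parameter shapes of the test network above:

PARAM_KEYS = ["W1", "b1", "W2", "b2", "W3", "b3"]
PARAM_SHAPES = {"W1": (5, 4), "b1": (5, 1), "W2": (3, 5),
                "b2": (3, 1), "W3": (1, 3), "b3": (1, 1)}


def dictionary_to_vector(parameters):
    # Flatten every parameter into a column and stack them into one (n, 1) vector
    vectors = [parameters[key].reshape(-1, 1) for key in PARAM_KEYS]
    keys = [key for key in PARAM_KEYS for _ in range(parameters[key].size)]
    return np.concatenate(vectors, axis=0), keys


def gradients_to_vector(gradients):
    # Flatten the dW/db entries in the same order as the parameters
    vectors = [gradients["d" + key].reshape(-1, 1) for key in PARAM_KEYS]
    return np.concatenate(vectors, axis=0)


def vector_to_dictionary(theta):
    # Reshape the flat (n, 1) vector back into the original parameter dictionary
    parameters = {}
    start = 0
    for key in PARAM_KEYS:
        rows, cols = PARAM_SHAPES[key]
        parameters[key] = theta[start:start + rows * cols].reshape(rows, cols)
        start += rows * cols
    return parameters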

Run the check as follows. With the two deliberate errors still present in the back propagation, the reported difference should come out well above the 2e-7 threshold, so the check flags the back propagation as wrong:

X, Y, parameters = gradient_check_n_test_case()

cost, cache = forward_propagation_n(X, Y, parameters)
gradients = backward_propagation_n(X, Y, cache)
difference = gradient_check_n(parameters, gradients, X, Y)


For the complete code of this experiment, see:
https://github.com/PPPerry/AI_projects/tree/main/6.gradient_check

Tips

  1. Gradient checking is very slow: approximating the gradient numerically is computationally expensive. Turn it on only when you need to verify that the code is correct, and turn it off again once the back propagation is confirmed to be working (see the sketch after this list).
  2. Gradient checking does not work together with dropout; disable dropout while running the check and re-enable it afterwards.
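As a minimal sketch of tip 1 (the flag name is made up and not part of the original code), the check can be gated behind a debug switch so it only runs while verifying the implementation:

RUN_GRADIENT_CHECK = True  # set to False once back propagation has been verified

X, Y, parameters = gradient_check_n_test_case()
cost, cache = forward_propagation_n(X, Y, parameters)
gradients = backward_propagation_n(X, Y, cache)
if RUN_GRADIENT_CHECK:
    gradient_check_n(parameters, gradients, X, Y)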

Previous AI series experiments:

A series of artificial intelligence experiments (I) -- a single-layer neural network for binary classification of cat images
A series of artificial intelligence experiments (II) -- a shallow neural network for distinguishing regions of different colors
A series of artificial intelligence experiments (III) -- a deep neural network for binary classification of cat images
A series of artificial intelligence experiments (IV) -- comparison of neural network parameter initialization methods (Xavier initialization and He initialization)
A series of artificial intelligence experiments (V) -- regularization methods: Python implementation of L2 regularization and dropout
