Part I: Deep learning and machine learning
1. Machine learning tasks
Recall the steps involved in completing a machine learning task:
(1) First, preprocess the data. The key steps are unifying the data format, applying any necessary transformations, and splitting the data into a training set and a test set.
(2) Next, select the model and set the loss function, the optimizer, and the corresponding hyperparameters (you can also use the loss function and optimizer built into a model from a machine learning library such as sklearn).
(3) Finally, fit the model to the training set and evaluate its performance on the validation/test set.
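The three steps above can be sketched in a few lines. This is only a minimal illustration, not a recommended pipeline: the synthetic regression data, the split sizes, the learning rate, and the number of epochs are all assumptions, and plain PyTorch tensors are used so that no extra library is needed.

```python
import torch

torch.manual_seed(0)

# Synthetic regression data (assumed for illustration): y = 3*x0 - 2*x1 + noise
X = torch.randn(200, 2) * 5.0 + 1.0
y = X @ torch.tensor([3.0, -2.0]) + 0.1 * torch.randn(200)

# (1) Preprocessing: split into training and test sets, then standardize
#     the features using statistics computed on the training set only.
perm = torch.randperm(200)
train_idx, test_idx = perm[:160], perm[160:]
mean, std = X[train_idx].mean(dim=0), X[train_idx].std(dim=0)
X_train, X_test = (X[train_idx] - mean) / std, (X[test_idx] - mean) / std
y_train, y_test = y[train_idx], y[test_idx]

# (2) Model, loss function, optimizer, and hyperparameters (lr, epochs).
model = torch.nn.Linear(2, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# (3) Fit the model on the training set ...
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X_train).squeeze(1), y_train)
    loss.backward()
    optimizer.step()

# ... and measure performance on the held-out test set.
with torch.no_grad():
    test_mse = loss_fn(model(X_test).squeeze(1), y_test).item()
print(f"test MSE: {test_mse:.4f}")
```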
2. Differences between ML and DL
(1) Data loading
The deep learning workflow is similar to that of machine learning, but the code implementation differs considerably.
(1) First, deep learning requires a large number of samples, so loading all the data at once may exceed the available memory;
(2) Second, batch training strategies, which improve model performance, require a fixed number of samples to be fed to the model at each step, so deep learning needs a specially designed data loading mechanism.
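As a rough sketch of such a data loading design, the example below pairs a Dataset that produces each sample on demand with a DataLoader that groups samples into fixed-size batches. The class name LazySquares and its contents are hypothetical; in a real project, __getitem__ would typically read one sample from disk instead of computing it.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class LazySquares(Dataset):
    """Hypothetical dataset: sample idx is (idx, idx**2), built only when requested."""

    def __init__(self, n_samples):
        self.n_samples = n_samples

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        x = torch.tensor([float(idx)])
        return x, x ** 2


# Read a fixed number of samples (batch_size) at a time
loader = DataLoader(LazySquares(10), batch_size=4, shuffle=False)

batch_sizes = []
for batch_x, batch_y in loader:
    batch_sizes.append(batch_x.shape[0])
    print(batch_x.flatten().tolist(), batch_y.flatten().tolist())

print(batch_sizes)  # [4, 4, 2]: the last batch holds the remaining samples
```

Passing drop_last=True to DataLoader would discard the incomplete final batch when every batch must have exactly the same size.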
(2) Model implementation
In model implementation, deep learning and machine learning are also very different:

Deep neural networks have many layers, and some of those layers (convolution layers, pooling layers, batch normalization layers, LSTM layers, etc.) implement specific functions. Deep neural networks therefore often need to be built "layer by layer", or by first defining modules that implement specific functions and then assembling them. This way of building models keeps them flexible, but it also places new demands on the code implementation.

Next comes the setup of the loss function and the optimizer. This part is similar to the classic machine learning implementation. However, because the model structure is so flexible, the loss function and the optimizer must guarantee that backpropagation works on a user-defined model structure.
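To make this concrete, here is a small sketch of a model assembled from predefined layers, followed by one backpropagation step through it. The class name SmallConvNet, the layer sizes, and the random 28x28 inputs are assumptions chosen purely for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)


class SmallConvNet(nn.Module):
    """Hypothetical model assembled from predefined layers."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolution layer
            nn.BatchNorm2d(8),                          # batch normalization layer
            nn.ReLU(),
            nn.MaxPool2d(2),                            # pooling layer: 28x28 -> 14x14
        )
        self.classifier = nn.Linear(8 * 14 * 14, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))


model = SmallConvNet()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Random stand-in data: 4 single-channel 28x28 images with integer class labels
images = torch.randn(4, 1, 28, 28)
targets = torch.randint(0, 10, (4,))

logits = model(images)
loss = loss_fn(logits, targets)
optimizer.zero_grad()
loss.backward()   # autograd backpropagates through the user-defined structure
optimizer.step()
print(logits.shape, loss.item())
```

The loss function and optimizer need no knowledge of the model's internals: autograd records whatever forward() does, and the optimizer only sees model.parameters().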
(3) Training process
Once these steps are complete, you can start training.
(1) GPUs accelerate computation through parallelism, but a program runs on the CPU by default. In the code, the model and data therefore need to be "put on" the GPU for computation, and the loss function and optimizer must be able to work on the GPU as well.
(2) If multiple GPUs are used for training, the allocation and merging of the model and data across them must also be considered.
(3) When some metrics are computed afterwards, the data needs to be "put back" on the CPU. All of this involves a series of GPU configuration steps and operations.
(4) The most important feature of training and validation in deep learning is that data is read in batches: each batch is put on the GPU for training, the loss is propagated back through the earlier layers of the network, and the optimizer adjusts the network parameters. This requires the modules to work together. After training/validation, model performance is computed according to the chosen metrics.
After the above steps, a deep learning task is complete.
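The points above can be put together in a minimal single-device training and validation loop. Everything specific here is an assumption for illustration: the synthetic classification data, the network size, the Adam optimizer, and the hyperparameters. The code falls back to the CPU when no GPU is available, and multi-GPU training is left out.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Use the GPU when available; the program runs on the CPU otherwise
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic binary classification data (assumed): label is 1 when the features sum > 0
X = torch.randn(256, 4)
y = (X.sum(dim=1) > 0).long()
train_loader = DataLoader(TensorDataset(X[:192], y[:192]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[192:], y[192:]), batch_size=32)

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2)).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(30):
    model.train()
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)  # "put" the batch on the device
        optimizer.zero_grad()                  # clear gradients from the previous batch
        loss = loss_fn(model(xb), yb)
        loss.backward()                        # backpropagate the loss
        optimizer.step()                       # update the network parameters

# Validation: no gradients needed, predictions moved back to the CPU for the metric
model.eval()
correct = 0
with torch.no_grad():
    for xb, yb in val_loader:
        preds = model(xb.to(device)).argmax(dim=1).cpu()
        correct += (preds == yb).sum().item()
val_accuracy = correct / len(val_loader.dataset)
print(f"validation accuracy: {val_accuracy:.3f}")
```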
Part II: PyTorch
1. Learning Resources
(1) Awesome-pytorch-list: currently has 12K stars on GitHub; it covers NLP, CV, common libraries, paper implementations, and other PyTorch projects.
(2) PyTorch Official Documentation: the official documentation is very comprehensive.
(3) pytorch-handbook: 14.8K stars on GitHub.
(4) PyTorch Official Community: here you can communicate with the people who develop PyTorch.
In addition, there are many other resources for learning PyTorch, such as Bilibili, Stack Overflow, and so on.
2. Automatic differentiation mechanism
In PyTorch, the core of all neural networks is the autograd package. The autograd package provides automatic differentiation for all operations on tensors. It is a define-by-run framework, which means that backpropagation is determined by how the code runs, and each iteration can be different.
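A small sketch of what define-by-run means in practice: the helper repeated_double below (a hypothetical function, not part of PyTorch) runs a loop whose length is chosen at call time, so the graph that autograd records, and therefore the gradient, differs between calls.

```python
import torch


def repeated_double(x, steps):
    # The number of multiplications is decided at run time,
    # so the recorded graph differs from call to call.
    y = x
    for _ in range(steps):
        y = y * 2
    return y


x = torch.tensor(1.5, requires_grad=True)

repeated_double(x, 3).backward()
grad_three_steps = x.grad.item()  # d(8x)/dx = 8
x.grad.zero_()

repeated_double(x, 5).backward()
grad_five_steps = x.grad.item()   # d(32x)/dx = 32

print(grad_three_steps, grad_five_steps)  # 8.0 32.0
```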
2.1 torch.Tensor class
torch.Tensor is the core class of this package. If its .requires_grad attribute is set to True, autograd tracks all operations on the tensor. When the computation is complete, all gradients can be computed automatically by calling .backward(). The gradients of this tensor are accumulated into its .grad attribute.
Note: in y.backward(), if y is a scalar, no argument needs to be passed to backward(); if y is not a scalar, a tensor with the same shape as y must be passed in.
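The difference between the scalar and non-scalar cases can be seen in a short sketch. The input values and the weights tensor are arbitrary choices made for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Scalar output: backward() needs no argument.
(x ** 2).sum().backward()
scalar_grad = x.grad.clone()   # d(sum(x^2))/dx = 2x
print(scalar_grad)

x.grad.zero_()

# Non-scalar output: pass a tensor with the same shape as y.
y = x ** 2
weights = torch.tensor([1.0, 0.1, 0.01])
y.backward(weights)
vector_grad = x.grad.clone()   # each dy_i/dx_i = 2*x_i, scaled by weights_i
print(vector_grad)
```

Calling y.backward() with no argument on the non-scalar y would raise a RuntimeError, because autograd can only create the gradient implicitly for scalar outputs.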
To stop a tensor from tracking history, call its .detach() method, which separates it from the computation history and prevents future computations on it from being tracked. To prevent history tracking (and the memory it uses) for a whole code block, wrap the block in with torch.no_grad():. This is particularly useful when evaluating a model, because the model may have trainable parameters with requires_grad=True whose gradients are not needed during evaluation.
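Both ways of switching off tracking can be compared side by side in a minimal sketch; the tensor w stands in for a trainable parameter.

```python
import torch

w = torch.tensor([2.0, 3.0], requires_grad=True)

tracked = w * 2
detached = tracked.detach()   # same values, but cut off from the graph
print(tracked.requires_grad, detached.requires_grad)  # True False

with torch.no_grad():
    untracked = w * 2         # nothing inside this block is recorded
print(untracked.requires_grad)  # False
print(w.requires_grad)          # True: w itself is unchanged
```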
2.2 Function Class
Another class is important for the implementation of autograd: Function. Tensor and Function are interconnected and build an acyclic graph that encodes the complete computation history. Each tensor has a .grad_fn attribute that references the Function that created the tensor (unless the tensor was created manually by the user, in which case its grad_fn is None).
To compute derivatives, call .backward() on a Tensor. If the Tensor is a scalar (that is, it holds a single element), backward() needs no arguments; if it has more elements, you must specify a gradient argument, which is a tensor of matching shape.
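A brief sketch of how grad_fn links tensors into a graph. The tensor names are arbitrary, and the exact node names printed (such as SumBackward0) are what current PyTorch versions are expected to show rather than a guaranteed interface.

```python
import torch

leaf = torch.ones(2, requires_grad=True)  # created directly by the user
result = (leaf * 3).sum()                 # produced by operations

print(leaf.grad_fn)    # None: nothing created this tensor
print(leaf.is_leaf)    # True
print(result.grad_fn)  # the Function that produced result
# Each Function points back to the Functions of its inputs,
# which is how the acyclic graph is encoded.
print(result.grad_fn.next_functions)
```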
2.3 Jacobian Matrix
Mathematically, if there is a vector-valued function $\vec{y}=f(\vec{x})$, then the gradient of $\vec{y}$ with respect to $\vec{x}$ is a Jacobian matrix:
$$J=\left(\begin{array}{ccc}\frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}\end{array}\right)$$
The torch.autograd package is used to compute vector-Jacobian products. For example, if $v$ is the gradient of a scalar function $l = g(\vec{y})$:
$$v=\left(\begin{array}{lll}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)$$
From the chain rule, we can get:
$$v J=\left(\begin{array}{lll}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)\left(\begin{array}{ccc}\frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}\end{array}\right)=\left(\begin{array}{lll}\frac{\partial l}{\partial x_{1}} & \cdots & \frac{\partial l}{\partial x_{n}}\end{array}\right)$$
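The identity can be checked numerically. In the sketch below, the function f (mapping three inputs to two outputs) and the vector v are made-up examples; torch.autograd.functional.jacobian builds the full matrix J only for comparison, while y.backward(v) computes the product $vJ$ directly.

```python
import torch
from torch.autograd.functional import jacobian


def f(x):
    # Hypothetical vector function R^3 -> R^2
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.5, -1.0])  # plays the role of dl/dy

J = jacobian(f, x.detach())    # full Jacobian, shape (2, 3)

y = f(x)
y.backward(v)                  # autograd computes v J without forming J

print(J)
print(v @ J)
print(x.grad)                  # matches v @ J
```

For this f, the Jacobian at x = (1, 2, 3) has rows (2, 1, 0) and (0, 1, 6), so $vJ = (1, -0.5, -6)$.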
Note: gradients accumulate during backpropagation. Each backward pass adds its gradients to those already stored in .grad, so the gradients generally need to be zeroed before each backward pass.
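A minimal sketch of this accumulation, using an arbitrary two-element tensor:

```python
import torch

x = torch.ones(2, requires_grad=True)

x.sum().backward()
first = x.grad.clone()        # tensor([1., 1.])

x.sum().backward()            # no zeroing: the new gradient is added
accumulated = x.grad.clone()  # tensor([2., 2.])

x.grad.zero_()                # reset before the next backward pass
x.sum().backward()
after_reset = x.grad.clone()  # tensor([1., 1.])

print(first, accumulated, after_reset)
```

In a training loop the same reset is usually done for all parameters at once with optimizer.zero_grad().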
2.4 Code Example
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 15 21:07:32 2021

@author: 86493
"""
import torch

# requires_grad=True makes autograd track the computation history
x = torch.ones(2, 2, requires_grad=True)
print(x)
print('-' * 50)

# Element-wise power of a tensor
y = x ** 2
print(y)
# y is the result of an operation, so it has a grad_fn attribute
print(y.grad_fn)
print('-' * 50)

z = y * y * 3
out = z.mean()  # average of all elements
print("z:", z)
print("out:", out)
print('-' * 50)

# requires_grad defaults to False
a = torch.randn(2, 2)
print("Initial a Values are:\n", a)
a = ((a * 3) / (a - 1))
print("After operation a Values are:\n", a)
print(a.requires_grad)  # defaults to False
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)  # b is the result of an operation, so it has a grad_fn attribute
print('-' * 50)

# ==================================
# Computing gradients
out.backward()  # out is a scalar
print(x.grad)   # derivative d(out)/dx
print('-' * 50)

# Backpropagating again would add one more gradient to x.grad,
# because gradients accumulate
# out2.backward()
# print(x.grad)

out3 = x.sum()
# Gradients are generally zeroed before backpropagation (to prevent accumulation)
x.grad.data.zero_()
out3.backward()
print(x.grad)
print('-' * 50)

# Vector-Jacobian product
x = torch.randn(3, requires_grad=True)
print(x)
y = x * 2
i = 0
# norm() is the L2 norm of this tensor
while y.data.norm() < 1000:
    y = y * 2
    i = i + 1
print("y:\n", y, '\n')
print("i:", i)
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)
print("x.grad:\n", x.grad)

# Wrap a code block in with torch.no_grad() to stop autograd
# from tracking tensors that have requires_grad=True
print(x.requires_grad)
print((x ** 2).requires_grad)
with torch.no_grad():
    print((x ** 2).requires_grad)
print('-' * 50)

# To modify a tensor's values without autograd recording the change
# (so that backpropagation is unaffected), operate on tensor.data
x = torch.ones(1, requires_grad=True)
print("x: ", x)
print(x.data)  # still a tensor
# but already independent of the computation graph
print(x.data.requires_grad)
y = 2 * x
# Only the values change; nothing is recorded in the computation graph,
# so gradient propagation is unaffected
x.data *= 100
y.backward()
# Changing .data also changes the values of the tensor itself
print(x)
print(x.grad)
The results are (a and x are drawn at random, so their values differ from run to run):
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
--------------------------------------------------
tensor([[1., 1.],
        [1., 1.]], grad_fn=<PowBackward0>)
<PowBackward0 object at 0x000001D74AEFBA30>
--------------------------------------------------
z: tensor([[3., 3.],
        [3., 3.]], grad_fn=<MulBackward0>)
out: tensor(3., grad_fn=<MeanBackward0>)
--------------------------------------------------
Initial a Values are:
 tensor([[ 0.1064, -1.0084],
        [-0.2516, -0.4749]])
After operation a Values are:
 tensor([[-0.3570,  1.5063],
        [ 0.6030,  0.9660]])
False
True
<SumBackward0 object at 0x000001D745593FD0>
--------------------------------------------------
tensor([[3., 3.],
        [3., 3.]])
--------------------------------------------------
tensor([[1., 1.],
        [1., 1.]])
--------------------------------------------------
tensor([0.8706, 1.1828, 0.8192], requires_grad=True)
y:
 tensor([ 891.5447, 1211.1826,  838.8481], grad_fn=<MulBackward0>)

i: 9
x.grad:
 tensor([1.0240e+02, 1.0240e+03, 1.0240e-01])
True
True
False
--------------------------------------------------
x:  tensor([1.], requires_grad=True)
tensor([1.])
False
tensor([100.], requires_grad=True)
tensor([2.])