PyTorch and dynamic learning rate application - with code

Keywords: neural networks, PyTorch, deep learning

I keep blogging to share what I gain from study and work:

  1. to keep notes for myself
  2. to record and summarize knowledge points and deepen my understanding
  3. to give some help to people who need it, so they step into fewer pits and move forward faster

I try to lay the content out clearly, with both text and figures.
If something is wrong or unclear, you can leave a message in the comment area.
If the content helps you, you are welcome to like it 👍, collect it ⭐, and leave a comment 📝.
The platform gives no reward for this, but it makes me very happy and keeps up my enthusiasm for blogging.

TORCH.OPTIM

torch.optim is a package that implements various optimization algorithms. Most commonly used methods are already supported.

How to use the optimizer

To use torch.optim, you must construct an optimizer object that will save the current state and update the parameters according to the calculated gradient.

Construct optimizer

To construct an optimizer, you have to give it an iterable containing the parameters (all of which should be Tensors) to optimize. You can then specify optimizer-specific options such as the learning rate, weight decay, and so on.

Note: if you need to move the model to the GPU via .cuda(), do so before constructing the optimizer for it. The parameters of the model after .cuda() are different objects from those before the call.
In general, when constructing and using an optimizer, you should make sure the optimized parameters live in a consistent location.

For example:

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr=0.0001)
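
For example, a minimal sketch of the ordering described in the note above (the model and its shapes are made up for illustration): move the model to the device first, then construct the optimizer from the moved parameters.

import torch
from torch import nn, optim

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2)   # illustrative model
model.to(device)           # move the model (equivalent to .cuda() on a GPU) BEFORE building the optimizer

# the optimizer now references the parameters that actually live on the device
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)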

Specify learning rate per layer

In optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9) above, the optimizer is constructed with a single learning rate for all parameters of the model.
When you need to specify a different learning rate per layer, you can use the optim.Adam([var1, var2], lr=0.0001) form, where var1 and var2 are dicts, each of which must contain a 'params' key:

optimizer = optim.Adam([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3},
], lr=1e-2)

The default learning rate of 1e-2 will be used for the parameters of model.base, and 1e-3 will be used for the parameters of model.classifier.
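
As a quick check (a minimal sketch; it assumes the optimizer above and a model with base and classifier submodules), you can inspect optimizer.param_groups to confirm the learning rate assigned to each group:

for i, group in enumerate(optimizer.param_groups):
    print(f"group {i}: lr = {group['lr']}, tensors = {len(group['params'])}")
# expected: group 0 keeps the default lr of 1e-2, group 1 uses 1e-3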

Optimization step

All optimizers implement a step() method to update parameters.

Once the gradient is calculated using backward(), you can call this function.

loss.backward()
optimizer.step()
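
Put together, a minimal sketch of the usual pattern inside a training loop (dataloader, model, and loss_fn are placeholders, not names from this post):

for input, target in dataloader:          # placeholder DataLoader
    optimizer.zero_grad()                 # clear gradients accumulated by the previous step
    loss = loss_fn(model(input), target)  # forward pass and loss with placeholder model / loss_fn
    loss.backward()                       # compute gradients
    optimizer.step()                      # update the parameters from the current gradients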

Base class torch.optim.Optimizer

torch.optim.Optimizer(params, defaults) is the base class of all optimizers.

Methods

The Optimizer base class implements the following main methods:

Optimizer.add_param_group(param_group): adds a parameter group to the optimizer's param_groups. This is very useful when fine-tuning a pre-trained network: frozen layers can be made trainable and added to the optimizer as training progresses (see the sketch after this list). param_group is a dict.

Optimizer.load_state_dict(state_dict): loads the optimizer state. state_dict is a dict; it should be the return value of a previous call to optimizer.state_dict().

Optimizer.state_dict(): returns the state of the optimizer as a dict.

Optimizer.step(): performs a single optimization step to update the parameters.

Optimizer.zero_grad(set_to_none=False): sets the gradients of all optimized parameters to zero. With set_to_none=True, the gradients are set to None instead of 0, which generally gives a lower memory footprint and a modest performance improvement. However, it changes some behaviors. For example: 1. when the user accesses a gradient and performs manual operations on it, a None attribute behaves differently from a tensor of all zeros; 2. if the user calls zero_grad(set_to_none=True) followed by backpropagation, .grad is guaranteed to be None for parameters that did not receive a gradient; 3. torch.optim optimizers behave differently depending on whether a gradient is 0 or None (in one case the step is executed with a zero gradient, in the other the step is skipped entirely).
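
To illustrate add_param_group together with state_dict()/load_state_dict(), here is a hedged sketch of the fine-tuning scenario mentioned above; the two-part model and the file name are invented for the example.

import torch
from torch import nn, optim

class TwoPart(nn.Module):                  # hypothetical model: frozen base + trainable classifier
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(8, 8)
        self.classifier = nn.Linear(8, 2)

model = TwoPart()
for p in model.base.parameters():          # freeze the base at first
    p.requires_grad = False

# start by optimizing only the classifier
optimizer = optim.SGD(model.classifier.parameters(), lr=1e-2, momentum=0.9)

# later, unfreeze the base and add it as a new parameter group with its own lr
for p in model.base.parameters():
    p.requires_grad = True
optimizer.add_param_group({'params': model.base.parameters(), 'lr': 1e-4})

# save and restore the optimizer state (the file name is illustrative)
torch.save(optimizer.state_dict(), 'optimizer.pth')
optimizer.load_state_dict(torch.load('optimizer.pth'))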

Algorithms

SGD

torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)

Adam

torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
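
For example, a quick sketch of constructing both with some non-default arguments (model is assumed to be any nn.Module, and the values are only illustrative):

# SGD with Nesterov momentum and L2 weight decay
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                            weight_decay=5e-4, nesterov=True)

# Adam with explicit betas and AMSGrad enabled
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999),
                             eps=1e-08, amsgrad=True)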

More

https://pytorch.org/docs/stable/optim.html#algorithms

Adjust learning rate

torch.optim.lr_scheduler provides several methods to adjust the learning rate based on the number of epochs.

Learning rate scheduling should be applied after the optimizer's update, for example:

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import ExponentialLR

model = torch.nn.Linear(2, 2)              # any nn.Module works here
optimizer = SGD(model.parameters(), lr=0.1)
scheduler = ExponentialLR(optimizer, gamma=0.9)

for epoch in range(20):
    for input, target in dataset:          # dataset and loss_fn are placeholders
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()                   # update the parameters first ...
    scheduler.step()                       # ... then step the scheduler once per epoch

In general, a scheduler can be applied with the following template:

scheduler = ...
for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()

Note: before PyTorch 1.1.0, the learning rate scheduler was expected to be called before the optimizer's update; 1.1.0 changed this behavior in a BC-breaking way. If you call the scheduler (scheduler.step()) before the optimizer's update (optimizer.step()), the first value of the learning rate schedule is skipped. If you cannot reproduce your results after upgrading to PyTorch 1.1.0, check whether scheduler.step() is being called at the wrong time.

Dynamic learning rate application

You can build on the "Adjust learning rate" section above and change the strategy by swapping in different learning rate schedulers.

The following code was run in Jupyter.

import torch
import torchvision
from torchvision.datasets import CIFAR10
from torchvision import transforms
from torch import optim
import torch.nn as nn
import torch.nn.functional as F

import numpy as np
import matplotlib.pyplot as plt

Check the device

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

Data download and processing

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 4

trainset = CIFAR10(root='./CIFAR10', train=True,
                   download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = CIFAR10(root='./CIFAR10', train=False,
                  download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Define network

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
net.to(device)

Define optimizer and train

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
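
ExponentialLR is only one possible strategy. As a hedged sketch, any other scheduler from torch.optim.lr_scheduler could be substituted on that line; the step sizes and milestones below are made-up values for illustration.

# decay the lr by a factor of 0.1 every 10 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# decay the lr by a factor of 0.1 at epochs 10 and 15
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 15], gamma=0.1)

# reduce the lr when a monitored metric (e.g. validation loss) stops improving;
# note this one is stepped with scheduler.step(val_loss) rather than scheduler.step()
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)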

lrs = []
steps = []
for epoch in range(20):

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)
        
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0
    
    steps.append(epoch)
    lrs.append(scheduler.get_last_lr()[0])  # record the lr used during this epoch
    scheduler.step()                        # decay the lr for the next epoch

print('Finished Training')

Visualize the learning rate

plt.plot(steps, lrs)
plt.xlabel("epoch")
plt.ylabel("lr")
plt.title("learning rate's curve changes as epoch goes on!")
plt.show()
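
As a sanity check on the plot, ExponentialLR multiplies the learning rate by gamma after every epoch, so the recorded values should follow lr_epoch = 0.001 * 0.9**epoch; a minimal sketch to compute the expected curve:

expected = [0.001 * 0.9 ** e for e in range(20)]
print(expected[0], expected[-1])   # 0.001 at epoch 0, roughly 0.000135 at epoch 19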

Reference: https://pytorch.org/docs/stable/optim.html

If the content helps you, or you think it is well written,
🏳️‍🌈 you are welcome to like it 👍, collect it ⭐, and leave a comment 📝.
If you have any questions, please leave a message in the comment area.
