Nine tips for training fast neural networks with PyTorch

Keywords: PyTorch, Deep Learning

This ultimate guide, going from simple to complex, walks you step by step through every level of speed-up, all the way to the most PITA modifications, so you can squeeze the most out of your network.

In fact, your model may still be stuck in the stone age. Chances are you're still training at 32-bit precision, or (gasp) perhaps only on a single GPU. There may be 99 acceleration guides out there, but you've probably only seen one (yes, this one). This ultimate guide will take you through every level of speed-up, step by step.

This guide goes from simple to complex, all the way to the most PITA modifications you can make to get the most out of your network. The examples include Python code and the corresponding flags for the PyTorch Lightning Trainer, in case you don't want to write the code yourself!

Who is this guide for? Anyone working on non-trivial deep learning models in PyTorch, such as industrial researchers, PhD students, and academics. These models may take days, or even weeks or months, to train.

This article covers the following (from easiest to hardest):

Using DataLoaders
Number of workers in DataLoaders
Batch size
Gradient accumulation
Retained computation graphs
Moving to a single GPU
16-bit mixed-precision training
Moving to multiple GPUs (model replication)
Moving to multiple GPU nodes (8+ GPUs)
Thoughts and tips on model acceleration

PyTorch Lightning

The various optimizations discussed in this article are available in PyTorch Lightning: https://github.com/williamFalcon/pytorch-lightning?source=post_page

Lightning is a light wrapper on top of PyTorch. It automates training for researchers while leaving the key model components fully under the researcher's control.

Refer to this tutorial for a more powerful example: https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/single_gpu_node_template.py?source=post_page

Lightning uses the latest best practices and minimizes the places where you can make mistakes.

Here is a Lightning model defined on MNIST that can be used with the Trainer: https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/lightning_module_template.py?source=post_page

from pytorch_lightning import Trainer
model = LightningModule(...)
trainer = Trainer()
trainer.fit(model)

1. DataLoader

This is probably the easiest place to gain speed. Gone are the days of saving h5py or numpy files to speed up data loading. Loading image data with PyTorch's DataLoader is very simple: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html?source=post_page

For NLP data, please refer to TorchText: https://torchtext.readthedocs.io/en/latest/datasets.html?source=post_page

from torch.utils.data import DataLoader
from torchvision.datasets import MNIST

dataset = MNIST(root=self.hparams.data_root, train=train, download=True)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    x, y = batch
    model.training_step(x, y)
    ...
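
As a rough sketch of the TorchText route mentioned above, assuming the legacy Field/BucketIterator API (torchtext ≤ 0.8) and using the bundled IMDB dataset purely as a placeholder example:

from torchtext import data, datasets

# define how raw text and labels are processed
TEXT = data.Field(lower=True, batch_first=True)
LABEL = data.LabelField()

# load an example dataset and build vocabularies
train_set, test_set = datasets.IMDB.splits(TEXT, LABEL)
TEXT.build_vocab(train_set, max_size=25000)
LABEL.build_vocab(train_set)

# BucketIterator groups similar-length sequences to reduce padding
train_iter, test_iter = data.BucketIterator.splits(
    (train_set, test_set), batch_size=32)

for batch in train_iter:
    x, y = batch.text, batch.label
    ...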

In Lightning, you don't need to specify a training loop at all; just define the dataloaders and the Trainer will call them when needed.
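
A minimal sketch of what that looks like inside a LightningModule (CoolModel is just a placeholder name, and the exact hook name depends on your Lightning version; recent versions call it train_dataloader, older ones tng_dataloader):

class CoolModel(LightningModule):
    ...

    def train_dataloader(self):
        # Lightning calls this whenever it needs the training data
        dataset = MNIST(root=self.hparams.data_root, train=True, download=True)
        return DataLoader(dataset, batch_size=32, shuffle=True)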

2. Number of workers in DataLoaders

The second trick is to load batches in parallel, so that many batches are being loaded at once rather than one at a time.

# slow
loader = DataLoader(dataset, batch_size=32, shuffle=True)
# fast (use 10 workers)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=10)

3. Batch size

Before starting the next optimization step, increase the batch size to the maximum allowed by CPU memory or GPU memory.

The next section will focus on reducing the memory footprint so that you can continue to increase the batch size.

Remember, you will probably need to adjust your learning rate as well. A common rule of thumb: if you double the batch size, double the learning rate.
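
For example, a minimal sketch of that linear-scaling heuristic (the numbers are arbitrary, and the rule is a rough guideline rather than a guarantee):

# learning rate that was tuned for the old batch size
base_lr = 0.1
base_batch_size = 32

# after increasing the batch size, scale the learning rate by the same factor
new_batch_size = 64
new_lr = base_lr * (new_batch_size / base_batch_size)  # 0.2

optimizer = torch.optim.SGD(model.parameters(), lr=new_lr)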

4. Gradient accumulation

If you have already maxed out your compute and the batch size is still too low (say, 8), you need to simulate a larger batch size so that gradient descent gets an accurate gradient estimate.

Suppose you want an effective batch size of 128. Then run 16 forward and backward passes (with batch size 8) before taking a single optimizer step.

# clear gradients from the last step
optimizer.zero_grad()

# 16 accumulated gradient steps
scaled_loss = 0
for accumulated_step_i in range(16):
    out = model.forward()
    loss = some_loss(out, y)
    loss.backward()

    scaled_loss += loss.item()

# update weights after 16 accumulated steps. effective batch = 8*16
optimizer.step()

# loss is now scaled up by the number of accumulated batches
actual_loss = scaled_loss / 16

In Lightning, all of this is done automatically. Just set the flag:

trainer = Trainer(accumulate_grad_batches=16)
trainer.fit(model)

5. Retained computation graphs

Blowing up your memory is easy: just keep a pointer to the computation graph around, for example by saving the loss for logging.

losses = []

...
losses.append(loss)

print(f'current loss: {losses[-1]}')

The problem with the above is that each saved loss still holds a copy of the whole graph. In this case, release it by calling .item().

# bad
losses.append(loss)

# good
losses.append(loss.item())

Lightning takes special care never to keep a copy of the graph. Example: https://github.com/williamFalcon/pytorch-lightning/blob/master/pytorch_lightning/models/trainer.py#L812

6. Single GPU training

Once you have completed the previous steps, it's time to move to GPU training. Training on a GPU parallelizes the math across many GPU cores. How much speed-up you get depends on the type of GPU you use. A 2080Ti is recommended for personal use, a V100 for corporate use.

It may feel like a lot at first, but you only need to do two things: 1) move your model to the GPU, and 2) move the data to the GPU every time you run data through the model.

# put model on GPU
model.cuda(0)

# put data on gpu (cuda on a variable returns a cuda copy)
x = x.cuda(0)

# runs on GPU now
model(x)

If you use Lightning, you don't need to change your code at all. Just set the flag:

# ask lightning to use gpu 0 for training
trainer = Trainer(gpus=[0])
trainer.fit(model)

During GPU training, pay attention to limiting the amount of data transferred between the CPU and the GPU.

# expensive
x = x.cuda(0)

# very expensive
x = x.cpu()
x = x.cuda(0)

For example, if you run out of memory, do not move data back to the CPU to save memory. Try to optimize your code in other ways, or spread the model across GPUs, before resorting to this.

Also pay attention to operations that force the GPUs to synchronize, such as clearing the memory cache.

# really bad idea. Stops all the GPUs until they all catch up
torch.cuda.empty_cache()

If you use Lightning, the only place this could be an issue is when defining your Lightning module, and Lightning takes special care to avoid this kind of mistake.

7. 16-bit precision

16-bit precision effectively halves the memory footprint. Most models are trained with 32-bit precision, but recent research has shown that models also train well at 16-bit precision. Mixed precision means running some parts of the computation in 16 bits while keeping things like the weights in 32 bits.

To use 16-bit precision in PyTorch, first install NVIDIA's apex library and make these changes to your model.

from apex import amp

# enable 16-bit on the model and the optimizer
model, optimizers = amp.initialize(model, optimizers, opt_level='O2')

# when doing .backward, let amp do it so it can scale the loss
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

The amp package handles most of this for you. It will even scale the loss if your gradients explode or go to zero.

In Lightning, using 16-bit precision is trivial. You don't need to modify your model or do any of the above.

trainer = Trainer(amp_level='O2', use_amp=True)
trainer.fit(model)

8. Moving to multiple GPUs

Now things get interesting. There are 3 (maybe more?) ways to train on multiple GPUs.

Split-batch training

A) Copy the model onto each GPU; B) Give each GPU a portion of the batch.
The first method is called split-batch training. The model is copied to each GPU, and each GPU receives a portion of the batch.

# copy the model to each GPU and give a fourth of the batch to each
model = DataParallel(model, device_ids=[0, 1, 2, 3])

# out gathers the outputs from all 4 GPUs (back on GPU 0)
out = model(x.cuda(0))

In Lightning, you just tell the Trainer which GPUs to use; you don't need to do any of the above.

# ask lightning to use 4 GPUs for training
trainer = Trainer(gpus=[0, 1, 2, 3])
trainer.fit(model)

Split-model training

Place different parts of the model on different GPUs; the batch moves through them in order.
Sometimes the model may be too large to fit in memory at all. For example, a sequence-to-sequence model with an encoder and a decoder may need 20 GB of memory just to generate the output. In this case, we want to put the encoder and the decoder on separate GPUs.

# each model is sooo big we can't fit both in memory
encoder_rnn.cuda(0)
decoder_rnn.cuda(1)

# run input through encoder on GPU 0
out = encoder_rnn(x.cuda(0))

# run output through decoder on the next GPU
out = decoder_rnn(out.cuda(1))

# normally we want to bring all outputs back to GPU 0
out = out.cuda(0)

For this type of training, you don't need to assign any GPUs to the Lightning Trainer. Instead, just move the submodules to the correct GPUs yourself inside the LightningModule:

class MyModule(LightningModule):

    def __init__(self):
        super().__init__()
        self.encoder = RNN(...)
        self.decoder = RNN(...)

    def forward(self, x):
        # models won't be moved after the first forward because
        # they are already on the correct GPUs
        self.encoder.cuda(0)
        self.decoder.cuda(1)

        out = self.encoder(x)
        out = self.decoder(out.cuda(1))
        return out

# don't pass GPUs to trainer
model = MyModule()
trainer = Trainer()
trainer.fit(model)

Mixing the two approaches
In the above example, the encoder and the decoder can still each benefit from parallelizing their own operations. Now we can get more creative.

# change these lines
self.encoder = RNN(...)
self.decoder = RNN(...)

# to these
# now each RNN is based on a different gpu set
self.encoder = DataParallel(self.encoder, device_ids=[0, 1, 2, 3])
self.decoder = DataParallel(self.decoder, device_ids=[4, 5, 6, 7])

# in forward...
out = self.encoder(x.cuda(0))

# notice the input goes on the first gpu of the decoder's device list
out = self.decoder(out.cuda(4))  # <--- the 4 here

Caveats when using multiple GPUs

If the model is already on the device, model.cuda() won't do anything.

Always put the input on the first device in the device list (see the sketch after these caveats).

Transferring data across devices is expensive; do it only as a last resort.

The optimizers and gradients are stored on GPU 0, so GPU 0 is likely to use much more memory than the other GPUs.
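
A small sketch of the second caveat, assuming you want to parallelize over GPUs 2 and 3 only (the device numbers are arbitrary): both the model and the input batch should live on the first device of the device list.

# parallelize over GPUs 2 and 3 only
device_ids = [2, 3]

# the model's parameters must live on the first device in the list...
model = DataParallel(model.cuda(device_ids[0]), device_ids=device_ids)

# ...and the input batch goes there as well
x = x.cuda(device_ids[0])
out = model(x)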

9. Multi-node GPU training

Every GPU on every machine gets a copy of the model. Each machine is assigned a portion of the data and trains only on that portion. The machines synchronize gradients with each other.

At this point you can train ImageNet in a matter of minutes! It is not as hard as it sounds, but it does require more knowledge about compute clusters. These instructions assume you are using SLURM on your cluster.

PyTorch implements multi-node training by replicating the model on every GPU across the nodes and synchronizing the gradients. Each model is initialized independently on its own GPU and, in essence, trains independently on a partition of the data, except that they all receive gradient updates from all the other models.

At a high level:

Initialize a copy of the model on each GPU (make sure to set the seed so that every copy initializes to the same weights, otherwise training will fail); see the seeding sketch after this list.

Divide the dataset into subsets. Each GPU trains only on its own subset.

On .backward(), all copies receive a copy of every model's gradients. This is the only time the models communicate with each other.
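
For step 1, a minimal seeding sketch (the seed value is arbitrary; the point is that every process uses the same one before building the model):

import numpy as np
import torch

seed = 1234  # any fixed value, as long as every process uses the same one
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# now every copy of the model starts from identical weights
model = MyModel()  # MyModel is a placeholder for your own model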

PyTorch has a nice abstraction for this called DistributedDataParallel, which does the work for you. To use DDP (distributed data parallel), you need to do four things:

def tng_dataloader():
    d = MNIST()

    # 4: Add distributed sampler
    # sampler sends a portion of tng data to each machine
    dist_sampler = DistributedSampler(d)
    dataloader = DataLoader(d, shuffle=False, sampler=dist_sampler)
    return dataloader

def main_process_entrypoint(gpu_nb):
    # 2: set up connections between all gpus across all machines
    # all gpus connect to a single GPU "root"
    # the default uses env://
    world = nb_gpus * nb_nodes
    dist.init_process_group("nccl", rank=gpu_nb, world_size=world)

    # 3: wrap the model in DDP
    torch.cuda.set_device(gpu_nb)
    model.cuda(gpu_nb)
    model = DistributedDataParallel(model, device_ids=[gpu_nb])

    # train your model now...

if __name__ == '__main__':
    # 1: spawn number of processes
    # your cluster will call main for each machine
    mp.spawn(main_process_entrypoint, nprocs=8)

The PyTorch team has a detailed practical tutorial on this: https://github.com/pytorch/examples/blob/master/imagenet/main.py?source=post_page

In Lightning, however, this is built in. Just set the node count flag and let Lightning handle the rest.

# train on 1024 gpus across 128 nodes
trainer = Trainer(nb_gpu_nodes=128, gpus=[0, 1, 2, 3, 4, 5, 6, 7])

Lightning also comes with a SlurmCluster manager to help you submit the right SLURM job details. Example: https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/multi_node_cluster_template.py#L103-L134

10. Bonus! Faster multi-GPU training on a single node

It turns out that DistributedDataParallel is much faster than DataParallel, because its only communication is gradient synchronization. So it is a good idea to replace DataParallel with DistributedDataParallel even when training on a single machine.

In Lightning, this is easily done by setting distributed_backend to 'ddp' (distributed data parallel) and setting the number of GPUs.

# train on 4 gpus on the same machine MUCH faster than DataParallel
trainer = Trainer(distributed_backend='ddp', gpus=[0, 1, 2, 3])

Thoughts and tips on model acceleration
How should you think about finding bottlenecks? Break the work down into several parts:

First, make sure data loading is not the bottleneck. To do this, use one of the data loading solutions described above; if none of them fits your case, preprocess offline and cache into a high-performance data store such as h5py.
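
A rough sketch of that offline-cache idea (the file name, array names, and shapes are made up for illustration):

import h5py
import torch
from torch.utils.data import Dataset

# one-off preprocessing: write all samples into a single h5 file
with h5py.File('cache.h5', 'w') as f:
    f.create_dataset('images', data=preprocessed_images)  # e.g. (N, 3, 224, 224) float32
    f.create_dataset('labels', data=labels)               # e.g. (N,) int64

class CachedDataset(Dataset):
    def __init__(self, path='cache.h5'):
        # note: with num_workers > 0 you may want to open the file
        # lazily inside __getitem__, once per worker
        self.file = h5py.File(path, 'r')

    def __len__(self):
        return len(self.file['labels'])

    def __getitem__(self, idx):
        x = torch.from_numpy(self.file['images'][idx])
        y = int(self.file['labels'][idx])
        return x, y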

Next, look at what you do during the training step. Make sure the forward pass is fast, avoid redundant computation, and minimize data transfer between CPU and GPU. Finally, avoid anything that slows the GPU down (covered throughout this guide).
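
One way to see where the time actually goes is PyTorch's built-in autograd profiler; a quick sketch (model, x, y, and some_loss stand in for your own training step):

import torch

# profile a single training step to spot slow ops and hidden CPU-GPU syncs
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    out = model(x)
    loss = some_loss(out, y)
    loss.backward()

print(prof.key_averages().table(sort_by='cuda_time_total'))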

Next, maximize the batch size. In general, GPU memory limits the batch size; beyond that point, it becomes a matter of distributing across GPUs while minimizing latency and using large effective batches (on some datasets, you may reach an effective batch size of 8000+ across multiple GPUs).

But handle large batches with care. Consult the literature for your specific problem and see how others have dealt with it!

Original link: https://towardsdatascience.com/9-tips-for-training-lightning-fast-neural-networks-in-pytorch-8e63a502f565

