Train your Faster-RCNN target detection model using pytorch

Keywords: github network xml Windows

Reference the training tutorial of Mask-RCNN instance split model:

  1. Pyrtorch Official ask-RCNN Instance Split Model Training Tutorial: TORCHVISION OBJECT DETECTION FINETUNING TUTORIAL
  2. Chinese translation of the official Mask-RCNN training tutorial: Hand-on training for your Mask R-CNN image instance segmentation model (official PyTorch tutorial)

Faster-RCNN target detection model training can be achieved by slightly modifying the training of MSK-RCNN instance segmentation model.

Related pages:

  1. torchvision comes with image classification, semantic segmentation, target detection, instance segmentation, key point detection, video classification model: TORCHVISION.MODELS
  2. Tochvision Github project address:

1. Preparations

In addition to installing pytorch and torchvision, you also need to install API pycotools for COCO
How to install pycocotools on a windows system: Install pycocotools under Windows

Import related packages and modules:

import torch
import os
import numpy as np
import cv2
import matplotlib.pyplot as plt
from torchvision import datasets, transforms
from PIL import Image
from xml.dom.minidom import parse
%matplotlib inline

2. Define datasets

I use a target detection dataset that I label using labelme and convert to voc format with two tags (mark_type_1 and mark_type_2), that is, num_classes=3
The directory structure of my voc dataset is as follows:

Examples of xml labels under the Annotations folder are:


The label contains a category (mark_type_1) Two bbox es

Define the dataset:

class MarkDataset(
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "JPEGImages"))))
        self.bbox_xml = list(sorted(os.listdir(os.path.join(root, "Annotations"))))
    def __getitem__(self, idx):
        # load images and bbox
        img_path = os.path.join(self.root, "JPEGImages", self.imgs[idx])
        bbox_xml_path = os.path.join(self.root, "Annotations", self.bbox_xml[idx])
        img ="RGB")        
        # Read file, VOC format dataset label is xml format file
        dom = parse(bbox_xml_path)
        # Get Document Element Object
        data = dom.documentElement
        # Get objects
        objects = data.getElementsByTagName('object')        
        # get bounding box coordinates
        boxes = []
        labels = []
        for object_ in objects:
            # Get the contents of the label
            name = object_.getElementsByTagName('name')[0].childNodes[0].nodeValue  # Is label, mark_type_1 or mark_type_2
            labels.append([-1]))  # Background label is 0, mark_type_1 and mark_type_2 labels are 1 and 2, respectively
            bndbox = object_.getElementsByTagName('bndbox')[0]
            xmin = np.float(bndbox.getElementsByTagName('xmin')[0].childNodes[0].nodeValue)
            ymin = np.float(bndbox.getElementsByTagName('ymin')[0].childNodes[0].nodeValue)
            xmax = np.float(bndbox.getElementsByTagName('xmax')[0].childNodes[0].nodeValue)
            ymax = np.float(bndbox.getElementsByTagName('ymax')[0].childNodes[0].nodeValue)
            boxes.append([xmin, ymin, xmax, ymax])        
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.as_tensor(labels, dtype=torch.int64)        
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((len(objects),), dtype=torch.int64)
        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        # Since you are training a target detection network, there is no target [masks] = masks in the tutorial
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd
        if self.transforms is not None:
            # Note that target (including bbox) is also transformed\enhanced here, which is different from transforms from torchvision import
            # Https:// are examples of target transformations when RandomHorizontalFlip
            img, target = self.transforms(img, target)
        return img, target
    def __len__(self):
        return len(self.imgs)

3. Define the model

There are two ways to modify torchvision's default target detection model: the first is to use a pre-trained model and finetuning fine-tune after modifying the last layer of the network; the second is to replace the backbone network in the model as needed, such as replacing ResNet with MobileNet.
The specific use of these two methods can refer to the official tutorials at the beginning of the most article as well as the official tutorial translations, where I choose the first method.
Defining a model can be simple to use:

torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False, progress=True, num_classes=3, pretrained_backbone=True)

You can also refer to the tutorial:

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
def get_object_detection_model(num_classes):
    # load an object detection model pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    # replace the classifier with a new one, that has num_classes which is user-defined
    num_classes = 3  # 3 class (mark_type_1,mark_type_2) + background
    # get the number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

4. Data Enhancement

The image needs to be enhanced before it can be input to the network.It is important to note that since the Faster R-CNN model itself can handle normalization (which is normalized by default using the mean and standard deviation of ImageNet) and scale changes, there is no need to do mean/std normalization or image scaling here.
Since transforms from torchvision import can only enhance pictures, but can not change label labels for pictures at the same time, we chose to use some of the encapsulated functions in the torchvision Github project for model training and testing:

Among,,,, coco_Eval.pyWe'll use it to download these files.I placed these files next to the Faster-RCNN Target Detection Model Training.ipynb script I was writing

View, you can see that it contains an example of a target (bbox) transformation when the image is horizontally flipped (RandomHorizontalFlip):

class RandomHorizontalFlip(object):
    def __init__(self, prob):
        self.prob = prob

    def __call__(self, image, target):
        if random.random() < self.prob:
            height, width = image.shape[-2:]
            image = image.flip(-1)
            bbox = target["boxes"]
            bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
            target["boxes"] = bbox
            if "masks" in target:
                target["masks"] = target["masks"].flip(-1)
            if "keypoints" in target:
                keypoints = target["keypoints"]
                keypoints = _flip_coco_person_keypoints(keypoints, width)
                target["keypoints"] = keypoints
        return image, target

Write the corresponding data enhancement functions:

import utils
import transforms as T
from engine import train_one_epoch, evaluate
# utils, transforms, engine were just,,
def get_transform(train):
    transforms = []
    # converts the image, a PIL image, into a PyTorch Tensor
    if train:
        # during training, randomly flip the training images
        # and ground-truth for data augmentation
        # 50% chance of flipping horizontally
    return T.Compose(transforms)

5. Training Model

So far, the datasets, models, and data enhancements have all been written.After model initialization, optimizer, and learning rate adjustment strategies are selected, training can begin.After each epoch training is completed, we also evaluate the performance of the model on the test set.

from engine import train_one_epoch, evaluate
import utils

root = r'Dataset Path'

# train on the GPU or on the CPU, if a GPU is not available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# 3 classes, mark_type_1,mark_type_2,background
num_classes = 3
# use our dataset and defined transformations
dataset = MarkDataset(root, get_transform(train=True))
dataset_test = MarkDataset(root, get_transform(train=False))

# split the dataset in train and test set
# My dataset has 492 diagrams, almost training validation 4:1
indices = torch.randperm(len(dataset)).tolist()
dataset =, indices[:-100])
dataset_test =, indices[-100:])

# define training and validation data loaders
# Num_when training models in jupyter notebook The workers parameter can only be 0, otherwise an error will occur, which is commented out here
data_loader =
    dataset, batch_size=2, shuffle=True, # num_workers=4,

data_loader_test =
    dataset_test, batch_size=2, shuffle=False, # num_workers=4,

# get the model using our helper function
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False, progress=True, num_classes=num_classes, pretrained_backbone=True)  # Or get_object_detection_model(num_classes)

# move model to the right device

# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]

optimizer = torch.optim.SGD(params, lr=0.0003,
                            momentum=0.9, weight_decay=0.0005)

# and a learning rate scheduler
# cos learning rate
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=1, T_mult=2)

# let's train it for   epochs
num_epochs = 31

for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    # Engine.pyTrain_ofOne_The epoch function takes both images and targets. to(device)
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=50)

    # update the learning rate

    # evaluate on the test dataset    
    evaluate(model, data_loader_test, device=device)    

print("That's it!")

You can see that the learning rate of the first epoch is not set at 0.0003, but increases gradually from 0 because of the following reasons:Engine.pyTrain_ofOne_In the epoch function, the first epoch uses warmup's learning rate:

if epoch == 0:
        warmup_factor = 1. / 1000
        warmup_iters = min(1000, len(data_loader) - 1)

        lr_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)

Additionally, since the bbox area in my dataset is large, both AP and AR are -1.000 when area= small

Finally, save the model:, r'Save Path\modelname.pkl')

6. View the effect

Draw bbox with opencv:

def showbbox(model, img):
    # The img entered is a tensor in the 0-1 range        
    with torch.no_grad():
        prediction Like:
        [{'boxes': tensor([[1492.6672,  238.4670, 1765.5385,  315.0320],
        [ 887.1390,  256.8106, 1154.6687,  330.2953]], device='cuda:0'), 
        'labels': tensor([1, 1], device='cuda:0'), 
        'scores': tensor([1.0000, 1.0000], device='cuda:0')}]
        prediction = model([])
    img = img.permute(1,2,0)  # C,H,W_H,W,C, for drawing
    img = (img * 255).byte().data.cpu()  # * 255, float to 0-255
    img = np.array(img)  # tensor → ndarray
    for i in range(prediction[0]['boxes'].cpu().shape[0]):
        xmin = round(prediction[0]['boxes'][i][0].item())
        ymin = round(prediction[0]['boxes'][i][1].item())
        xmax = round(prediction[0]['boxes'][i][2].item())
        ymax = round(prediction[0]['boxes'][i][3].item())
        label = prediction[0]['labels'][i].item()
        if label == 1:
            cv2.rectangle(img, (xmin, ymin), (xmax, ymax), (255, 0, 0), thickness=2)
            cv2.putText(img, 'mark_type_1', (xmin, ymin), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 0, 0),
        elif label == 2:
            cv2.rectangle(img, (xmin, ymin), (xmax, ymax), (0, 255, 0), thickness=2)
            cv2.putText(img, 'mark_type_2', (xmin, ymin), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0),

View the effect:

model = torch.load(r'Save Path\modelname.pkl')
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

img, _ = dataset_test[0] 
showbbox(model, img)

Posted by somedude on Sat, 23 May 2020 18:56:12 -0700