Deep Learning from Beginner to Mastery -- the MTCNN Face Detection Algorithm

Keywords: deep learning, image recognition, object detection, CNN

Let's look at the results first:

(figure: MTCNN detection output)

Since the MTCNN algorithm came out in 2016, it has been very popular in industry. Recently, I tried to reproduce the paper's code.

Main idea

Cascade network

This paper proposes a multi-task cascaded convolutional neural network which, as shown in the figure, detects faces with three networks: P-Net, R-Net, and O-Net.
Algorithm steps:

  1. Feed the image into the P network, which pre-selects candidate boxes from its output feature map.
  2. Pass the P network's candidate boxes to the R network for further filtering.
  3. Pass the R network's surviving boxes to the O network for further filtering.
  4. The O network makes the final judgment and outputs the face boxes.

The steps above run inside an image-pyramid framework: to detect faces at different scales, the input image is rescaled repeatedly and each scale is fed into the network. This improves the model's accuracy but also increases the amount of computation.
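Schematically, the cascade reduces to three filtering stages. The sketch below only illustrates the control flow; detect_pnet, refine_rnet, and refine_onet are hypothetical helper names standing in for each stage's inference plus NMS, not functions from this post.

# Minimal sketch of the cascade control flow; detect_pnet / refine_rnet /
# refine_onet are hypothetical helpers, not code from the original post.
def mtcnn_detect(image):
    boxes = detect_pnet(image)         # stage 1: proposals over the image pyramid
    if len(boxes) == 0:
        return boxes
    boxes = refine_rnet(image, boxes)  # stage 2: re-score and filter the proposals
    if len(boxes) == 0:
        return boxes
    return refine_onet(image, boxes)   # stage 3: final judgment and output boxes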
P network:
Since P-Net cannot know how many faces an image contains, it is a fully convolutional network that accepts images of any size; the channel dimension carries the output features, the image pyramid scales the input (down to a shortest side of 12 pixels), and the convolution stride is 1, so no candidate position is missed.
The 12x12 receptive field sets the smallest face the network can detect: in effect it slides a fixed-size window over the image and emits a fixed-size output for each 12x12 crop.
Within each 12x12 crop, features are extracted by several stacked small-kernel convolution layers, which are equivalent in receptive field to a single large-kernel layer but abstract more strongly, extract finer features, use fewer parameters, and run faster. Offsets (rather than absolute coordinates) are regressed, which keeps the computation simple.
Different loss functions are set for the different tasks, which makes the learning more targeted. Adding the facial-landmark task raises the complexity of the learning problem and clarifies its direction: each additional loss sharpens the network's feature extraction and helps the model optimize.
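As a concrete illustration of the multi-task loss idea, the sketch below sums a BCE classification loss and a SmoothL1 box-regression loss with task weights. The weights and random tensors are illustrative only, but the same pattern appears in the trainer code later in this post.

import torch
from torch import nn

# Multi-task loss: classification (BCE on sigmoid outputs) + box regression (SmoothL1)
cls_loss_fn, offset_loss_fn = nn.BCELoss(), nn.SmoothL1Loss()
pred_conf = torch.sigmoid(torch.randn(8, 1))              # predicted confidences
gt_conf = torch.randint(0, 2, (8, 1)).float()             # ground-truth labels
pred_off, gt_off = torch.randn(8, 4), torch.randn(8, 4)   # predicted / target offsets
loss = 1.0 * cls_loss_fn(pred_conf, gt_conf) + 0.5 * offset_loss_fn(pred_off, gt_off)
print(loss)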
R network:
Again, stacked small-kernel convolution layers match the receptive field of a single large-kernel layer.
Since the input is a fixed 24 x 24 image, the final features can be extracted with fully connected layers (equivalent to a convolution kernel of the full spatial size).
O network:
Again, stacked small-kernel convolution layers match the receptive field of a single large-kernel layer.
Since the input is a fixed 48 x 48 image, the final features can be extracted with fully connected layers (equivalent to a convolution kernel of the full spatial size).
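The "same receptive field" claim is easy to verify with the standard receptive-field recurrence; this helper is mine, not code from the original post.

# Receptive field of a conv stack: each layer widens the window by (k-1) input strides.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5 -> same window as one 5x5 conv
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7 -> same window as one 7x7 conv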

Image pyramid

A picture may contain several targets of different sizes, and the model cannot detect targets that are too small or too large, so multi-scale information must be introduced. In MTCNN the image scaling factor is about 0.703.
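A minimal sketch of how the scale list can be built (the helper name and the 12-pixel stopping rule follow the description above; this is not code from the original post):

# Shrink by the ~0.703 factor until the shorter side would fall below P-Net's 12-px input.
def pyramid_scales(min_side, factor=0.703, min_face=12):
    scales, scale = [], 1.0
    while min_side * scale >= min_face:
        scales.append(scale)
        scale *= factor
    return scales

print(pyramid_scales(500))  # 1.0, 0.703, 0.494, ... down to the 12-px limit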

In addition, two supporting algorithms deserve mention: IoU and NMS.

IoU algorithm

Sample requirements for the datasets generated for MTCNN:

IoU range     Sample type    Label
0 ~ 0.3       non-face       confidence 0 (does not participate in the offset loss)
0.3 ~ 0.4     landmark       buffer zone to prevent misjudgment
0.4 ~ 0.65    part face      confidence 2 (does not participate in the confidence loss)
0.65 ~ 1      face           confidence 1

IoU is a standard for measuring detection accuracy on a given dataset. It is a simple metric: any task whose output is a bounding box can be evaluated with it. To measure objects of arbitrary size and shape, the approximate extent of each object must be annotated in the training images; that annotation is the prior box.
In other words, IoU measures how well the prediction matches reality: the greater the overlap, the higher the value.
Faster R-CNN also observed that nearby boxes are related by an approximately linear transformation; exploiting this, learning the mapping from the prior box to the ground-truth box is easier than regressing the target box directly.
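For intuition, here is the offset convention this post's dataset code uses later: each ground-truth corner is expressed relative to the crop box and normalized by the crop's side length, and decoding simply inverts that. The encode/decode helper names are mine, for illustration only.

# Encode a ground-truth box relative to a (square) crop box, and invert it.
def encode(gt, crop):  # both [x1, y1, x2, y2]
    side = crop[2] - crop[0]
    return [(g - c) / side for g, c in zip(gt, crop)]

def decode(offset, crop):
    side = crop[2] - crop[0]
    return [c + o * side for o, c in zip(offset, crop)]

print(encode([15, 12, 90, 88], [10, 10, 90, 90]))  # small normalized offsets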

IoU formula:

    IoU = area(A ∩ B) / area(A ∪ B)
        = intersection / (area(A) + area(B) - intersection)

The larger the IoU, the greater the degree of overlap.

import numpy as np

def iou(box, boxes, isMin=False):
    """IoU of one box against N boxes; boxes are [x1, y1, x2, y2(, ...)]."""
    box_area = (box[3] - box[1]) * (box[2] - box[0])
    boxes_area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])

    # Intersection rectangle
    xx1 = np.maximum(box[0], boxes[:, 0])
    yy1 = np.maximum(box[1], boxes[:, 1])
    xx2 = np.minimum(box[2], boxes[:, 2])
    yy2 = np.minimum(box[3], boxes[:, 3])

    w = np.maximum(0, xx2 - xx1)
    h = np.maximum(0, yy2 - yy1)
    area = w * h

    if isMin:
        # "Min mode": intersection over the smaller box (useful for dropping nested boxes)
        return np.true_divide(area, np.minimum(box_area, boxes_area))
    else:
        # Standard IoU: intersection over union
        return np.true_divide(area, box_area + boxes_area - area)
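A quick sanity check of the function with made-up boxes (the overlap case works out to roughly 0.14):

box = np.array([10, 10, 50, 50])
boxes = np.array([[10, 10, 50, 50],   # identical box   -> IoU 1.0
                  [30, 30, 70, 70],   # partial overlap -> IoU ~0.14
                  [60, 60, 90, 90]])  # disjoint        -> IoU 0.0
print(iou(box, boxes))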

NMS algorithm

NMS directly addresses the problem of the networks outputting many overlapping boxes:

  • Select the box with the highest confidence.
  • Compute the IoU of every remaining box against that highest-confidence box.
  • Discard the boxes whose IoU exceeds a threshold; this removes a large number of duplicate boxes.

The code implementation is as follows:
def nms(boxes, thresh=0.3, isMin=False):
    # Guard against an empty input (prevents indexing errors below)
    if boxes.shape[0] == 0:
        return np.array([])

    # Sort by confidence, descending: each row is [x1, y1, x2, y2, c]
    # (argsort is ascending by default, so negate the confidence column)
    _boxes = boxes[(-boxes[:, 4]).argsort()]
    # List that collects the boxes we keep
    r_boxes = []
    # Compare the top box with the rest; stop when at most one box remains
    while _boxes.shape[0] > 1:
        # Take out the highest-confidence box ...
        a_box = _boxes[0]
        # ... and the remaining boxes
        b_boxes = _boxes[1:]

        # Keep the top box
        r_boxes.append(a_box)

        # Keep only the boxes whose IoU with the top box is below the threshold
        index = np.where(iou(a_box, b_boxes, isMin) < thresh)
        _boxes = b_boxes[index]  # this also drives the loop-control condition

    # If one box is left after the loop, nothing that outranks it overlaps it: keep it
    if _boxes.shape[0] > 0:
        r_boxes.append(_boxes[0])
    # Stack the kept boxes into one matrix along axis 0
    return np.stack(r_boxes)
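A quick usage example with made-up boxes in [x1, y1, x2, y2, confidence] format:

boxes = np.array([[10, 10, 50, 50, 0.9],
                  [12, 12, 52, 52, 0.8],   # heavy overlap with the first -> suppressed
                  [60, 60, 90, 90, 0.7]])  # disjoint -> kept
print(nms(boxes, thresh=0.3))              # keeps the 0.9 and 0.7 boxes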

That covers the background; what remains is the practical code.
Network model code:
I made two modifications here:

  1. BatchNorm is added so the model converges faster during training.
  2. The max pooling in each network is replaced by convolutions with stride 2 (the data volume is basically sufficient, so overfitting is not a concern, though this increases computation). The effect is noticeably better.
from torch import nn
import torch

class PNet(nn.Module):
    def __init__(self):
        super(PNet,self).__init__()
        self.name = "pNet"
        self.pre_layer = nn.Sequential(
            nn.Conv2d(in_channels=3,out_channels=10,kernel_size=(3,3),stride=(1,1),padding=(1,1)), # conv1
            nn.PReLU(),                              # prelu1
            nn.Conv2d(in_channels=10, out_channels=10, kernel_size=(3,3), stride=(2,2)),  # stride-2 conv replacing pool1
            nn.Conv2d(10,16,kernel_size=(3,3),stride=(1,1)), # conv2
            nn.PReLU(),                              # prelu2
            nn.Conv2d(16,32,kernel_size=(3,3),stride=(1,1)), # conv3
            nn.PReLU()                               # prelu3
        )

        self.conv4_1 = nn.Conv2d(32,1,kernel_size=(1,1),stride=(1,1))
        self.conv4_2 = nn.Conv2d(32,4,kernel_size=(1,1),stride=(1,1))



    def forward(self, x):
        x = self.pre_layer(x)
        cond = torch.sigmoid(self.conv4_1(x)) # Confidence is activated with sigmoid (required before BCELoss)
        offset = self.conv4_2(x)         # The offset does not need to be activated and is output as is
        return cond,offset
# R network
class RNet(nn.Module):
    def __init__(self):
        super(RNet,self).__init__()
        self.name = "RNet"
        self.pre_layer = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=28, kernel_size=(3,3),stride=(1,1),padding=(1,1)), # conv1
            nn.BatchNorm2d(28),
            nn.PReLU(),                                                                  # prelu1
            nn.Conv2d(in_channels=28, out_channels=28, kernel_size=(3, 3), stride=(2, 2)),  # stride-2 conv replacing pool1
            nn.BatchNorm2d(28),
            nn.PReLU(),
            nn.Conv2d(28, 48, kernel_size=(3,3),stride=(1,1)),  # conv2
            nn.BatchNorm2d(48),
            nn.PReLU(),                                  # prelu2
            nn.Conv2d(in_channels=48, out_channels=48, kernel_size=(3, 3), stride=(2, 2)),  # stride-2 conv replacing pool2
            nn.BatchNorm2d(48),
            nn.PReLU(),
            nn.Conv2d(48, 64, kernel_size=(2,2), stride=(1,1)),          # conv3
            nn.BatchNorm2d(64),
            nn.PReLU()                                           # prelu3
        )
        self.conv4 = nn.Linear(64*3*3,128) # conv4
        self.prelu4 = nn.PReLU()           # prelu4
        #detetion
        self.conv5_1 = nn.Linear(128,1)
        #bounding box regression
        self.conv5_2 = nn.Linear(128, 4)

    def forward(self, x):
        #backend
        x = self.pre_layer(x)
        x = x.view(x.size(0),-1)
        x = self.conv4(x)
        x = self.prelu4(x)
        #detection
        label = torch.sigmoid(self.conv5_1(x)) # Confidence
        offset = self.conv5_2(x) # Offset
        return label,offset

# O network
class ONet(nn.Module):
    def __init__(self):
        super(ONet,self).__init__()
        self.name = "oNet"
        # backend
        self.pre_layer = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=(3,3), stride=(1,1),padding=(1,1)),  # conv1
            nn.BatchNorm2d(32),
            nn.PReLU(),                                          # prelu1
            nn.Conv2d(in_channels=32, out_channels=32, kernel_size=(3, 3), stride=(2, 2)),  # stride-2 conv replacing pool1
            nn.BatchNorm2d(32),
            nn.PReLU(),
            nn.Conv2d(32, 64,kernel_size=(3,3), stride=(1,1)),  # conv2
            nn.BatchNorm2d(64),
            nn.PReLU(),  # prelu2
            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=(3, 3), stride=(2, 2)),  # stride-2 conv replacing pool2
            nn.BatchNorm2d(64),
            nn.PReLU(),
            nn.Conv2d(64, 64, kernel_size=(3,3), stride=(1,1)),        # conv3
            nn.BatchNorm2d(64),
            nn.PReLU(),  # prelu3
            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=(2, 2), stride=(2, 2)),  # stride-2 conv replacing pool3
            nn.BatchNorm2d(64),
            nn.PReLU(),
            nn.Conv2d(64, 128,kernel_size=(2,2), stride=(1,1)),  # conv4
            nn.PReLU(),  # prelu4
        )
        self.conv5 = nn.Linear(128 * 3 * 3, 256)  # conv5
        self.prelu5 = nn.PReLU()                 # prelu5
        # detection
        self.conv6_1 = nn.Linear(256, 1)
        # bounding box regression
        self.conv6_2 = nn.Linear(256, 4)

    def forward(self, x):
        # backend
        x = self.pre_layer(x)
        x = x.reshape(x.size(0), -1)
        x = self.conv5(x)
        x = self.prelu5(x)
        # detection
        label = torch.sigmoid(self.conv6_1(x)) # Confidence
        offset = self.conv6_2(x)          # Offset

        return label, offset


if __name__ == '__main__':

    x = torch.randn(2,3,12,12)
    x2 = torch.randn(2,3,24,24)
    x3 = torch.randn(2,3,48,48)
    model1 = PNet()
    print(model1(x)[0].shape)
    print(model1(x)[1].shape)
    model2 = RNet()
    print(model2(x2)[0].shape)
    print(model2(x2)[1].shape)
    model3 = ONet()
    print(model3(x3)[0].shape)
    print(model3(x3)[1].shape)
    print(model1)
    print(model2)
    print(model3)
    print(list(model1.pre_layer[0].weight))

Data conversion code: the paper uses the CelebA and WIDER FACE datasets.
First, merge the bounding boxes and landmarks from the CelebA annotations into one file:

landmarks_path = r"F:\CelebA\Anno\list_landmarks_align_celeba.txt"
bbox_path = r"F:\CelebA\Anno\list_bbox_celeba.txt"

save_path = "anno.txt"
with open(landmarks_path, "r") as f:
    landmarks = f.readlines()
with open(bbox_path, "r") as f:
    bbox = f.readlines()
with open(save_path, "w") as f:
    for i, (line1, line2) in enumerate(zip(bbox, landmarks)):
        if i < 1:
            f.write(line1)  # line 0: the image-count header
        elif i == 1:
            strs = line1.strip() + " " + line2  # line 1: merged column headers
            f.write(strs)
        else:
            # Merge bbox and landmark fields, dropping the duplicate filename
            strs = line1.strip().split() + line2.strip().split()[1:]
            strs = " ".join(strs) + "\n"
            f.write(strs)
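After the merge, each line of anno.txt carries 15 fields: the filename, the four bounding-box values (x_1, y_1, width, height), and the ten landmark coordinates. The GenerateData class below parses exactly this layout.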

import os
import traceback

import numpy as np
from PIL import Image, ImageDraw

from tools import utils


class GenerateData():

    def __init__(self, anno_src=r"anno.txt",
                 imgs_path=r"F:\CelebA\Img\img_celeba\img_celeba",
                 save_path=r"D:\DataSet"
                 ):

        self.anno_src = anno_src
        self.imgs_path = imgs_path
        self.save_path = save_path

        if not os.path.exists(self.save_path):
            os.makedirs(self.save_path)

    def run(self, size=12):
        print("gen %i image" % size)  # %i: Decimal placeholder
        for face_size in [size]:
            # "Sample picture" storage path -- image
            positive_image_dir = os.path.join(self.save_path, str(face_size), "positive")  # Tertiary file path
            negative_image_dir = os.path.join(self.save_path, str(face_size), "negative")
            part_image_dir = os.path.join(self.save_path, str(face_size), "part")

            print(positive_image_dir, negative_image_dir, part_image_dir)
            for dir_path in [positive_image_dir, negative_image_dir, part_image_dir]:
                if not os.path.exists(dir_path):  # If the file does not exist, the file path is created
                    os.makedirs(dir_path)

            # "Sample label" storage path -- text
            positive_anno_filename = os.path.join(self.save_path, str(face_size), "positive.txt")  # Creating a positive sample txt file
            negative_anno_filename = os.path.join(self.save_path, str(face_size), "negative.txt")
            part_anno_filename = os.path.join(self.save_path, str(face_size), "part.txt")

            # Initial count value: name the file
            positive_count = 0  # Counter initial value
            negative_count = 0
            part_count = 0
            # Wrap the file operations in try/finally so an error cannot leave files open
            try:
                positive_anno_file = open(positive_anno_filename, "w")  # Open txt document in write mode
                negative_anno_file = open(negative_anno_filename, "w")
                part_anno_file = open(part_anno_filename, "w")
                for i, line in enumerate(open(self.anno_src)):  # iterate over every annotation line
                    if i < 2:
                        continue  # skip the two header lines (image count and column names)
                    # print(i,line)
                    try:
                        # print(line)
                        strs = line.strip().split(" ")  # strip removes spaces on both sides
                        # print(strs)
                        # print(strs)
                        image_filename = strs[0].strip()
                        # print(image_filename)
                        image_file = os.path.join(self.imgs_path, image_filename)  # Create file absolute path

                        with Image.open(image_file) as img:


                            img_w, img_h = img.size
                            x1 = float(strs[1].strip())  # bounding-box fields: x1, y1, width, height (as floats)
                            y1 = float(strs[2].strip())
                            w = float(strs[3].strip())
                            h = float(strs[4].strip())
                            x2 = float(x1 + w)
                            y2 = float(y1 + h)
                            px1 = float(strs[5].strip())  # the five facial landmark points
                            py1 = float(strs[6].strip())
                            px2 = float(strs[7].strip())
                            py2 = float(strs[8].strip())
                            px3 = float(strs[9].strip())
                            py3 = float(strs[10].strip())
                            px4 = float(strs[11].strip())
                            py4 = float(strs[12].strip())
                            px5 = float(strs[13].strip())
                            py5 = float(strs[14].strip())

                            # Filter out boxes that are too small or have invalid coordinates
                            if max(w, h) < 40 or x1 < 0 or y1 < 0 or w < 0 or h < 0:
                                continue
                            # The CelebA annotations are loose: shift the face box inward appropriately ★
                            x1 = int(x1 + w * 0.12)
                            y1 = int(y1 + h * 0.1)
                            x2 = int(x1 + w * 0.9)
                            y2 = int(y1 + h * 0.85)
                            w = int(x2 - x1)  # actual width of the adjusted box
                            h = int(y2 - y1)
                            boxes = [[x1, y1, x2, y2]]  # top-left and bottom-right corners; 2-D to keep a batch dimension

                            # draw = ImageDraw.Draw(img)
                            # draw.rectangle(boxes[0],)
                            # img.show()

                            # Calculate the center position of the face: the center position of the frame
                            cx = x1 + w / 2
                            cy = y1 + h / 2
                            # Randomly jitter the face center to generate positive and part samples
                            for _ in range(2):  # two jittered crops per face
                                # Offset the face center a little
                                w_ = np.random.randint(-w * 0.1, w * 0.1)  # horizontal jitter: up to 10% of the width
                                h_ = np.random.randint(-h * 0.1, h * 0.1)  # vertical jitter: up to 10% of the height
                                cx_ = cx + w_
                                cy_ = cy + h_

                                # Make the crop square (resized to 12x12 / 24x24 / 48x48) with a slightly randomized side length
                                side_len = np.random.randint(int(min(w, h) * 0.8), np.ceil(1.25 * max(w, h)))
                                # side length sampled between 0.8 * min(w, h) and 1.25 * max(w, h); ceil rounds up


                                x1_ = np.maximum(cx_ - side_len / 2, 0)  # clamp the jittered corner inside the image (was np.max, which treats 0 as an axis)
                                y1_ = np.maximum(cy_ - side_len / 2, 0)
                                x2_ = x1_ + side_len
                                y2_ = y1_ + side_len

                                crop_box = np.array([x1_, y1_, x2_, y2_])  # the jittered candidate box
                                # draw.rectangle(list(crop_box))


                                # Compute the normalized coordinate offsets: offset = (ground truth - crop) / side_len
                                offset_x1 = (x1 - x1_) / side_len
                                offset_y1 = (y1 - y1_) / side_len
                                offset_x2 = (x2 - x2_) / side_len
                                offset_y2 = (y2 - y2_) / side_len

                                offset_px1 = (px1 - x1_) / side_len  # landmark offsets, normalized the same way
                                offset_py1 = (py1 - y1_) / side_len
                                offset_px2 = (px2 - x1_) / side_len
                                offset_py2 = (py2 - y1_) / side_len
                                offset_px3 = (px3 - x1_) / side_len
                                offset_py3 = (py3 - y1_) / side_len
                                offset_px4 = (px4 - x1_) / side_len
                                offset_py4 = (py4 - y1_) / side_len
                                offset_px5 = (px5 - x1_) / side_len
                                offset_py5 = (py5 - y1_) / side_len

                                # Crop the patch and scale it to the network input size (12 / 24 / 48);
                                # the stored offsets are already normalized, so they need no rescaling
                                face_crop = img.crop(crop_box)
                                face_resize = face_crop.resize((face_size, face_size),
                                                               Image.LANCZOS)  # formerly Image.ANTIALIAS, removed in Pillow 10

                                iou = utils.iou(crop_box, np.array(boxes))[0]  # The extracted box and the original box calculate the IOU
                                if iou > 0.65:  # Positive samples; Originally 0.65
                                    positive_anno_file.write(
                                        "positive/{0}.jpg {1} {2} {3} {4} {5} {6} {7} {8} {9} {10} {11} {12} {13} {14} {15}\n".format(
                                            positive_count, 1, offset_x1, offset_y1,
                                            offset_x2, offset_y2, offset_px1, offset_py1, offset_px2, offset_py2,
                                            offset_px3,
                                            offset_py3, offset_px4, offset_py4, offset_px5, offset_py5))
                                    positive_anno_file.flush()  # flush: writes data from the cache to a file
                                    face_resize.save(
                                        os.path.join(positive_image_dir, "{0}.jpg".format(positive_count)))  # preservation
                                    positive_count += 1
                                elif iou > 0.4:  # Partial samples; Originally 0.4
                                    part_anno_file.write(
                                        "part/{0}.jpg {1} {2} {3} {4} {5} {6} {7} {8} {9} {10} {11} {12} {13} {14} {15}\n".format(
                                            part_count, 2, offset_x1, offset_y1, offset_x2,
                                            offset_y2, offset_px1, offset_py1, offset_px2, offset_py2, offset_px3,
                                            offset_py3, offset_px4, offset_py4, offset_px5, offset_py5))  # Write txt file
                                    part_anno_file.flush()
                                    face_resize.save(os.path.join(part_image_dir, "{0}.jpg".format(part_count)))
                                    part_count += 1
                                elif iou < 0.29:  # ★ few negative samples are generated in this way; Originally 0.3
                                    negative_anno_file.write(
                                        "negative/{0}.jpg {1} 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n".format(negative_count, 0))
                                    negative_anno_file.flush()
                                    face_resize.save(os.path.join(negative_image_dir, "{0}.jpg".format(negative_count)))
                                    negative_count += 1

                            # Generate negative samples
                            _boxes = np.array(boxes)
                            for i in range(2):  # The quantity is generally the same as before
                                side_len = np.random.randint(face_size, min(img_w, img_h) / 2)
                                x_ = np.random.randint(0, img_w - side_len)
                                y_ = np.random.randint(0, img_h - side_len)
                                crop_box = np.array([x_, y_, x_ + side_len, y_ + side_len])

                                if np.max(utils.iou(crop_box, _boxes)) < 0.29:  # Judge after adding IOU: keep the part less than 0.3; Originally 0.3
                                    face_crop = img.crop(crop_box)  # Matting
                                    face_resize = face_crop.resize((face_size, face_size),
                                                                   Image.LANCZOS)  # formerly Image.ANTIALIAS, removed in Pillow 10

                                    negative_anno_file.write(
                                        "negative/{0}.jpg {1} 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n".format(negative_count, 0))
                                    negative_anno_file.flush()
                                    face_resize.save(os.path.join(negative_image_dir, "{0}.jpg".format(negative_count)))
                                    negative_count += 1

                    except Exception as e:
                        print(e)
                        traceback.print_exc()

            except Exception as e:
                print(e)
            finally:
                # Close the annotation files in all cases
                positive_anno_file.close()
                negative_anno_file.close()
                part_anno_file.close()


if __name__ == '__main__':
    data = GenerateData()
    data.run(size=12)
    data.run(size=24)
    data.run(size=48)
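After the three runs, the save path holds one subtree per sample size, which is exactly the layout the dataset class later expects:

D:\DataSet
├── 12
│   ├── positive/  positive.txt
│   ├── negative/  negative.txt
│   └── part/      part.txt
├── 24
│   └── (same layout)
└── 48
    └── (same layout)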

WIDER FACE data code

import os
import traceback

import numpy as np
from PIL import Image, ImageDraw

from tools import utils

def gentxt():
    imgs_path = r"F:\widerface\WIDER_train\images"
    bbox_txt = r"F:\widerface\wider_face_split\wider_face_train_bbx_gt.txt"
    with open(bbox_txt,"r") as f:
        data = f.readlines()

    empty_dict = {}
    temp_name = None
    for i, line in enumerate(data):
        if line.strip().endswith("jpg"):
            line = line.strip()
            empty_dict[line] = []  # start a new image entry
            temp_name = line
        else:
            line = line.strip()
            if len(line) > 10:  # skip the short per-image box-count lines
                empty_dict[temp_name].append(line.split()[:4])  # keep x, y, w, h only

    with open("wider_anno.txt","w") as f:
        for key in empty_dict.keys():
            values = empty_dict[key]
            f.write(f"{key} ")
            for value in values:
                # print(value)
                # print(" ".join(value))
                f.write(" ".join(value))
                f.write(" ")
            f.write("\n")


class GenerateData():

    def __init__(self, anno_src=r"wider_anno.txt",
                 imgs_path=r"F:\widerface\WIDER_train\images",
                 save_path="D:\DataSet\wider"
                 ):

        self.anno_src = anno_src
        self.imgs_path = imgs_path
        self.save_path = save_path

        if not os.path.exists(self.save_path):
            os.makedirs(self.save_path)

    def run(self, size=12):
        print("gen %i image" % size)  # %i: Decimal placeholder
        for face_size in [size]:
            # "Sample picture" storage path -- image
            positive_image_dir = os.path.join(self.save_path, str(face_size), "positive")  # Tertiary file path
            negative_image_dir = os.path.join(self.save_path, str(face_size), "negative")
            part_image_dir = os.path.join(self.save_path, str(face_size), "part")

            print(positive_image_dir, negative_image_dir, part_image_dir)
            for dir_path in [positive_image_dir, negative_image_dir, part_image_dir]:
                if not os.path.exists(dir_path):  # If the file does not exist, the file path is created
                    os.makedirs(dir_path)

            # "Sample label" storage path -- text
            positive_anno_filename = os.path.join(self.save_path, str(face_size), "positive.txt")  # Creating a positive sample txt file
            negative_anno_filename = os.path.join(self.save_path, str(face_size), "negative.txt")
            part_anno_filename = os.path.join(self.save_path, str(face_size), "part.txt")

            # Initial count value: name the file
            positive_count = 0  # Counter initial value
            negative_count = 0
            part_count = 0
            # Wrap the file operations in try/finally so an error cannot leave files open
            try:
                positive_anno_file = open(positive_anno_filename, "w")  # Open txt document in write mode
                negative_anno_file = open(negative_anno_filename, "w")
                part_anno_file = open(part_anno_filename, "w")
                for i, line in enumerate(open(self.anno_src)):  # iterate over every annotation line

                    try:
                        # print(line)
                        strs = line.strip().split(" ")  # strip removes spaces on both sides
                        # print(strs)
                        # print(strs)
                        image_filename = strs[0].strip()
                        # print(image_filename)
                        image_file = os.path.join(self.imgs_path, image_filename)  # Create file absolute path
                        values = list(map(float, strs[1:]))
                        all_boxes = []

                        for index in range(0, len(values), 4):
                            all_boxes.append(values[index:index + 4])

                        with Image.open(image_file) as img:
                            for one_box in all_boxes:
                                img_w, img_h = img.size
                                x1 = one_box[0]
                                y1 = one_box[1]
                                w = one_box[2]
                                h = one_box[3]
                                x2 = float(x1 + w)
                                y2 = float(y1 + h)

                                # WIDER FACE has no landmark annotations, so the five facial key points are set to zero
                                px1 = py1 = px2 = py2 = px3 = 0
                                py3 = px4 = py4 = px5 = py5 = 0

                                # Filter out boxes that are too small or have invalid coordinates
                                if max(w, h) < 40 or x1 < 0 or y1 < 0 or w < 0 or h < 0:
                                    continue

                                # Unlike CelebA, the WIDER FACE boxes are reasonably tight, so no inward offset is applied here
                                boxes = [[x1, y1, x2, y2]]  # top-left and bottom-right corners; 2-D to keep a batch dimension

                                # Calculate the center position of the face: the center position of the frame
                                cx = x1 + w / 2
                                cy = y1 + h / 2
                                # Randomly jitter the face center to generate positive and part samples
                                for _ in range(1):  # one jittered crop per face
                                    # Offset the face center a little
                                    w_ = np.random.randint(-w * 0.2, w * 0.2)  # horizontal jitter: up to 20% of the width
                                    h_ = np.random.randint(-h * 0.2, h * 0.2)  # vertical jitter: up to 20% of the height
                                    cx_ = cx + w_
                                    cy_ = cy + h_

                                    # Make the crop square (resized to 12x12 / 24x24 / 48x48) with a slightly randomized side length
                                    side_len = np.random.randint(int(min(w, h) * 0.8), np.ceil(1.25 * max(w, h)))
                                    # side length sampled between 0.8 * min(w, h) and 1.25 * max(w, h); ceil rounds up


                                    x1_ = np.maximum(cx_ - side_len / 2, 0)  # clamp the jittered corner inside the image (was np.max)
                                    y1_ = np.maximum(cy_ - side_len / 2, 0)
                                    x2_ = x1_ + side_len
                                    y2_ = y1_ + side_len

                                    crop_box = np.array([x1_, y1_, x2_, y2_])  # the jittered candidate box


                                    # Compute the normalized coordinate offsets: offset = (ground truth - crop) / side_len
                                    offset_x1 = (x1 - x1_) / side_len
                                    offset_y1 = (y1 - y1_) / side_len
                                    offset_x2 = (x2 - x2_) / side_len
                                    offset_y2 = (y2 - y2_) / side_len

                                    # No landmarks in WIDER FACE: all landmark offsets are zero
                                    offset_px1 = offset_py1 = offset_px2 = offset_py2 = offset_px3 = 0
                                    offset_py3 = offset_px4 = offset_py4 = offset_px5 = offset_py5 = 0

                                    # Crop the patch and scale it to the network input size (12 / 24 / 48)
                                    face_crop = img.crop(crop_box)
                                    face_resize = face_crop.resize((face_size, face_size),
                                                                   Image.LANCZOS)  # formerly Image.ANTIALIAS, removed in Pillow 10

                                    iou = utils.iou(crop_box, np.array(boxes))[0]  # The extracted box and the original box calculate the IOU
                                    if iou > 0.65:  # Positive samples; Originally 0.65
                                        positive_anno_file.write(
                                            "positive/{0}.jpg {1} {2} {3} {4} {5} {6} {7} {8} {9} {10} {11} {12} {13} {14} {15}\n".format(
                                                positive_count, 1, offset_x1, offset_y1,
                                                offset_x2, offset_y2, offset_px1, offset_py1, offset_px2, offset_py2,
                                                offset_px3,
                                                offset_py3, offset_px4, offset_py4, offset_px5, offset_py5))
                                        positive_anno_file.flush()  # flush: writes data from the cache to a file
                                        face_resize.save(
                                            os.path.join(positive_image_dir, "{0}.jpg".format(positive_count)))  # preservation
                                        positive_count += 1
                                    elif iou > 0.4:  # Partial samples; Originally 0.4
                                        part_anno_file.write(
                                            "part/{0}.jpg {1} {2} {3} {4} {5} {6} {7} {8} {9} {10} {11} {12} {13} {14} {15}\n".format(
                                                part_count, 2, offset_x1, offset_y1, offset_x2,
                                                offset_y2, offset_px1, offset_py1, offset_px2, offset_py2, offset_px3,
                                                offset_py3, offset_px4, offset_py4, offset_px5, offset_py5))  # Write txt file
                                        part_anno_file.flush()
                                        face_resize.save(os.path.join(part_image_dir, "{0}.jpg".format(part_count)))
                                        part_count += 1
                                    elif iou < 0.29:  # ★ few negative samples are generated in this way; Originally 0.3
                                        negative_anno_file.write(
                                            "negative/{0}.jpg {1} 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n".format(negative_count, 0))
                                        negative_anno_file.flush()
                                        face_resize.save(os.path.join(negative_image_dir, "{0}.jpg".format(negative_count)))
                                        negative_count += 1

                                # Random negative-sample generation (as in the CelebA script above) is
                                # disabled here; negatives come only from low-IoU jittered crops.

                    except Exception as e:
                        print(e)
                        traceback.print_exc()

            except Exception as e:
                print(e)
            finally:
                # Close the annotation files in all cases
                positive_anno_file.close()
                negative_anno_file.close()
                part_anno_file.close()


if __name__ == '__main__':
    data = GenerateData()
    data.run(size=12)
    data.run(size=24)
    data.run(size=48)


Dataset code

  • Data augmentation is added.
  • During training, the model would sometimes treat hands or even vegetables as human heads, which is quite alarming. In the CelebA data many hands occlude heads, most subjects have yellow-toned skin, and some images are dark, so color jitter is used to vary contrast and color.
  • During training, profile faces proved hard to recognize, so random horizontal flipping is used to mirror faces across many samples; the effect is good and it solves the problem.
# Create dataset
from torch.utils.data import Dataset
import os
import numpy as np
import torch
from PIL import Image
from torchvision import transforms
tf1 = transforms.Compose(
    [transforms.ColorJitter(brightness=0.5),
     transforms.RandomHorizontalFlip(p=0.5)
     ]
)
tf2 = transforms.Compose(
    [transforms.ColorJitter(contrast=0.5),
     transforms.RandomHorizontalFlip(p=0.5)
     ]
)
tf3 = transforms.Compose(
    [transforms.ColorJitter(saturation=0.5),
     transforms.RandomHorizontalFlip(p=0.5)
     ]
)
tf4 = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5)
])

tf = transforms.RandomChoice(
    [
        tf1,tf2,tf3,tf4
    ]
)
# data set
class FaceDataset(Dataset):
    def __init__(self, path_1=r"D:\DataSet", path_2=r"D:\DataSet\wider", size=12, tf=tf):
        super(FaceDataset, self).__init__()
        self.dataset = []
        self.size = size
        for path in [path_1,path_2]:
            self.base_path_1 = path
            self.path = os.path.join(self.base_path_1,str(self.size))
            for txt in ["positive.txt","negative.txt","part.txt"]:
                with open(os.path.join(self.path,txt),"r") as f:
                    data = f.readlines()
                for line in data:
                    line = line.strip().split()
                    img_path = os.path.join(self.path,line[0])
                    benkdata = " ".join(line[1:])
                    self.dataset.append([img_path,benkdata])

        self.tf = tf

    def __len__(self):
        return len(self.dataset) # Data set length

    def __getitem__(self, index):  # fetch one sample

        img_path, strs = self.dataset[index]
        strs = strs.strip().split(" ")  # split the label string on spaces
        # Label: confidence + offsets
        cond = torch.Tensor([int(strs[0])])  # keep the brackets, otherwise the shape is wrong
        offset = torch.Tensor([float(strs[1]), float(strs[2]), float(strs[3]), float(strs[4])])
        # Sample: image data
        img = Image.open(img_path)

        img = self.tf(img)  # random augmentation
        img = np.array(img) / 255. - 0.5  # normalize to [-0.5, 0.5] (rough mean removal)
        img_data = torch.tensor(img, dtype=torch.float32)
        img_data = img_data.permute(2, 0, 1)  # HWC -> CHW

        return img_data, cond, offset

# test
if __name__ == '__main__':


    dataset = FaceDataset(size=12)
    print(dataset[0])
    print(len(dataset))

Network training

Training code

A cosine-annealing learning-rate schedule is added; SmoothL1Loss regresses the coordinate offsets, and BCELoss handles the face/non-face classification.
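To see what the schedule does before reading the trainer, this standalone snippet prints the learning rate over a few epochs (T_0=5 as in the P-network trainer below; the dummy parameter exists only so the optimizer can be built):

import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.0001, momentum=0.9)
sched = CosineAnnealingWarmRestarts(opt, T_0=5, T_mult=1)
for epoch in range(12):
    print(epoch, sched.get_last_lr()[0])  # anneals toward 0, then restarts every 5 epochs
    opt.step()    # dummy optimization step, so the scheduler-order warning is avoided
    sched.step()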
P network training

# Create Trainer - to train three networks
import os
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
from torch.utils.data import DataLoader
import torch
from torch import nn
import torch.optim as optim
from sampling import FaceDataset # Import dataset
from models import models
from torch.utils.tensorboard import SummaryWriter
# Create trainer

class Trainer:
    def __init__(self, net, save_path, dataset_size, isCuda=True, SummaryWriter_path=r"run"):
        # net: the network; save_path: checkpoint path; dataset_size: sample size to train on (12/24/48); isCuda: use the GPU
        self.net = net
        self.save_path = save_path

        self.dataset_path = dataset_size
        self.isCuda = isCuda
        # print(self.net.name)
        # self.net.name
        summaryWriter_path = os.path.join(SummaryWriter_path,self.net.name)

        if not os.path.exists(summaryWriter_path):
            os.makedirs(summaryWriter_path)

        length = len(os.listdir(summaryWriter_path))
        path_name = os.path.join(summaryWriter_path, "exp" + str(length))
        os.makedirs(path_name)

        self.summaryWriter = SummaryWriter(path_name)

        if self.isCuda:
            self.net.cuda()  # move the network to the GPU

        # Loss functions
        # Confidence loss: binary cross-entropy. BCELoss expects sigmoid outputs
        # (whereas nn.CrossEntropyLoss takes raw logits and applies softmax itself)
        self.cls_loss_fn = nn.BCELoss()
        # Offset loss
        self.offset_loss_fn = nn.SmoothL1Loss()

        # Create optimizer
        self.optimizer = optim.SGD(self.net.parameters(),lr=0.0001,momentum=0.9)

        # Resume network training - load model parameters and continue training
        if os.path.exists(self.save_path): # If the file exists, continue training
            net.load_state_dict(torch.load(self.save_path),strict=False)

    # Training method
    def train(self, epochs=1000):

        faceDataset = FaceDataset(size=self.dataset_path)  # dataset
        dataloader = DataLoader(faceDataset, batch_size=256, shuffle=True, num_workers=4, drop_last=True)
        # num_workers=4: four worker processes load data in parallel; drop_last=True drops the final incomplete batch
        scheduler = CosineAnnealingWarmRestarts(self.optimizer, T_0=5, T_mult=1)

        self.best_loss = 1

        for epoch in range(epochs):
            for i, (img_data_, category_, offset_) in enumerate(dataloader):  # sample, confidence, offset
                if self.isCuda:  # move the batch to the GPU (video memory)
                    img_data_ = img_data_.cuda()  # [N, 3, 12, 12]
                    category_ = category_.cuda()  # [N, 1]
                    offset_ = offset_.cuda()      # [N, 4]

                # Network output
                _output_category, _output_offset = self.net(img_data_) # Output confidence, offset

                # print(_output_category.shape)  # [N, 1, 1, 1]
                # print(_output_offset.shape)    # [N, 4, 1, 1]
                output_category = _output_category.reshape(-1, 1)  # [N, 1]
                output_offset = _output_offset.reshape(-1, 4)      # [N, 4]
                # output_landmark = _output_landmark.view(-1, 10)

                # Classification (confidence) loss
                category_mask = torch.lt(category_, 2)  # select positives (1) and negatives (0); ★ part samples (2) are excluded from this loss
                category = torch.masked_select(category_, category_mask)               # ground-truth confidences that pass the mask
                output_category = torch.masked_select(output_category, category_mask)  # predicted confidences that pass the mask
                cls_loss = self.cls_loss_fn(output_category, category)                 # confidence loss

                # Bounding-box regression (offset) loss
                offset_mask = torch.gt(category_, 0)  # ★ negatives (0) carry no offset, so they are excluded; [N, 1]
                offset_index = torch.nonzero(offset_mask)[:, 0]  # indices of the non-negative samples
                offset = offset_[offset_index]                   # ground-truth offsets
                output_offset = output_offset[offset_index]      # predicted offsets
                offset_loss = self.offset_loss_fn(output_offset, offset)  # offset loss

                #Total loss
                loss = cls_loss + offset_loss
                # Back propagation, network optimization
                self.optimizer.zero_grad() # Gradient before emptying
                loss.backward()           # Calculated gradient
                self.optimizer.step()    # Optimize network

                # Log the losses: detach from GPU -> CPU -> numpy
                print("epoch=", epoch, "loss:", loss.cpu().data.numpy(), " cls_loss:", cls_loss.cpu().data.numpy(),
                      " offset_loss:", offset_loss.cpu().data.numpy())


                self.summaryWriter.add_scalars("loss", {"loss": loss.cpu().data.numpy(),
                                                        "cls_loss": cls_loss.cpu().data.numpy(),
                                                        "offset_loss": offset_loss.cpu().data.numpy()}, epoch)
                # self.summaryWriter.add_histogram("pre_conv_layer1",self.net.pre_layer[0].weight,epoch)
                # self.summaryWriter.add_histogram("pre_conv_layer2",self.net.pre_layer[3].weight,epoch)
                # self.summaryWriter.add_histogram("pre_conv_layer3",self.net.pre_layer[5].weight,epoch)


                # Checkpointing
                if i % 5 == 0:
                    if i % 500 == 0:
                        torch.save(self.net.state_dict(), self.save_path)  # periodic save of the network parameters
                        print("save success")
                    if loss.cpu().data.numpy() < self.best_loss:
                        self.best_loss = loss.cpu().data.numpy()
                        torch.save(self.net.state_dict(), self.save_path)  # save whenever the loss improves
                        print("save success")

            scheduler.step()

if __name__ == '__main__':

    net = models.PNet()
    trainer = Trainer(net, 'pnet.pt', dataset_size=12)  # network, checkpoint path, sample size
    trainer.train()
    # net = models.RNet()
    # trainer = Trainer(net, 'rnet.pt', dataset_size=24)
    # trainer.train()
    # net = models.ONet()
    # trainer = Trainer(net, 'onet.pt', dataset_size=48)
    # trainer.train()


R network training

# Create Trainer - to train three networks


import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
from torch.utils.data import DataLoader
import torch
from torch import nn
import torch.optim as optim
from sampling import FaceDataset  # Import dataset
from models import models
from torch.utils.tensorboard import SummaryWriter


# Create trainer
class Trainer:
    def __init__(self, net, save_path, dataset_path, isCuda=True, SummaryWriter_path=r"run"):
        # net: the network; save_path: checkpoint path; dataset_path: sample size to train on (12/24/48); isCuda: use the GPU
        self.net = net
        self.save_path = save_path
        self.dataset_path = dataset_path
        self.isCuda = isCuda
        # print(self.net.name)
        # self.net.name
        summaryWriter_path = os.path.join(SummaryWriter_path, self.net.name)

        if not os.path.exists(summaryWriter_path):
            os.makedirs(summaryWriter_path)

        length = len(os.listdir(summaryWriter_path))
        path_name = os.path.join(summaryWriter_path, "exp" + str(length))
        os.makedirs(path_name)

        self.summaryWriter = SummaryWriter(path_name)

        if self.isCuda:
            self.net.cuda()  # move the network to the GPU

        # Loss functions
        # Confidence loss: binary cross-entropy (expects sigmoid outputs)
        self.cls_loss_fn = nn.BCELoss()
        # Offset loss
        self.offset_loss_fn = nn.SmoothL1Loss()

        # Create optimizer
        self.optimizer = optim.SGD(self.net.parameters(),lr=0.0001,momentum=0.8)

        # Resume network training - load model parameters and continue training
        if os.path.exists(self.save_path):  # If the file exists, continue training
            net.load_state_dict(torch.load(self.save_path),strict=False)

    # Training method
    def train(self):
        faceDataset = FaceDataset(size=self.dataset_path)  # dataset
        dataloader = DataLoader(faceDataset, batch_size=128, shuffle=True, num_workers=2, drop_last=True)
        # num_workers=2: two worker processes load data in parallel; drop_last=True drops the final incomplete batch
        self.best_loss = None
        while True:
            for i, (img_data_, category_, offset_) in enumerate(dataloader):  # sample, confidence, offset
                if self.isCuda:  # move the batch to the GPU
                    img_data_ = img_data_.cuda()  # [N, 3, 24, 24]
                    category_ = category_.cuda()  # [N, 1]
                    offset_ = offset_.cuda()      # [N, 4]

                # Network output
                _output_category, _output_offset = self.net(img_data_)  # Output confidence, offset

                output_category = _output_category.view(-1, 1)  # [N, 1]
                output_offset = _output_offset.view(-1, 4)      # [N, 4]
                # output_landmark = _output_landmark.view(-1, 10)

                # Classification (confidence) loss
                category_mask = torch.lt(category_, 2)  # ★ part samples (2) are excluded from the confidence loss
                category = torch.masked_select(category_, category_mask)
                output_category = torch.masked_select(output_category, category_mask)
                cls_loss = self.cls_loss_fn(output_category, category)

                # Bounding-box regression (offset) loss
                offset_mask = torch.gt(category_, 0)  # ★ negatives carry no offset, so they are excluded
                offset_index = torch.nonzero(offset_mask)[:, 0]
                offset = offset_[offset_index]
                output_offset = output_offset[offset_index]
                offset_loss = self.offset_loss_fn(output_offset, offset)

                # Total loss
                loss = 0.5*cls_loss + offset_loss

                if i == 0:
                    self.best_loss = loss.cpu().data.numpy()
                # Back propagation, network optimization
                self.optimizer.zero_grad()  # Gradient before emptying
                loss.backward()  # Calculated gradient
                self.optimizer.step()  # Optimize network

                # Print losses: GPU tensor -> CPU -> numpy array
                print("i=", i, "loss:", loss.cpu().data.numpy(), " cls_loss:", cls_loss.cpu().data.numpy(),
                      " offset_loss",
                      offset_loss.cpu().data.numpy())

                self.summaryWriter.add_scalars("loss", {"loss": loss.cpu().data.numpy(),
                                                        "cls_loss": cls_loss.cpu().data.numpy(),
                                                        "offser_loss": offset_loss.cpu().data.numpy()},i)
                # self.summaryWriter.add_histogram("pre_conv_layer1", self.net.pre_layer[0].weight,i)
                # self.summaryWriter.add_histogram("pre_conv_layer2", self.net.pre_layer[3].weight,i)
                # self.summaryWriter.add_histogram("pre_conv_layer3", self.net.pre_layer[6].weight,i)
                # preservation
                if (i + 1) % 100 == 0 or self.best_loss > loss.cpu().data.numpy():
                    self.best_loss = loss.cpu().data.numpy()
                    torch.save(self.net.state_dict(), self.save_path)  # state_dict holds the network parameters; save_path is where they are written
                    print("save success")  # Save every 100 iterations, and whenever the loss hits a new low


if __name__ == '__main__':

    net = models.RNet()
    trainer = Trainer(net, 'rnet.pt', 24)  # Network, parameter save path, sample size; create trainer
    trainer.train()
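A note on the masking above: the classification loss uses only positive and negative samples (confidence label < 2), while the bounding-box loss uses only positive and partial samples (label > 0). A minimal toy sketch of torch.lt / torch.gt / masked_select (the labels here are made up for illustration):

import torch

# Toy confidence labels: 0 = negative, 1 = positive, 2 = partial sample
category = torch.tensor([[0.], [1.], [2.], [1.]])

cls_mask = torch.lt(category, 2)                  # positives and negatives only
print(torch.masked_select(category, cls_mask))    # tensor([0., 1., 1.])

offset_mask = torch.gt(category, 0)               # positives and partials only
print(torch.nonzero(offset_mask)[:, 0])           # tensor([1, 2, 3]) - row indices used for the offset loss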


O network training

The O network is trained with the same Trainer class and script as the R network above; only the entry point changes (ONet, 48 x 48 samples, saved to onet.pt):
if __name__ == '__main__':

    net = models.ONet()
    trainer = Trainer(net, 'onet.pt', 48)  # Network, parameter save path, sample size; create trainer
    trainer.train()

Detection code

  • The P network is fully convolutional, so it can take images of any scale as input; boxes are then recovered from the output feature map by an inverse calculation
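As a quick sanity check on the pyramid, here is a minimal sketch (not part of the detector) that counts the scales the P network will visit, assuming the 0.7 zoom factor and 12-pixel minimum side used in the code below:

def pyramid_scales(w, h, factor=0.7, min_side=12):
    # Collect every scale at which the shorter side still exceeds min_side
    scales, scale = [], 1.0
    while min(w, h) * scale > min_side:
        scales.append(scale)
        scale *= factor
    return scales

print(len(pyramid_scales(416, 500)))  # 10 pyramid levels for a 416 x 500 image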
import os
import time

import cv2

from tools import utils
import numpy as np
import torch
from PIL import Image, ImageDraw
from torchvision import transforms

from models import models


class Detector():
    def __init__(self,
                 pnet_param="pnet.pt",
                 rnet_param="rnet.pt",
                 onet_param="onet.pt",
                 isCuda=True,
                 # Network tuning
                 p_cls=0.6,  # Originally 0.6
                 p_nms=0.5,  # Originally 0.5
                 r_cls=0.6,  # Originally 0.6
                 r_nms=0.5,  # Originally 0.5
                 # O network:
                 o_cls=0.99,  # Originally 0.97
                 o_nms=0.6,  # Originally 0.7
                 ):


        self.isCuda = isCuda
        self.pnet = models.PNet()  # Create instance variables to instantiate P network
        self.rnet = models.RNet()
        self.onet = models.ONet()

        if self.isCuda:
            self.pnet.cuda()
            self.rnet.cuda()
            self.onet.cuda()

        self.pnet.load_state_dict(torch.load(pnet_param))  # Load the trained weight into the P-network
        self.rnet.load_state_dict(torch.load(rnet_param))
        self.onet.load_state_dict(torch.load(onet_param))

        self.pnet.eval()  # Call eval() at inference time so that BatchNorm and Dropout layers switch to evaluation behaviour
        self.rnet.eval()
        self.onet.eval()

        self.p_cls = p_cls  # Originally 0.6
        self.p_nms = p_nms  # Originally 0.5
        self.r_cls = r_cls  # Originally 0.6
        self.r_nms = r_nms   # Originally 0.5
        # O network:
        self.o_cls = o_cls  # Originally 0.97
        self.o_nms = o_nms  # Originally 0.7

        self.__image_transform = transforms.Compose(
            [
                transforms.ToTensor()
            ]
        )

    def detect(self, image):
        # P network detection ----- 1st
        start_time = time.time()
        pnet_boxes = self.__pnet_detect(image)  # Call the __pnet_detect method (defined below)
        if pnet_boxes.shape[0] == 0:
            return np.array([])
        end_time = time.time()
        t_pnet = end_time - start_time  # Time taken by the P network
        # print(pnet_boxes.shape)
        # return pnet_boxes                    # boxes detected by the P network

        start_time = time.time()
        rnet_boxes = self.__rnet_detect(image, pnet_boxes)  # Call the __rnet_detect method (defined below)
        if rnet_boxes.shape[0] == 0:
            return np.array([])
        end_time = time.time()
        t_rnet = end_time - start_time  # Time taken by the R network
        # return rnet_boxes                    # boxes detected by the R network

        start_time = time.time()
        onet_boxes = self.__onet_detect(image, rnet_boxes)  # Call the __onet_detect method (defined below)
        if onet_boxes.shape[0] == 0:
            return np.array([])
        end_time = time.time()
        t_onet = end_time - start_time  # Time taken by the O network
        # return onet_boxes                    # boxes detected by the O network

        # Total detection time of the three networks
        t_sum = t_pnet + t_rnet + t_onet
        print("total:{0} pnet:{1} rnet:{2} onet:{3}".format(t_sum, t_pnet, t_rnet, t_onet))

        return onet_boxes

    def __pnet_detect(self, image):  # ★ The P network is fully convolutional, so it is independent of the input size and accepts images of any shape
        boxes = []  # Create an empty list and receive qualified suggestion boxes

        img = image
        w, h = img.size
        min_side_len = min(w, h)  # Gets the minimum side length of the picture

        scale = 1  # Initial scale (1 = no scaling): used to generate images at different resolutions
        while min_side_len > 12:  # Keep scaling down until the shorter side reaches 12 or below
            img_data = self.__image_transform(img)  # Convert image array to tensor
            if self.isCuda:
                img_data = img_data.cuda()  # Send the picture tensor to cuda for acceleration
            img_data.unsqueeze_(0)  # Increase dimension in "batch" (more than one picture is transmitted during test)
            # print("img_data:",img_data.shape) # [1, 3, 416, 500]: C=3,W=416,H=500

            _cls, _offest = self.pnet(img_data)  # ★★ returns confidences and offsets for every position of the feature map
            # print("_cls",_cls.shape)         # [1, 1, 203, 245]: N, C, H, W of the feature map ★
            # print("_offest", _offest.shape)  # [1, 4, 203, 245]: N, C, H, W

            cls = _cls[0][0].cpu().data  # [203, 245]: feature-map H, W
            offest = _offest[0].cpu().data  # [4, 203, 245]: feature-map C, H, W
            idxs = torch.nonzero(torch.gt(cls, self.p_cls))  # ★ indices where confidence exceeds p_cls (0.6); if the P network frames no faces at all, the network is under-trained or the threshold needs adjusting
            # print(idxs)
            for idx in idxs:  # Add qualifying boxes by index; cls[idx[0], idx[1]] is the confidence value: idx[0] row index, idx[1] column index
                boxes.append(self.__box(idx, offest, cls[idx[0], idx[1]], scale))  # ★ call the __box inverse calculation (map the feature-map box back onto the original image); keep boxes above the threshold
            scale *= 0.7  # Shrink the image: this also drives the loop condition
            _w = int(w * scale)  # New width
            _h = int(h * scale)

            img = img.resize((_w, _h))  # The picture is scaled according to the scaled width and height
            min_side_len = min(_w, _h)  # Regain minimum width and height

        return utils.nms(np.array(boxes), self.p_nms)  # NMS over the collected boxes; the default p_nms=0.5 keeps boxes whose IoU is below 0.5; if the network is well trained this value can be lowered

        # Feature inverse calculation: restore the regression quantity to the original graph, and inverse the calculated result to the original graph suggestion box according to the feature graph

    def __box(self, start_index, offset, cls, scale, stride=2, side_len=12):  # The P network's pooling stride is 2

        _x1 = (start_index[1].float() * stride) / scale  # Index times stride, divided by the pyramid scale; ★ note the row/column swap: start_index[1] (column) gives x, start_index[0] (row) gives y
        _y1 = (start_index[0].float() * stride) / scale
        _x2 = (start_index[1].float() * stride + side_len) / scale
        _y2 = (start_index[0].float() * stride + side_len) / scale

        ow = _x2 - _x1  # The width and height of the suggestion box in the area where the face is located
        oh = _y2 - _y1

        _offset = offset[:, start_index[0], start_index[1]]  # Fetch the four offsets [Δx1, Δy1, Δx2, Δy2] at this row/column index

        x1 = _x1 + ow * _offset[0]  # Recover the actual box from the offset: x1 = _x1 + ow * Δ; when the samples were made, Δ = (x1 - _x1) / ow
        y1 = _y1 + oh * _offset[1]
        x2 = _x2 + ow * _offset[2]
        y2 = _y2 + oh * _offset[3]

        return [x1, y1, x2, y2, cls]  # Final box: 4 coordinates and 1 confidence


    def __rnet_detect(self, image, pnet_boxes):

        _img_dataset = []  # Create an empty list to store matting
        _pnet_boxes = utils.convert_to_square(pnet_boxes)  # ★ turn each P-network box into a square about its centre (side = the longer side) before cropping
        for _box in _pnet_boxes:  # ★ for each box: take its 4 coordinates, crop, resize, convert to tensor, append to the list
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])

            img = image.crop((_x1, _y1, _x2, _y2))  # Matting according to 4 coordinate points
            img = img.resize((24, 24))  # Resize to the fixed 24 x 24 input size
            img_data = self.__image_transform(img)  # Convert image array to tensor
            _img_dataset.append(img_data)

        img_dataset = torch.stack(_img_dataset)  # Stack along axis 0 into one batch tensor (see example 2 ★)
        if self.isCuda:
            img_dataset = img_dataset.cuda()  # cuda acceleration for image data

        _cls, _offset = self.rnet(img_dataset)  # ★★ send 24 * 24 pictures to the network for another screening

        cls = _cls.cpu().data.numpy()  # Put the data on the gpu on the cpu and convert it into a numpy array
        offset = _offset.cpu().data.numpy()
        # print("r_cls:",cls.shape)  # (11, 1): the P network generates 11 boxes
        # print("r_offset:", offset.shape)  # (11, 4)

        boxes = []  # The boxes to be left by the R network are saved in boxes
        idxs, _ = np.where(cls > self.r_cls)  # The default threshold 0.6 is fairly low and lets through useless boxes (print them to check), so it can be raised; idxs holds the axis-0 indices of the qualifying boxes (see example 3)
        for idx in idxs:  # Traverse the qualifying boxes by index
            _box = _pnet_boxes[idx]
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])

            ow = _x2 - _x1  # Width of datum frame
            oh = _y2 - _y1

            x1 = _x1 + ow * offset[idx][0]  # Coordinate point of actual frame
            y1 = _y1 + oh * offset[idx][1]
            x2 = _x2 + ow * offset[idx][2]
            y2 = _y2 + oh * offset[idx][3]

            boxes.append([x1, y1, x2, y2, cls[idx][0]])  # Return 4 coordinate points and confidence

        return utils.nms(np.array(boxes), self.r_nms)  # The default r_nms is 0.5: boxes whose IoU is below 0.5 are kept (lower it to prune harder; the 0.6 confidence threshold above can be raised)

        # Create O network detection function

    def __onet_detect(self, image, rnet_boxes):

        _img_dataset = []  # Create a list to store the crops
        _rnet_boxes = utils.convert_to_square(rnet_boxes)  # Turn each R-network box into a square about its centre (side = the longer side)
        for _box in _rnet_boxes:  # For each R-network box: compute coordinates, crop, resize, convert to tensor, append and stack
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])

            img = image.crop((_x1, _y1, _x2, _y2))  # "Matting" according to coordinate points
            img = img.resize((48, 48))
            img_data = self.__image_transform(img)  # Turn the extracted graph into a tensor
            _img_dataset.append(img_data)

        img_dataset = torch.stack(_img_dataset)  # Stack into one batch tensor (see example 2)
        if self.isCuda:
            img_dataset = img_dataset.cuda()

        _cls, _offset = self.onet(img_dataset)
        cls = _cls.cpu().data.numpy()  # (1, 1)
        offset = _offset.cpu().data.numpy()  # (1, 4)

        boxes = []  # Store o network calculation results
        idxs, _ = np.where(cls > self.o_cls)  # The default o_cls of 0.97 is low; to keep only real faces the confidence should approach 0.99999 (e.g. 0.99998); idxs holds the axis-0 indices of the qualifying boxes (see example 3)
        for idx in idxs:  # Traverse the qualifying boxes by index
            _box = _rnet_boxes[idx]  # Take R network as the benchmark box
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])

            ow = _x2 - _x1  # The base width of the box. The box is "square", ow=oh
            oh = _y2 - _y1

            x1 = _x1 + ow * offset[idx][0]  # Coordinates of the final O-network box; when the samples were made, Δ = (x1 - _x1) / ow
            y1 = _y1 + oh * offset[idx][1]
            x2 = _x2 + ow * offset[idx][2]
            y2 = _y2 + oh * offset[idx][3]

            boxes.append([x1, y1, x2, y2, cls[idx][0]])  # Returns 4 coordinate points and 1 confidence

        return utils.nms(np.array(boxes), self.o_nms, isMin=True)  # NMS with isMin=True: overlap divided by the smaller box area; boxes whose IoU is below o_nms are kept


def detect(imgs_path="./test_images", save_path="./result_images",test_video=False):
    if not os.path.exists(save_path):
        os.makedirs(save_path)


    if test_video:
        # Video detection
        detector = Detector()
        cap = cv2.VideoCapture('./test_video/Cai Xukun.mp4')
        num = 0
        crop_dir = "./Cai Xukun photos"
        if not os.path.exists(crop_dir):
            os.makedirs(crop_dir)
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:  # Stop when the video runs out of frames
                break
            if cv2.waitKey(25) & 0xFF == ord('q'):
                break
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            print("-------------------")
            boxes = detector.detect(image)
            print("size:", image.size)
            for box in boxes:
                x1 = int(box[0])
                y1 = int(box[1])
                x2 = int(box[2])
                y2 = int(box[3])
                print(x1, y1, x2, y2)
                print("conf:", box[4])
                head = image.crop((x1, y1, x2, y2))  # PIL crop takes a single (left, upper, right, lower) tuple
                head.save(f"{crop_dir}/{num}.jpg")
                num += 1
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 2)
            cv2.imshow("res", frame)


        cap.release()
        cv2.destroyAllWindows()

    else:
        detector = Detector()  # Build the detector once, not once per image
        for img in os.listdir(imgs_path):
            with Image.open(os.path.join(imgs_path, img)) as image:
                print("-------------------")
                boxes = detector.detect(image)
                print("size:", image.size)
                img_draw = ImageDraw.Draw(image)
                for box in boxes:
                    x1 = int(box[0])
                    y1 = int(box[1])
                    x2 = int(box[2])
                    y2 = int(box[3])
                    print(x1, y1, x2, y2,"conf:", box[4])
                    # print("conf:", box[4])
                    img_draw.rectangle((x1, y1, x2, y2), outline="red")

                image.show()
                image.save(os.path.join(save_path, img))

if __name__ == '__main__':
    detect()
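The detector above leans on utils.iou, utils.nms and utils.convert_to_square from tools.utils, whose source is not shown in this post. The sketches below are illustrative reconstructions of the two key helpers under the conventions used above (boxes are [x1, y1, x2, y2, conf] arrays), not the actual tools.utils code:

import numpy as np

def iou(box, boxes, isMin=False):
    # Overlap between one box and an array of boxes
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    # Intersection rectangle coordinates
    xx1 = np.maximum(box[0], boxes[:, 0])
    yy1 = np.maximum(box[1], boxes[:, 1])
    xx2 = np.minimum(box[2], boxes[:, 2])
    yy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
    if isMin:  # O-network stage: overlap over the smaller area suppresses nested boxes
        return inter / np.minimum(box_area, areas)
    return inter / (box_area + areas - inter)  # standard IoU: overlap over the union

def convert_to_square(boxes):
    # Expand each box to a square about its centre, side = max(w, h)
    square = boxes.copy()
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    side = np.maximum(w, h)
    cx = boxes[:, 0] + w / 2
    cy = boxes[:, 1] + h / 2
    square[:, 0] = cx - side / 2
    square[:, 1] = cy - side / 2
    square[:, 2] = square[:, 0] + side
    square[:, 3] = square[:, 1] + side
    return square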


Summary

There are many problems in face detection:

  1. In the first stage of MTCNN, the image pyramid calls the very shallow P-Net many times, so data is repeatedly copied from memory to video memory and back. These copies are expensive and can cost more time than the computation itself.

  2. The running speed of the P network dominates the whole model; the R and O stages spend time on cropping and file operations; the Python for loops run serially; weak hardware makes all of this worse.
    Reason: the image pyramid is expensive, the inverse calculation is not done with tensor and matrix operations, cropping is not done with slices, and higher-level languages generally run slower.
    Solution: adjust the zoom ratio to the actual needs, and vectorize the code with tensor and matrix operations (see the sketch after this list).

  3. Remember to call eval() when running the model for inference.

  4. The most effective fix for poor results is adding more accurately labelled data.

  5. Don't use plain ReLU here; it performs poorly because zeroing the negative half-axis loses information (a quick comparison follows below).
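On the "tensor and matrix operations" solution in point 2: the per-index Python loop in __pnet_detect can be replaced by one batched computation. A minimal sketch under the same stride-2 / 12-pixel-window assumptions (the function name and interface are illustrative, not from the original code):

import torch

def boxes_from_featmap(cls, offset, scale, threshold=0.6, stride=2, side_len=12):
    # cls: [H, W] confidences; offset: [4, H, W] offsets, as in __pnet_detect
    idxs = torch.nonzero(torch.gt(cls, threshold))  # [K, 2] (row, col) indices
    if idxs.shape[0] == 0:
        return torch.empty(0, 5)
    rows = idxs[:, 0].float()
    cols = idxs[:, 1].float()
    _x1 = cols * stride / scale                     # column index -> x on the original image
    _y1 = rows * stride / scale                     # row index -> y
    _x2 = (cols * stride + side_len) / scale
    _y2 = (rows * stride + side_len) / scale
    ow, oh = _x2 - _x1, _y2 - _y1
    off = offset[:, idxs[:, 0], idxs[:, 1]]         # [4, K] offsets at the kept positions
    x1 = _x1 + ow * off[0]
    y1 = _y1 + oh * off[1]
    x2 = _x2 + ow * off[2]
    y2 = _y2 + oh * off[3]
    conf = cls[idxs[:, 0], idxs[:, 1]]
    return torch.stack([x1, y1, x2, y2, conf], dim=1)  # [K, 5] boxes, same layout as __box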
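On point 5: the MTCNN paper uses PReLU, which learns a slope for the negative half-axis instead of zeroing it. A quick comparison:

import torch
from torch import nn

x = torch.tensor([-1.0, 0.5])
print(nn.ReLU()(x))   # tensor([0.0000, 0.5000]) - negative information is zeroed out
print(nn.PReLU()(x))  # tensor([-0.2500, 0.5000], grad_fn=...) - default slope 0.25 keeps it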

A first-pass summary: corrections and suggestions are welcome.
My focus is object recognition; anyone interested is welcome to get in touch.
WX:
