ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
Paszke, A., Chaurasia, A., Kim, S., & Culurciello, E. (2016). ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv preprint arXiv:1606.02147.
# Initial block of the model:
#
#            Input
#           /     \
#          /       \
#   maxpool2d    conv2d-3x3
#          \       /
#           \     /
#         concatenate

# Upsampling bottleneck:
#
#      Bottleneck Input
#         /        \
#        /          \
#  conv2d-1x1   convTrans2d-1x1
#       |            | PReLU
#       |       convTrans2d-3x3
#       |            | PReLU
#       |       convTrans2d-1x1
#       |            |
#  maxunpool2d   Regularizer
#        \          /
#         \        /
#       Summing + PReLU
#
# Params:
#   projection_ratio - ratio between input and output channels
#   relu - if True: ReLU is used as the activation function; else PReLU is used

# Regular|Dilated|Downsampling bottlenecks:
#
#      Bottleneck Input
#         /        \
#        /          \
#  maxpooling2d  conv2d-1x1
#       |            | PReLU
#       |        conv2d-3x3
#       |            | PReLU
#       |        conv2d-1x1
#       |            |
#   Padding2d    Regularizer
#        \          /
#         \        /
#       Summing + PReLU
#
# Params:
#   dilation (bool) - if True: creates a dilated bottleneck
#   down_flag (bool) - if True: creates a downsampling bottleneck
#   projection_ratio - ratio between input and output channels
#   relu - if True: ReLU is used as the activation function; else PReLU is used
#   p - dropout ratio

# Asymmetric bottleneck:
#
#      Bottleneck Input
#         /        \
#        /          \
#       |        conv2d-1x1
#       |            | PReLU
#       |        conv2d-1x5
#       |            |
#       |        conv2d-5x1
#       |            | PReLU
#       |        conv2d-1x1
#       |            |
#   Padding2d    Regularizer
#        \          /
#         \        /
#       Summing + PReLU
#
# Params:
#   projection_ratio - ratio between input and output channels
Paper structure
-
Abstract
-
Introduction
Current image segmentation networks are mostly built on the VGG16 architecture, which brings a large number of parameters and long inference times.
It is worth thinking about Google's newer EfficientNet architecture in this context.
-
Related Work
Existing approaches rely on large architectures with numerous parameters.
-
Network architecture
-
Design choices
bottleneck:
Downsampling bottleneck:
The main branch consists of three convolution layers:
First, a 2×2 projection convolution with stride 2 performs the downsampling;
Then the main convolution (three variants: ordinary convolution, asymmetric (factorized) convolution, or dilated convolution);
Followed by a 1×1 expansion convolution. Note that each convolution layer is followed by BatchNorm and PReLU.
The skip branch consists of max pooling and a padding layer:
Max pooling is responsible for extracting context information;
Padding fills the extra channels so the two branches can be summed, with a PReLU after the residual fusion (see the NumPy sketch below).
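To make the channel padding concrete, here is a minimal NumPy sketch (batch, spatial, and channel sizes are assumed for illustration) of what the skip branch of a downsampling bottleneck does: pool spatially, then zero-pad the channel axis to match the main branch before the residual sum.

import numpy as np

x = np.random.rand(1, 64, 64, 16)      # (B, H, W, C_in), assumed shapes
pooled = x[:, ::2, ::2, :]             # strided stand-in for 2x2 max pooling
c_out = 64                             # channels produced by the main branch
pad = np.zeros(pooled.shape[:3] + (c_out - pooled.shape[-1],))
skip = np.concatenate([pooled, pad], axis=-1)
print(skip.shape)                      # (1, 32, 32, 64), ready for the summing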
Bottleneck without downsampling:
The main branch consists of three convolution layers:
First, a 1×1 projection;
Then the main convolution (three variants: ordinary convolution, asymmetric (factorized) convolution, or dilated convolution);
Followed by a 1×1 expansion. Again, each convolution layer is followed by BatchNorm and PReLU.
The skip branch is a direct identity mapping (only downsampling increases the channel count, so no padding layer is needed here).
A PReLU is applied after the residual fusion.

Feature map resolution: the spatial information lost through downsampling hurts edge accuracy, yet the output must have the same resolution as the input, so strong downsampling demands equally strong upsampling. FCN handles this with skip connections; SegNet reuses max-pooling indices. In this paper, the input is compressed first, so only a small feature map enters the main network, stripping away some of the visual redundancy of the image. The advantage of downsampling, though, is a larger receptive field and more context information for classification.
FCN's solution is to feed encoder-stage feature maps into the decoder to restore spatial information.
SegNet's solution is to store the max-pooling indices from the encoder stage and reuse them for upsampling in the decoder stage.
ENet adopts SegNet's approach, which reduces memory requirements. To gather richer context on top of that, it uses dilated convolutions to enlarge the receptive field (see the sketch below).

Post-processing modules: CRFs and RNNs can be appended to improve accuracy.
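As a quick illustration (a sketch, not the paper's code): a 3×3 kernel with dilation rate d spans an effective extent of k + (k - 1)(d - 1), so dilation enlarges the receptive field without adding weights.

from tensorflow.keras import layers

# A 3x3 conv with dilation_rate=2 spans a 5x5 window using only 9 weights:
# effective extent = 3 + (3 - 1) * (2 - 1) = 5
inp = layers.Input(shape=(64, 64, 128))
out = layers.Conv2D(128, (3, 3), dilation_rate=(2, 2), padding='same')(inp)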
-
Stages
The ENet model is roughly divided into five stages:
-initial: the initialization module. On the left is a 3×3 convolution with stride 2; on the right, max pooling. The results of the two branches are concatenated, merging the channels (13 conv filters + 3 pooled input channels = 16), which significantly reduces memory.
-stage 1: encoder stage. It contains 5 bottlenecks: the first downsamples, followed by 4 repeated regular bottlenecks.
-stage 2-3: encoder stage. Bottleneck 2.0 of stage 2 downsamples, followed by dilated and asymmetric (factorized) convolutions. Stage 3 has no downsampling; everything else is the same.
-stage 4-5: the decoder stage. It is simple: each upsampling bottleneck is followed by ordinary bottlenecks (two in stage 4, one in stage 5).
-
Design choices
Early downsampling: processing high-resolution input early consumes a lot of compute, so ENet's initialization module aggressively reduces the input size. The rationale is that visual information is highly redundant in space and can be compressed into a more efficient representation.
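A rough back-of-the-envelope check (the 512×512 input size is an assumption, not from the paper): halving the spatial resolution twice shrinks the per-layer cost of a convolution by a factor of 16 for the same channel count.

h, w = 512, 512
cost_ratio = (h * w) / ((h // 4) * (w // 4))
print(cost_ratio)  # 16.0 - feature maps at 1/4 resolution are 16x cheaper per conv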
Decoder size: compared with the mirror-symmetric encoder and decoder in SegNet, ENet's encoder and decoder are asymmetric, consisting of a larger encoder and a smaller decoder.
Nonlinear operations: ReLU is not necessarily useful; applying ReLU here actually reduced accuracy. The analysis is that the network has too few layers, so it is not deep enough to discard information quickly; the learnable PReLU is used instead.
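In Keras this is one line (a sketch of the layer, not the paper's code): with shared_axes=[1, 2], PReLU learns a single negative slope per channel, so it can recover ReLU (slope 0) or the identity (slope 1) if that is what the data prefers.

from tensorflow.keras import layers

# One learnable negative slope per channel, shared over the spatial axes
act = layers.PReLU(shared_axes=[1, 2])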
Factorizing filters: an n×n convolution kernel is split into n×1 and 1×n kernels (proposed in Inception v3). This effectively reduces the parameter count and increases the receptive field the model can afford.
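The parameter saving is easy to verify (per input/output channel pair, bias ignored):

n = 5
full = n * n         # a 5x5 kernel: 25 weights
factorized = 2 * n   # 5x1 followed by 1x5: 10 weights
print(full, factorized)  # 25 10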
Dilated convolutions: dilated convolutions effectively enlarge the receptive field, and accuracy increases slightly; used well, they improve IoU by about 4 points. They are interleaved with other bottlenecks rather than stacked back-to-back.
Regularization: because the dataset itself is small, the model overfits quickly. L2 weight decay did not work well. Stochastic depth was acceptable, but on reflection dropping a residual branch is a special case of Spatial Dropout, so Spatial Dropout was chosen and works comparatively better.
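A small sanity check of the difference (a sketch with assumed shapes): SpatialDropout2D zeroes entire feature maps rather than individual activations, which suits convolutional features whose neighboring pixels are strongly correlated.

import tensorflow as tf

x = tf.ones((1, 4, 4, 8))
out = tf.keras.layers.SpatialDropout2D(rate=0.5)(x, training=True)
# Each of the 8 channels is either all zeros or uniformly scaled by 1/(1-rate)
print(tf.reduce_sum(out, axis=[1, 2]))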
-
Results
-
Conclusion
Reference link: https://github.com/srihari-humbarwadi/ENet-A-Deep-Neural-Network-Architecture-for-Real-Time-Semantic-Segmentation/blob/master/batch_training.py
Code
from tensorflow.keras import backend as K
from tensorflow.keras.layers import (Input, Conv2D, Conv2DTranspose, MaxPooling2D,
                                     UpSampling2D, BatchNormalization, PReLU, ReLU,
                                     SpatialDropout2D, Permute, ZeroPadding2D,
                                     Add, concatenate, Activation)
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.models import Model


def initial_block(tensor):
    # 13 conv filters concatenated with the 3 pooled input channels -> 16 channels
    conv = Conv2D(filters=13, kernel_size=(3, 3), strides=(2, 2), padding='same',
                  name='initial_block_conv', kernel_initializer='he_normal')(tensor)
    pool = MaxPooling2D(pool_size=(2, 2), name='initial_block_pool')(tensor)
    concat = concatenate([conv, pool], axis=-1, name='initial_block_concat')
    return concat
def bottleneck_encoder(tensor, nfilters, downsampling=False, dilated=False,
                       asymmetric=False, normal=False, drate=0.1, name=''):
    y = tensor
    skip = tensor
    stride = 1
    ksize = 1
    if downsampling:
        stride = 2
        ksize = 2
        skip = MaxPooling2D(pool_size=(2, 2), name=f'max_pool_{name}')(skip)
        # Zero-pad the channel axis up to nfilters: swap W and C, pad, swap back
        skip = Permute((1, 3, 2), name=f'permute_1_{name}')(skip)  # (B, H, W, C) -> (B, H, C, W)
        ch_pad = nfilters - K.int_shape(tensor)[-1]
        skip = ZeroPadding2D(padding=((0, 0), (0, ch_pad)), name=f'zeropadding_{name}')(skip)
        skip = Permute((1, 3, 2), name=f'permute_2_{name}')(skip)  # (B, H, C, W) -> (B, H, W, C)

    # Projection: 1x1 conv (or 2x2 with stride 2 when downsampling)
    y = Conv2D(filters=nfilters // 4, kernel_size=(ksize, ksize),
               kernel_initializer='he_normal', strides=(stride, stride),
               padding='same', use_bias=False, name=f'1x1_conv_{name}')(y)
    y = BatchNormalization(momentum=0.1, name=f'bn_1x1_{name}')(y)
    y = PReLU(shared_axes=[1, 2], name=f'prelu_1x1_{name}')(y)

    # Main convolution: regular, asymmetric (5x1 then 1x5), or dilated
    if normal:
        y = Conv2D(filters=nfilters // 4, kernel_size=(3, 3),
                   kernel_initializer='he_normal', padding='same',
                   name=f'3x3_conv_{name}')(y)
    elif asymmetric:
        y = Conv2D(filters=nfilters // 4, kernel_size=(5, 1),
                   kernel_initializer='he_normal', padding='same',
                   use_bias=False, name=f'5x1_conv_{name}')(y)
        y = Conv2D(filters=nfilters // 4, kernel_size=(1, 5),
                   kernel_initializer='he_normal', padding='same',
                   name=f'1x5_conv_{name}')(y)
    elif dilated:
        y = Conv2D(filters=nfilters // 4, kernel_size=(3, 3),
                   kernel_initializer='he_normal',
                   dilation_rate=(dilated, dilated), padding='same',
                   name=f'dilated_conv_{name}')(y)
    y = BatchNormalization(momentum=0.1, name=f'bn_main_{name}')(y)
    y = PReLU(shared_axes=[1, 2], name=f'prelu_{name}')(y)

    # Expansion back to nfilters, then the Spatial Dropout regularizer
    y = Conv2D(filters=nfilters, kernel_size=(1, 1),
               kernel_initializer='he_normal', use_bias=False,
               name=f'final_1x1_{name}')(y)
    y = BatchNormalization(momentum=0.1, name=f'bn_final_{name}')(y)
    y = SpatialDropout2D(rate=drate, name=f'spatial_dropout_final_{name}')(y)

    y = Add(name=f'add_{name}')([y, skip])
    y = PReLU(shared_axes=[1, 2], name=f'prelu_out_{name}')(y)
    return y
def bottleneck_decoder(tensor, nfilters, upsampling=False, normal=False, name=''):
    y = tensor
    skip = tensor
    if upsampling:
        # Skip branch: 1x1 conv to change channels, then nearest-neighbour upsample
        skip = Conv2D(filters=nfilters, kernel_size=(1, 1),
                      kernel_initializer='he_normal', strides=(1, 1),
                      padding='same', use_bias=False,
                      name=f'1x1_conv_skip_{name}')(skip)
        skip = UpSampling2D(size=(2, 2), name=f'upsample_skip_{name}')(skip)

    y = Conv2D(filters=nfilters // 4, kernel_size=(1, 1),
               kernel_initializer='he_normal', strides=(1, 1), padding='same',
               use_bias=False, name=f'1x1_conv_{name}')(y)
    y = BatchNormalization(momentum=0.1, name=f'bn_1x1_{name}')(y)
    y = PReLU(shared_axes=[1, 2], name=f'prelu_1x1_{name}')(y)

    if upsampling:
        y = Conv2DTranspose(filters=nfilters // 4, kernel_size=(3, 3),
                            kernel_initializer='he_normal', strides=(2, 2),
                            padding='same', name=f'3x3_deconv_{name}')(y)
    elif normal:
        y = Conv2D(filters=nfilters // 4, kernel_size=(3, 3), strides=(1, 1),
                   kernel_initializer='he_normal', padding='same',
                   name=f'3x3_conv_{name}')(y)
    y = BatchNormalization(momentum=0.1, name=f'bn_main_{name}')(y)
    y = PReLU(shared_axes=[1, 2], name=f'prelu_{name}')(y)

    y = Conv2D(filters=nfilters, kernel_size=(1, 1),
               kernel_initializer='he_normal', use_bias=False,
               name=f'final_1x1_{name}')(y)
    y = BatchNormalization(momentum=0.1, name=f'bn_final_{name}')(y)

    y = Add(name=f'add_{name}')([y, skip])
    y = ReLU(name=f'relu_out_{name}')(y)
    return y
def ENET(input_shape=(None, None, 3), nclasses=11):
    print('. . . . .Building ENet. . . . .')
    img_input = Input(input_shape)
    x = initial_block(img_input)

    # Stage 1: one downsampling bottleneck + 4 regular bottlenecks
    x = bottleneck_encoder(x, 64, downsampling=True, normal=True, name='1.0', drate=0.01)
    for i in range(1, 5):
        x = bottleneck_encoder(x, 64, normal=True, name=f'1.{i}', drate=0.01)

    # Stage 2: downsampling, then interleaved regular/dilated/asymmetric bottlenecks
    x = bottleneck_encoder(x, 128, downsampling=True, normal=True, name='2.0')
    x = bottleneck_encoder(x, 128, normal=True, name='2.1')
    x = bottleneck_encoder(x, 128, dilated=2, name='2.2')
    x = bottleneck_encoder(x, 128, asymmetric=True, name='2.3')
    x = bottleneck_encoder(x, 128, dilated=4, name='2.4')
    x = bottleneck_encoder(x, 128, normal=True, name='2.5')
    x = bottleneck_encoder(x, 128, dilated=8, name='2.6')
    x = bottleneck_encoder(x, 128, asymmetric=True, name='2.7')
    x = bottleneck_encoder(x, 128, dilated=16, name='2.8')

    # Stage 3: same pattern as stage 2, but without the downsampling bottleneck
    x = bottleneck_encoder(x, 128, normal=True, name='3.0')
    x = bottleneck_encoder(x, 128, dilated=2, name='3.1')
    x = bottleneck_encoder(x, 128, asymmetric=True, name='3.2')
    x = bottleneck_encoder(x, 128, dilated=4, name='3.3')
    x = bottleneck_encoder(x, 128, normal=True, name='3.4')
    x = bottleneck_encoder(x, 128, dilated=8, name='3.5')
    x = bottleneck_encoder(x, 128, asymmetric=True, name='3.6')
    x = bottleneck_encoder(x, 128, dilated=16, name='3.7')

    # Stages 4-5: decoder - upsampling bottlenecks followed by regular ones
    x = bottleneck_decoder(x, 64, upsampling=True, name='4.0')
    x = bottleneck_decoder(x, 64, normal=True, name='4.1')
    x = bottleneck_decoder(x, 64, normal=True, name='4.2')
    x = bottleneck_decoder(x, 16, upsampling=True, name='5.0')
    x = bottleneck_decoder(x, 16, normal=True, name='5.1')

    img_output = Conv2DTranspose(nclasses, kernel_size=(2, 2), strides=(2, 2),
                                 kernel_initializer='he_normal', padding='same',
                                 name='image_output')(x)
    img_output = Activation('softmax')(img_output)

    model = Model(inputs=img_input, outputs=img_output, name='ENET')
    print('. . . . .Build Completed. . . . .')
    return model
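A hedged usage sketch (the 512×512 input size and 11 CamVid-style classes are assumptions): spatial dimensions must be divisible by 8 because of the three 2× downsamplings.

model = ENET(input_shape=(512, 512, 3), nclasses=11)
model.summary()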
The loss function is defined as follows:
def dice_coeff(y_true, y_pred):
    smooth = 1.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    score = (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
    return score


def dice_loss(y_true, y_pred):
    return 1 - dice_coeff(y_true, y_pred)


def total_loss(y_true, y_pred):
    # Weighted combination of cross-entropy and Dice loss
    return binary_crossentropy(y_true, y_pred) + 3 * dice_loss(y_true, y_pred)
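This combined loss can then be plugged into the compile step of the usage sketch above (the optimizer choice follows the paper, which trains with Adam; the metric is an assumption):

model.compile(optimizer='adam', loss=total_loss, metrics=['accuracy'])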