Previously in this series
Hello everyone, this Vision Transformer series has reached its sixth day, and the difficulty keeps growing as we go. I believe you have already learned ResNet, ViT, and DeiT. Today we study Swin Transformer, the ICCV 2021 best paper. Swin Transformer has swept the major leaderboards this year: on the COCO minival object-detection leaderboard, the original Swin paper sits in eighth place, and the seven entries above it are all methods built on top of Swin, https://paperswithcode.com/sota/object-detection-on-coco-minival?p=end-to-end-semi-supervised-object-detection
. If today's notes can help you, it is truly my honor.
The difference between Swin and ViT
The difference between Swin and ViT can be illustrated with a figure.
Compared with ViT, which directly downsamples by a factor of 16, Swin introduces the concept of windows and uses several different downsampling rates, and the computation in different windows does not interfere with each other. Compared with ViT, this greatly reduces the amount of computation and effectively improves performance.
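To make this concrete, here is a small back-of-the-envelope sketch of my own (assuming a 224x224 input, ViT's single 16x patchify, and Swin's 4x/8x/16x/32x stage strides):

H = W = 224

# ViT works at one fixed scale: 16x downsampling -> 14*14 = 196 tokens
vit_tokens = (H // 16) * (W // 16)

# Swin builds a hierarchy of four scales: 56x56, 28x28, 14x14, 7x7 token grids
swin_tokens = [(H // s) * (W // s) for s in (4, 8, 16, 32)]

print(vit_tokens)   # 196
print(swin_tokens)  # [3136, 784, 196, 49]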
Patch layer explanation
If you have studied ViT, you should know that ViT has a PatchEmbedding layer, which maps a batch of RGB images from four dimensions to three dimensions:
[B, C, H, W] -> [B, H*W/N/N, C*N*N]   (where N is the patch size)
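For instance, here is a small illustrative calculation of that shape mapping (my own example, assuming a 224x224 RGB input and patch size N = 16 as in ViT-B/16):

B, C, H, W, N = 8, 3, 224, 224, 16

in_shape  = [B, C, H, W]                      # [8, 3, 224, 224]
out_shape = [B, H * W // N // N, C * N * N]   # [8, 196, 768]
print(in_shape, '->', out_shape)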
In Swin, the role of PatchEmbedding is taken over by the Patch Partition layer. Since Mr. Zhu's slides have not been released yet, the slides of the Bilibili uploader 霹雳吧啦Wz are quoted here.
Unlike ViT, Swin Transformer downsamples multiple times, and the first downsampling uses a 4x4 convolution kernel.
After that, the number of channels is converted from 48 to C through a fully connected layer, also known as the Linear Embedding layer. The overall picture looks like this.
However, in the code implementation of Swin Transformer, Patch Partition and Linear Embedding are combined into a single step: a 4x4 convolution with stride 4 directly converts the number of channels from 3 to C.
import paddle
import paddle.nn as nn
class PatchEmbed(nn.Layer):
    """2D Image to Patch Embedding"""
    def __init__(self, patch_size=4, in_c=3, embed_dim=96, norm_layer=nn.LayerNorm):
        super().__init__()
        patch_size = (patch_size, patch_size)
        self.patch_size = patch_size
        self.in_chans = in_c
        self.embed_dim = embed_dim
        self.proj = nn.Conv2D(in_c, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x):
        _, _, H, W = x.shape

        # # Padding: if H and W of the input image are not integer multiples of patch_size,
        # # padding is required
        # pad_input = (H % self.patch_size[0] != 0) or (W % self.patch_size[1] != 0)
        # if pad_input:
        #     # pad the last 3 dimensions:
        #     # (W_left, W_right, H_top, H_bottom, C_front, C_back)
        #     x = F.pad(x, (0, self.patch_size[1] - W % self.patch_size[1],
        #                   0, self.patch_size[0] - H % self.patch_size[0],
        #                   0, 0))

        # downsample by a factor of patch_size
        x = self.proj(x)
        _, _, H, W = x.shape
        # flatten:   [B, C, H, W] -> [B, C, HW]
        # transpose: [B, C, HW]   -> [B, HW, C]
        x = paddle.transpose(x.flatten(2), (0, 2, 1))
        x = self.norm(x)
        print(x.shape)
        return x, H, W
model = PatchEmbed()
paddle.summary(model, (8, 3, 224, 224))
[8, 3136, 96]
---------------------------------------------------------------------------
 Layer (type)       Input Shape          Output Shape         Param #
===========================================================================
   Conv2D-1      [[8, 3, 224, 224]]     [8, 96, 56, 56]        4,704
  LayerNorm-1     [[8, 3136, 96]]       [8, 3136, 96]           192
===========================================================================
Total params: 4,896
Trainable params: 4,896
Non-trainable params: 0
---------------------------------------------------------------------------
Input size (MB): 4.59
Forward/backward pass size (MB): 36.75
Params size (MB): 0.02
Estimated Total Size (MB): 41.36
---------------------------------------------------------------------------
{'total_params': 4896, 'trainable_params': 4896}
PatchMerging explanation
Swin needs to downsample four times in total: the first downsampling is done by the PatchEmbed layer, and the remaining three are done by PatchMerging. Each PatchMerging halves the width and height of the feature map and doubles the number of channels (C -> 2C). The illustration quoted here is also from the slides of the Bilibili uploader 霹雳吧啦Wz.
First, a 2x2 window is used, so each window contains four pixels. The pixels at the same position within every window are gathered, which yields four feature maps like those in the second picture. These four feature maps are then concatenated along the channel dimension to get 4C channels, a LayerNorm is applied, and finally a Linear layer maps the channels from 4C down to 2C. In this way,
the shape of x changes as
[H/4, W/4, C] -> [H/8, W/8, 2*C]
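To make the window grouping concrete, here is a tiny numeric sketch (a toy 4x4 single-channel map I made up; the strided slices match the ones used in the PatchMerging code below):

# toy feature map: batch 1, 4x4 spatial, 1 channel, values 0..15
x = paddle.arange(16, dtype='float32').reshape((1, 4, 4, 1))

x0 = x[:, 0::2, 0::2, :]  # top-left pixel of every 2x2 window
x1 = x[:, 1::2, 0::2, :]  # bottom-left
x2 = x[:, 0::2, 1::2, :]  # top-right
x3 = x[:, 1::2, 1::2, :]  # bottom-right

print(x0.flatten().numpy())                       # [ 0.  2.  8. 10.]
print(paddle.concat([x0, x1, x2, x3], -1).shape)  # [1, 2, 2, 4] -> spatial halved, 4*C channels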
class PatchMerging(nn.Layer):
    r""" Patch Merging Layer.

    Args:
        dim (int): Number of input channels.
        norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm
    """
    def __init__(self, dim, norm_layer=nn.LayerNorm):
        super().__init__()
        self.dim = dim
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias_attr=False)
        self.norm = norm_layer(4 * dim)

    def forward(self, x, H=224 // 4, W=224 // 4):
        """
        x: B, H*W, C
        """
        B, L, C = x.shape
        print(x.shape)
        assert L == H * W, "input feature has wrong size"

        x = paddle.reshape(x, (B, H, W, C))

        # # Padding: if H and W of the feature map are not integer multiples of 2,
        # # padding is required
        # pad_input = (H % 2 == 1) or (W % 2 == 1)
        # if pad_input:
        #     # pad the last 3 dimensions, starting from the last dimension and moving forward:
        #     # (C_front, C_back, W_left, W_right, H_top, H_bottom)
        #     # note that the tensor layout here is [B, H, W, C], so this differs a bit from the official docs
        #     x = F.pad(x, (0, 0, 0, W % 2, 0, H % 2))

        x0 = x[:, 0::2, 0::2, :]  # [B, H/2, W/2, C]
        x1 = x[:, 1::2, 0::2, :]  # [B, H/2, W/2, C]
        x2 = x[:, 0::2, 1::2, :]  # [B, H/2, W/2, C]
        x3 = x[:, 1::2, 1::2, :]  # [B, H/2, W/2, C]
        x = paddle.concat([x0, x1, x2, x3], -1)  # [B, H/2, W/2, 4*C]
        x = paddle.reshape(x, (B, -1, 4 * C))    # [B, H/2*W/2, 4*C]

        x = self.norm(x)
        x = self.reduction(x)  # [B, H/2*W/2, 2*C]
        print(x.shape)
        return x
model = PatchMerging(96)
paddle.summary(model, (8, 3136, 96))
[8, 3136, 96]
[8, 784, 192]
---------------------------------------------------------------------------
 Layer (type)       Input Shape          Output Shape         Param #
===========================================================================
  LayerNorm-2     [[8, 784, 384]]       [8, 784, 384]           768
   Linear-1       [[8, 784, 384]]       [8, 784, 192]         73,728
===========================================================================
Total params: 74,496
Trainable params: 74,496
Non-trainable params: 0
---------------------------------------------------------------------------
Input size (MB): 9.19
Forward/backward pass size (MB): 27.56
Params size (MB): 0.28
Estimated Total Size (MB): 37.03
---------------------------------------------------------------------------
{'total_params': 74496, 'trainable_params': 74496}
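As a quick sanity check, the two modules defined above can also be chained directly (a small sketch of my own, assuming an 8x3x224x224 input):

embed = PatchEmbed()      # [8, 3, 224, 224] -> [8, 3136, 96]
merge = PatchMerging(96)  # [8, 3136, 96]    -> [8, 784, 192]

x = paddle.randn((8, 3, 224, 224))
x, H, W = embed(x)        # H = W = 56 after the 4x downsampling
x = merge(x, H, W)
print(x.shape)            # [8, 784, 192]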
Swin Transformer Block explanation
Similar to the Block layer of ViT, Swin Transformer replaces MSA with W-MSA and SW-MSA, and nothing else changes.
W-MSA and SW-MSA are a bit involved, so we will explain them tomorrow. Today, let's implement the rest of the Block module first.
As can be seen, the Block layer first passes through a LayerNorm, then W-MSA or SW-MSA, then a Dropout or DropPath layer, followed by a shortcut (residual) connection; after that comes another LayerNorm, an Mlp, another Dropout or DropPath layer, and a second shortcut connection. That is the whole Block layer, as sketched below.
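Put as a minimal runnable sketch (placeholders only: nn.Identity stands in for W-MSA/SW-MSA and for DropPath, and the Dropouts are omitted, so this is just to show the two residual branches, not the real layers):

dim = 96
norm1, norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
attn = nn.Identity()       # placeholder for W-MSA / SW-MSA
drop_path = nn.Identity()  # placeholder for Dropout / DropPath
mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

x = paddle.randn((8, 3136, dim))
x = x + drop_path(attn(norm1(x)))  # branch 1: LayerNorm -> (S)W-MSA -> DropPath -> shortcut
x = x + drop_path(mlp(norm2(x)))   # branch 2: LayerNorm -> Mlp -> DropPath -> shortcut
print(x.shape)                     # [8, 3136, 96]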
Next, I will first implement the Mlp layer that the Block needs.
Mlp explanation
Mr. Zhu has explained the Mlp clearly before: a Linear layer first expands the number of channels by 4x, then comes a GELU activation, then a Dropout layer, then a Linear layer maps the channels back to the original number, followed by another Dropout layer.
class Mlp(nn.Layer):
    """ MLP as used in Vision Transformer, MLP-Mixer and related networks """
    def __init__(self, in_features, hidden_features=None, out_features=None,
                 act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.drop1 = nn.Dropout(drop)
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop2 = nn.Dropout(drop)

    def forward(self, x):
        print(x.shape)
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.drop2(x)
        print(x.shape)
        return x
model = Mlp(768)
paddle.summary(model, (8, 197, 768))
[8, 197, 768]
[8, 197, 768]
---------------------------------------------------------------------------
 Layer (type)       Input Shape          Output Shape         Param #
===========================================================================
   Linear-2       [[8, 197, 768]]       [8, 197, 768]         590,592
    GELU-1        [[8, 197, 768]]       [8, 197, 768]            0
   Dropout-1      [[8, 197, 768]]       [8, 197, 768]            0
   Linear-3       [[8, 197, 768]]       [8, 197, 768]         590,592
   Dropout-2      [[8, 197, 768]]       [8, 197, 768]            0
===========================================================================
Total params: 1,181,184
Trainable params: 1,181,184
Non-trainable params: 0
---------------------------------------------------------------------------
Input size (MB): 4.62
Forward/backward pass size (MB): 46.17
Params size (MB): 4.51
Estimated Total Size (MB): 55.29
---------------------------------------------------------------------------
{'total_params': 1181184, 'trainable_params': 1181184}
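Note that the quick test above passes only Mlp(768), so hidden_features falls back to in_features; inside the Swin block the hidden width is expanded by mlp_ratio instead. A small sketch of that expanded form (assuming dim = 96 and mlp_ratio = 4, matching the block below):

mlp = Mlp(96, hidden_features=96 * 4)  # 96 -> 384 -> 96, as used inside the block
y = mlp(paddle.randn((8, 3136, 96)))
print(y.shape)                         # [8, 3136, 96]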
SwinTransformerBlock layer implementation
The following is the implementation of the Block layer. It uses a WindowAttention module, which is where W-MSA and SW-MSA are implemented; for now we only define an empty placeholder for it.
class WindowAttention(nn.Layer):
    def __init__(self, dim, window_size, num_heads, qkv_bias, attn_drop, proj_drop):
        super().__init__()
        pass

    def forward(self, x):
        return x, x
class SwinTransformerBlock(nn.Layer):
    r""" Swin Transformer Block.

    dim (int): Number of input channels.
    num_heads (int): Number of attention heads.
    window_size (int): Window size.
    shift_size (int): Shift size for SW-MSA.
    mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
    qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
    drop (float, optional): Dropout rate. Default: 0.0
    attn_drop (float, optional): Attention dropout rate. Default: 0.0
    drop_path (float, optional): Stochastic depth rate. Default: 0.0
    act_layer (nn.Module, optional): Activation layer. Default: nn.GELU
    norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm
    """
    def __init__(self, dim, num_heads, window_size=7, shift_size=0,
                 mlp_ratio=4., qkv_bias=True, drop=0., attn_drop=0., drop_path=0.,
                 act_layer=nn.GELU, norm_layer=nn.LayerNorm):
        super().__init__()
        self.dim = dim
        self.num_heads = num_heads
        self.window_size = window_size
        self.shift_size = shift_size
        self.mlp_ratio = mlp_ratio
        assert 0 <= self.shift_size < self.window_size, "shift_size must in 0-window_size"

        self.norm1 = norm_layer(dim)
        self.attn = WindowAttention(
            dim, window_size=(self.window_size, self.window_size), num_heads=num_heads,
            qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop)

        self.drop_path = nn.Dropout(0)  # placeholder; DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

    def forward(self, x):
        H, W = 56, 56
        B, L, C = x.shape
        assert L == H * W, "input feature has wrong size"

        shortcut = x
        x = self.norm1(x)
        x = paddle.reshape(x, (B, H, W, C))

        # # pad feature maps to an integer multiple of the window size
        # pad_l = pad_t = 0
        # pad_r = (self.window_size - W % self.window_size) % self.window_size
        # pad_b = (self.window_size - H % self.window_size) % self.window_size
        # x = F.pad(x, (0, 0, pad_l, pad_r, pad_t, pad_b))
        _, Hp, Wp, _ = x.shape

        #--------------------------------------------------------------------------------------------------#
        # W-MSA / SW-MSA is not handled for now; we will cover it next time
        #--------------------------------------------------------------------------------------------------#
        # # cyclic shift
        # if self.shift_size > 0:
        #     shifted_x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
        # else:
        #     shifted_x = x
        #     attn_mask = None

        # # partition windows
        # x_windows = window_partition(shifted_x, self.window_size)                    # [nW*B, Mh, Mw, C]
        # x_windows = x_windows.view(-1, self.window_size * self.window_size, C)       # [nW*B, Mh*Mw, C]

        # # W-MSA/SW-MSA
        # attn_windows = self.attn(x_windows, mask=attn_mask)                          # [nW*B, Mh*Mw, C]

        # # merge windows
        # attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)  # [nW*B, Mh, Mw, C]
        # shifted_x = window_reverse(attn_windows, self.window_size, Hp, Wp)           # [B, H', W', C]

        # # reverse cyclic shift
        # if self.shift_size > 0:
        #     x = torch.roll(shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
        # else:
        #     x = shifted_x

        # if pad_r > 0 or pad_b > 0:
        #     # remove the padding added above
        #     x = x[:, :H, :W, :].contiguous()
        #--------------------------------------------------------------------------------------------------#

        x = paddle.reshape(x, (B, H * W, C))
        x = shortcut + self.drop_path(x)
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x
model = SwinTransformerBlock(96, 3)
paddle.summary(model, (8, 3136, 96))
[8, 3136, 96]
[8, 3136, 96]
---------------------------------------------------------------------------
 Layer (type)       Input Shape          Output Shape         Param #
===========================================================================
  LayerNorm-3     [[8, 3136, 96]]       [8, 3136, 96]           192
   Dropout-3      [[8, 3136, 96]]       [8, 3136, 96]            0
  LayerNorm-4     [[8, 3136, 96]]       [8, 3136, 96]           192
   Linear-4       [[8, 3136, 96]]       [8, 3136, 384]         37,248
    GELU-2        [[8, 3136, 384]]      [8, 3136, 384]           0
   Dropout-4      [[8, 3136, 384]]      [8, 3136, 384]           0
   Linear-5       [[8, 3136, 384]]      [8, 3136, 96]          36,960
   Dropout-5      [[8, 3136, 96]]       [8, 3136, 96]            0
     Mlp-2        [[8, 3136, 96]]       [8, 3136, 96]            0
===========================================================================
Total params: 74,592
Trainable params: 74,592
Non-trainable params: 0
---------------------------------------------------------------------------
Input size (MB): 9.19
Forward/backward pass size (MB): 330.75
Params size (MB): 0.28
Estimated Total Size (MB): 340.22
---------------------------------------------------------------------------
{'total_params': 74592, 'trainable_params': 74592}
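One loose end: the block above uses nn.Dropout(0) as a stand-in for DropPath (stochastic depth). A minimal DropPath sketch in Paddle could look like the following (my own sketch of the common per-sample stochastic-depth formulation, not the official Swin code):

class DropPath(nn.Layer):
    """Drop entire residual paths per sample (stochastic depth) during training."""
    def __init__(self, drop_prob=0.):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0. or not self.training:
            return x
        keep_prob = 1 - self.drop_prob
        # one random keep/drop decision per sample, broadcast over the remaining dims
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        random_tensor = keep_prob + paddle.rand(shape, dtype=x.dtype)
        random_tensor = paddle.floor(random_tensor)  # binarize to 0 or 1
        return x / keep_prob * random_tensor         # rescale kept paths to preserve the expectation

With such a layer, self.drop_path in the block could then be defined as DropPath(drop_path) if drop_path > 0. else nn.Identity().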
Summary
Today we have covered everything except W-MSA and SW-MSA; tomorrow we will implement them. The above is my personal understanding. If there are any errors, please point them out.
The pictures above are all from the Bilibili uploader 霹雳吧啦Wz; many thanks.
I'm Lao Mengxin
I have reached gold level on AI Studio and lit up 5 badges. Come follow me and I'll follow back~ https://aistudio.baidu.com/aistudio/personalcenter/thirdview/553083