A ConvNet for the 2020s

An aggregation of tricks and design choices applied to ResNet so that it can outperform Swin Transformer. The authors gradually add “modernization” pieces, one at a time.

Training techniques

  • Before touching the architecture, the ResNet is retrained with a modern recipe (AdamW, longer training, stronger augmentation and regularization), which on its own gives a sizable accuracy boost.

Macro design

  • Stage ratio: Adjust the number of blocks per stage from ResNet-50’s (3, 4, 6, 3) to (3, 3, 9, 3). This is adapted from Swin-T’s stage compute ratio of 1:1:3:1; larger Swin variants use 1:1:9:1. (See the sketch after this list.)
  • “patchify stem”: ViT uses an aggressive kernel size (14 or 16, one kernel per patch) with non-overlapping convolution. Swin Transformer uses a smaller patch size to accommodate its multi-stage design. ConvNeXt takes a similar approach: a 4×4, stride-4 convolution layer (non-overlapping) as the stem at the very beginning of the network. Comment: we’ll see large kernel sizes later. The 4×4 here is the same patch size Swin Transformer uses when splitting the image into patches. Plus, the performance doesn’t change much. I don’t think this is a particularly meaningful change.
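
Below is a minimal PyTorch sketch of the two macro-design changes above, assuming ConvNeXt-T’s numbers (stage depths (3, 3, 9, 3), first-stage width 96); an illustration, not the official implementation.

```python
import torch
import torch.nn as nn

# Stage ratio: ResNet-50's (3, 4, 6, 3) blocks per stage become (3, 3, 9, 3),
# mirroring Swin-T's 1:1:3:1 stage compute ratio.
depths = (3, 3, 9, 3)

# "Patchify" stem: one non-overlapping conv (kernel 4, stride 4) replaces
# ResNet's 7x7 stride-2 conv + max pool.
stem = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(stem(x).shape)  # torch.Size([1, 96, 56, 56]); 224 / 4 = 56
```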

ResNeXt-ify

  • Use ResNeXt’s idea of grouped convolution. Here we use depthwise convolution, a special case of grouped convolution where the number of groups equals the number of channels (see the sketch below). On its own this reduces FLOPs and accuracy; following ResNeXt’s “use more groups, expand width” principle, the network width is then increased from 64 to 96 channels to match Swin-T, which recovers accuracy at the cost of increased FLOPs.
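
A quick sketch of the depthwise convolution, assuming the 96-channel width mentioned above (illustrative, not the paper’s code):

```python
import torch
import torch.nn as nn

dim = 96  # width raised from ResNet's 64 to match Swin-T
# groups == channels: each channel gets its own spatial filter, so this layer
# does spatial mixing only; channel mixing is left to the 1x1 convs.
dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

x = torch.randn(1, dim, 56, 56)
print(dwconv(x).shape)  # torch.Size([1, 96, 56, 56])
```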

Inverted bottleneck

  • Transformer MLP blocks have it (a hidden dimension four times wider than the input), so we adopt it too. (Though it doesn’t bring much performance gain, and it makes the network run slower.)
  • Configuration (c) in the paper’s Figure 3 goes one step further and moves the depthwise conv layer up to the top of the block, in preparation for the larger kernels discussed next (see the sketch after this list). Comment: more nonsense; on its own it only makes things slower.
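
A sketch of the inverted bottleneck ordering, configuration (b) in the paper’s Figure 3; normalization and activation layers are omitted here to keep the focus on the 96 → 384 → 96 channel shape (my simplification, not the official block):

```python
import torch
import torch.nn as nn

dim = 96
# (b) inverted bottleneck: 1x1 expand -> depthwise 3x3 on the wide 384 channels -> 1x1 project.
# (c) would move the depthwise conv above the 1x1 layers; see the next sketch.
inverted_bottleneck = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                                # expand 96 -> 384
    nn.Conv2d(4 * dim, 4 * dim, kernel_size=3, padding=1, groups=4 * dim), # depthwise, now on 384 channels
    nn.Conv2d(4 * dim, dim, kernel_size=1),                                # project back 384 -> 96
)

x = torch.randn(1, dim, 56, 56)
print((x + inverted_bottleneck(x)).shape)  # residual around the block: (1, 96, 56, 56)
```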

Large kernel size

  • A larger kernel size means more computation, which is why we adopt configuration (c) above: the expensive spatial conv now runs on the narrow channel dimension first. The authors argue this is also learned from Transformers, where the self-attention layer (with its large receptive field) comes before the MLP layers.
  • They tried kernel sizes from 3×3 up to 11×11, and 7×7 is where the benefit saturates (see the sketch after this list). Comment: what a surprise; Swin Transformer’s window size is also 7×7.
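
Putting the two ideas together, here is a sketch of configuration (c) with the enlarged kernel: the depthwise conv sits at the top of the block, so the 7×7 spatial conv runs on 96 channels instead of 384 (again omitting norm and activation layers, which the micro-design section rearranges anyway):

```python
import torch
import torch.nn as nn

dim = 96
block = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # depthwise 7x7, moved to the top
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # expand
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # project back
)

x = torch.randn(1, dim, 56, 56)
print((x + block(x)).shape)  # (1, 96, 56, 56); kernel 7 with padding 3 keeps the spatial size
```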

Micro design

  • Replace ReLU with GELU. Accuracy stays exactly the same.
  • Remove all GELU activations in each block except the one between the two 1×1 conv layers. Interestingly, this improves performance. The change mimics Transformer blocks, which have fewer activation functions (only one, inside the MLP).
  • Also remove all but one BatchNorm layer per block. Slight improvement.
  • Replace BatchNorm with LayerNorm. With all the previous changes in place, this now gives better performance.
  • Separate out the downsampling layers. Between stages, a 2×2 conv layer with stride 2 handles spatial downsampling; in ResNet this was done by the stride-2 convs at the start of each stage. To stabilize training, LayerNorm layers are added wherever the spatial resolution changes. (See the sketch after this list.)
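
A minimal sketch of the block after all the micro-design changes, plus a separate downsampling layer, assuming a 96-channel stage and standard PyTorch modules (the official implementation uses a custom channels-first LayerNorm; GroupNorm with one group is used here as a simple stand-in):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim=96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # the only norm layer in the block
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv expressed as a Linear over channels
        self.act = nn.GELU()                     # the only activation in the block
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                        # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (N, H, W, C) so LayerNorm/Linear act on channels
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return residual + x

# Separate downsampling layer placed between stages; a norm layer before the
# strided conv helps stabilize training.
def downsample(dim_in, dim_out):
    return nn.Sequential(
        nn.GroupNorm(1, dim_in),  # stand-in for a channels-first LayerNorm
        nn.Conv2d(dim_in, dim_out, kernel_size=2, stride=2),
    )

x = torch.randn(1, 96, 56, 56)
print(Block(96)(x).shape)            # torch.Size([1, 96, 56, 56])
print(downsample(96, 192)(x).shape)  # torch.Size([1, 192, 28, 28])
```

Note that the block now contains exactly one normalization layer and one activation, matching the Transformer-block pattern the bullets above describe.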