A ConvNet for the 2020s

Context
- This paper was published shortly after the surge of hype around vision transformers (mainly the vanilla ViT and the Swin Transformer) as alternatives to convolutional networks.
 
Takeaways
- A “modernized” ResNet (named ConvNeXt) can perform on par with or better than ViTs.
- ConvNeXt also performed as well as ViTs when pre-trained on large datasets, which challenges the view that ViTs scale up better.
- In general, CNNs have a more straightforward design (fewer specialized modules) and do not use global attention, whose complexity grows quadratically with the input size.
- Proposed modifications to the ResNet architecture (a minimal sketch follows this list):
    - Change the number of blocks per stage to (3, 3, 9, 3); originally it was (3, 4, 6, 3).
    - Use a 4 x 4 convolution with stride 4 (non-overlapping convolution) as the stem cell.
    - Use grouped convolutions (à la ResNeXt), more specifically depthwise convolutions, in the first layer of the block.
    - Use larger convolutions: 7 x 7 instead of 3 x 3.
    - Restructure the block as an inverted bottleneck: a layer with many filters sandwiched between two layers with fewer filters.
    - Use just one activation function per block and replace ReLU with GELU.
    - Use layer norm instead of batch norm.
    - Insert spatial downsampling layers (2 x 2 conv with stride 2) and normalization layers between stages.
    - The figure above (taken from the paper) compares the ConvNeXt block with the ResNet and Swin Transformer blocks.
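
Below is a minimal PyTorch sketch of how these modifications compose into a block, plus the patchify stem and a downsampling layer. It follows the list above rather than the authors' released code: the channel width of 96 is illustrative, and details such as layer scale and stochastic depth are left out.

```python
import torch
import torch.nn as nn


class ConvNeXtLikeBlock(nn.Module):
    """Sketch of a block following the modifications above: 7x7 depthwise conv,
    layer norm, inverted bottleneck (dim -> 4*dim -> dim) with a single GELU,
    and a residual connection. Layer scale and stochastic depth are omitted."""

    def __init__(self, dim: int):
        super().__init__()
        # 7x7 depthwise convolution (groups=dim makes it depthwise)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # Layer norm over the channel dimension (applied in channels-last layout)
        self.norm = nn.LayerNorm(dim)
        # Inverted bottleneck: expand to 4*dim, then project back to dim.
        # The 1x1 convolutions are written as Linear layers on the channels-last tensor.
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()  # single activation per block
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x                       # (N, C, H, W)
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)          # -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)          # -> (N, C, H, W)
        return residual + x


# Patchify stem: non-overlapping 4x4 convolution with stride 4, as listed above.
stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

# Spatial downsampling between stages: 2x2 convolution with stride 2
# (the paper also places a normalization layer around stage transitions, omitted here).
downsample = nn.Conv2d(96, 192, kernel_size=2, stride=2)

x = torch.randn(1, 3, 224, 224)
feats = stem(x)                            # (1, 96, 56, 56)
out = ConvNeXtLikeBlock(96)(feats)         # shape preserved: (1, 96, 56, 56)
print(out.shape)
```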
 
- The training recipe was also modified (a sketch follows the list):
    - 300 epochs instead of 90.
    - AdamW optimizer.
    - Learning rate linear warm-up followed by cosine decay.
    - Data augmentation: Mixup, CutMix, RandAugment, and Random Erasing.
    - Regularization with stochastic depth and label smoothing.
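
A rough sketch of the optimizer and schedule portion of such a recipe in PyTorch. The model is a stand-in, and the hyperparameter values (warm-up length, base learning rate, weight decay, smoothing factor) are illustrative assumptions, not taken from the paper.

```python
import math

import torch
import torch.nn as nn

# Stand-in model and illustrative hyperparameters (assumptions, not the paper's values).
model = nn.Linear(10, 10)
epochs, warmup_epochs, steps_per_epoch = 300, 20, 1000
base_lr = 4e-3

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)


def warmup_cosine(step: int) -> float:
    """Multiplier on base_lr: linear warm-up, then cosine decay towards zero."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# Label smoothing can be handled directly by the classification loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Mixup/CutMix/RandAugment/Random Erasing and stochastic depth would be added
# via the data pipeline and the model definition, respectively.
```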