A ConvNet for the 2020s

Context
- This paper was published shortly after the surge of hype around vision transformers (mainly the vanilla ViT and the Swin Transformer) as alternatives to convolutional networks.
 
Takeaways
- A “modernized” ResNet (named ConvNeXt) can perform on par with or better than ViTs.
- ConvNeXt also performed as well as ViTs when pre-trained on large datasets, which challenges the view that ViTs scale up better.
- In general, CNNs have a more straightforward design (fewer specialized modules) and do not use global attention, whose complexity grows quadratically with the input size.
- Proposed modifications to the ResNet architecture (a minimal sketch follows this list):
    - Change the number of blocks per stage to (3, 3, 9, 3); originally it was (3, 4, 6, 3).
    - Use a 4 x 4 convolution with stride 4 (non-overlapping convolution) as the stem cell.
    - Use grouped convolutions (à la ResNeXt), more specifically depthwise convolutions, in the first layer of the block.
    - Use larger convolutions: 7 x 7 instead of 3 x 3.
    - Restructure the block as an inverted bottleneck: a layer with many filters sandwiched between two layers with fewer filters.
    - Use just one activation function per block and replace ReLU with GELU.
    - Use layer norm instead of batch norm.
    - Insert spatial downsampling layers (2 x 2 conv with stride 2) and normalization layers between stages.
    - The figure above (taken from the paper) compares the ConvNeXt block with the ResNet and Swin Transformer blocks.
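
Below is a minimal PyTorch sketch of how these modifications compose into a block, plus the patchify stem and a downsampling layer. It follows the list above rather than the authors' released code: the channel width of 96 is illustrative, and details such as layer scale and stochastic depth are left out.

```python
import torch
import torch.nn as nn


class ConvNeXtLikeBlock(nn.Module):
    """Sketch of a block following the modifications above: 7x7 depthwise conv,
    layer norm, inverted bottleneck (dim -> 4*dim -> dim) with a single GELU,
    and a residual connection. Layer scale and stochastic depth are omitted."""

    def __init__(self, dim: int):
        super().__init__()
        # 7x7 depthwise convolution (groups=dim makes it depthwise)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # Layer norm over the channel dimension (applied in channels-last layout)
        self.norm = nn.LayerNorm(dim)
        # Inverted bottleneck: expand to 4*dim, then project back to dim.
        # The 1x1 convolutions are written as Linear layers on the channels-last tensor.
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()  # single activation per block
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x                       # (N, C, H, W)
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)          # -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)          # -> (N, C, H, W)
        return residual + x


# Patchify stem: non-overlapping 4x4 convolution with stride 4, as listed above.
stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

# Spatial downsampling between stages: 2x2 convolution with stride 2
# (the paper also places a normalization layer around stage transitions, omitted here).
downsample = nn.Conv2d(96, 192, kernel_size=2, stride=2)

x = torch.randn(1, 3, 224, 224)
feats = stem(x)                            # (1, 96, 56, 56)
out = ConvNeXtLikeBlock(96)(feats)         # shape preserved: (1, 96, 56, 56)
print(out.shape)
```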
 
- The training recipe was also modified (a sketch follows the list):
    - 300 epochs instead of 90.
    - AdamW optimizer.
    - Learning rate linear warm-up followed by cosine decay.
    - Data augmentation: Mixup, CutMix, RandAugment, and Random Erasing.
    - Regularization with stochastic depth and label smoothing.
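
A rough sketch of the optimizer and schedule portion of such a recipe in PyTorch. The model is a stand-in, and the hyperparameter values (warm-up length, base learning rate, weight decay, smoothing factor) are illustrative assumptions, not taken from the paper.

```python
import math

import torch
import torch.nn as nn

# Stand-in model and illustrative hyperparameters (assumptions, not the paper's values).
model = nn.Linear(10, 10)
epochs, warmup_epochs, steps_per_epoch = 300, 20, 1000
base_lr = 4e-3

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)


def warmup_cosine(step: int) -> float:
    """Multiplier on base_lr: linear warm-up, then cosine decay towards zero."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# Label smoothing can be handled directly by the classification loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Mixup/CutMix/RandAugment/Random Erasing and stochastic depth would be added
# via the data pipeline and the model definition, respectively.
```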