An Impartial Take to the CNN vs Transformer Robustness Contest
Context
- It is often claimed that vision transformers (ViTs) surpass convolutional neural networks (CNNs) in calibration, robustness to covariate shift, and out-of-distribution (OoD) performance.
- This paper questions the methodology used to reach those conclusions and proposes new experiments.
- The models compared were ConvNeXt and BiT (CNNs) vs. vanilla ViT and Swin Transformer (ViTs).
Takeaways
- ViTs and CNNs are both susceptible to simplicity bias (a.k.a. shortcut learning). See figure above.
- ViTs and CNNs perform comparably on OoD detection tasks (a common baseline is sketched after this list).
- No single model achieved the lowest Expected Calibration Error (ECE) across all covariate-shift experiments, and the most accurate model was not the best calibrated (a minimal ECE computation is sketched below).
- A low ECE alone is not enough to assess a classifier's reliability; it is better to complement it with other techniques, such as the Prediction Rejection Ratio (PRR, also sketched below).
- The robustness contest between CNNs and ViTs seems to have no clear winner.