Bottleneck Transformer

Based on the paper "Bottleneck Transformers for Visual Recognition", arXiv:2101.11605 [cs.CV].

Parts of the code are adapted from https://gist.github.com/aravindsrinivas/56359b79f0ce4449bcb04ab4b56a57a2

[Figure: classification.png]

[Figure: botnet_block.png]

[Figure: comparison.png]

BoT50 warrants longer training in order to show significant improvement over R50.

BoTNet (self-attention) benefits more from extra augmentations such as multi-scale jitter compared to ResNet (pure convolutions).

Replacing convolutions with self-attention is more efficient than stacking additional convolution layers.

Replacing just three spatial (3×3) convolutions with all2all attention improves the metrics more than stacking 50 additional layers of convolutions (R101), and is competitive with stacking 100 more (R152).

BoT50 does not provide significant gains over R50 on ImageNet, though it does reduce the parameter count while keeping computation comparable.

ResNets and SENets perform very well in the lower-accuracy regime, outperforming both EfficientNets and BoTNets.

EfficientNets may be better in terms of M.Adds, but do not map as well as BoTNets onto the latest hardware accelerators such as TPUs.

ResNets and SENets achieve strong performance in the improved EfficientNet training setting, strong enough to outperform all the EfficientNets.

Pure convolutional models such as ResNets and SENets remain the best-performing models up to the 83% top-1 accuracy regime.

We recommend using absolute position encodings for image classification.

Imports

Configuration
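
The configuration cell itself is not shown here; below is a minimal sketch of what it might hold. All field names and values are illustrative assumptions, not the notebook's actual settings.

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Illustrative values only; they are not the notebook's actual settings.
    image_size: int = 224        # input resolution
    batch_size: int = 256
    epochs: int = 100
    learning_rate: float = 0.1
    momentum: float = 0.9
    weight_decay: float = 1e-4
    heads: int = 4               # the paper's MHSA uses 4 heads

config = Config()
```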

Data
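
The data cell is likewise not shown; the sketch below assumes a torchvision ImageFolder pipeline with random scale jitter. The dataset path, resolution, and batch size are placeholders.

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Placeholder path and sizes; swap in the real dataset location and resolution.
train_tfms = T.Compose([
    T.RandomResizedCrop(224),    # random scale jitter + crop
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = torchvision.datasets.ImageFolder('path/to/train', transform=train_tfms)
train_loader = DataLoader(train_set, batch_size=256, shuffle=True,
                          num_workers=8, pin_memory=True)
```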

Model

Self Attention for vision

Attention: $$ O=V\,\mathrm{softmax}\left[\frac{1}{\sqrt{c}}\left(K^{\intercal}Q + P^{\intercal}Q\right)\right]. $$ Here $P$ is the position encoding, and the softmax normalizes over the key/position axis. $V,K,P,Q\in\mathbb{R}^{c\times n}$, where $c$ is the number of channels and $n$ is the number of spatial positions, so $K^{\intercal}Q,\,P^{\intercal}Q\in\mathbb{R}^{n\times n}$.
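
A minimal single-head PyTorch sketch of this attention, assuming learned absolute position encodings split into a height and a width component. The names `SelfAttention2d`, `pos_h`, and `pos_w` are illustrative, and the paper's MHSA uses 4 heads rather than one.

```python
import torch
from torch import nn
import torch.nn.functional as F


class SelfAttention2d(nn.Module):
    """All2all self-attention over the spatial positions of a feature map.

    Computes O = V softmax[(K^T Q + P^T Q) / sqrt(c)], with a learned absolute
    position encoding P. Single-head for clarity (the paper's MHSA uses 4 heads).
    """

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # 1x1 convolutions produce the query, key and value maps.
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        # Learned absolute position encoding, split into height and width parts
        # that are summed to give one c-dimensional vector per spatial position.
        self.pos_h = nn.Parameter(torch.randn(channels, height, 1) * 0.02)
        self.pos_w = nn.Parameter(torch.randn(channels, 1, width) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        assert (h, w) == (self.pos_h.shape[1], self.pos_w.shape[2]), \
            "input size must match the position encoding"
        n = h * w
        q = self.to_q(x).view(b, c, n)                    # [b, c, n]
        k = self.to_k(x).view(b, c, n)                    # [b, c, n]
        v = self.to_v(x).view(b, c, n)                    # [b, c, n]
        p = (self.pos_h + self.pos_w).view(c, n)          # [c, n]
        # Content-content and content-position logits: K^T Q + P^T Q, shape [b, n, n].
        logits = torch.einsum('bci,bcj->bij', k, q) + torch.einsum('ci,bcj->bij', p, q)
        attn = F.softmax(logits / c ** 0.5, dim=1)        # normalize over the key axis
        out = torch.einsum('bci,bij->bcj', v, attn)       # O = V softmax(...)
        return out.reshape(b, c, h, w)
```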

ResNet
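
A hedged sketch of how a ResNet bottleneck block can be turned into a BoT block by swapping the 3×3 spatial convolution for the self-attention module sketched above. The class name and layer layout are simplifications; the paper's strided variant (2×2 average pooling after the attention) is omitted.

```python
import torch
from torch import nn  # SelfAttention2d is the sketch from the previous block


class BoTBlock(nn.Module):
    """ResNet bottleneck block with the 3x3 convolution swapped for global self-attention."""

    def __init__(self, in_channels: int, bottleneck_channels: int, height: int, width: int):
        super().__init__()
        out_channels = 4 * bottleneck_channels
        # Project the shortcut when the channel count changes (no BN here, for brevity).
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False))
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, bottleneck_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels), nn.ReLU(inplace=True),
            SelfAttention2d(bottleneck_channels, height, width),   # replaces the 3x3 conv
            nn.BatchNorm2d(bottleneck_channels), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.net(x) + self.shortcut(x))
```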

Initialize position encoding:
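
The original initialization cell is not shown; the sketch below assumes the `pos_h`/`pos_w` parameters from the attention sketch above and re-initializes them with a small truncated normal.

```python
from torch import nn

def init_position_encoding(model: nn.Module, std: float = 0.02) -> None:
    # Re-initialize every learned absolute position encoding (the `pos_h` / `pos_w`
    # parameters in the sketch above) with a small truncated normal, so the
    # content-position logits P^T Q start close to zero.
    for name, param in model.named_parameters():
        if name.endswith('pos_h') or name.endswith('pos_w'):
            nn.init.trunc_normal_(param, std=std)
```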

Training

History

Optimizer
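
A sketch of a typical optimizer setup for this kind of model: SGD with momentum plus a cosine learning-rate schedule. This is a common recipe for ResNet-style training, not necessarily the notebook's actual choice.

```python
import torch
from torch import nn

def make_optimizer(model: nn.Module, epochs: int = 100):
    # SGD with momentum plus cosine learning-rate decay; the notebook's actual
    # optimizer and schedule may differ.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```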

Setup trainer

Start training
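
A minimal sketch of the loop a "start training" cell might run: standard cross-entropy training for one epoch plus a top-1 accuracy evaluation helper. Function names are illustrative.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, device='cuda'):
    # One epoch of standard supervised training with cross-entropy loss.
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()

def evaluate(model, loader, device='cuda'):
    # Top-1 accuracy on a held-out set.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total
```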