Visual transformer on CIFAR10

Main ideas:

Configuration

Imports

Configuration

Data

Model

Utilities

Transformer

Attention

$$ O = V \mathrm{softmax}\left[\frac{1}{\sqrt{c}}K^{\intercal}Q + R\right] $$
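
The formula can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the notebook's implementation. It assumes that $Q$ and $K$ have shape $(c, N)$ with one column per position, that $V$ has shape $(c, N)$, that $R$ is an $(N, N)$ matrix of position-dependent logits, and that the softmax normalizes over the key index (the rows of $K^{\intercal}Q$), so that each output column is a weighted average of the columns of $V$. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, R):
    """Compute O = V softmax[K^T Q / sqrt(c) + R] for column-wise tokens."""
    c = Q.shape[0]
    # logits[i, j] scores key i against query j; R adds a position-only term.
    logits = K.T @ Q / np.sqrt(c) + R
    # Assumption: normalize over keys (axis 0), so every column sums to one.
    A = softmax(logits, axis=0)
    return V @ A

rng = np.random.default_rng(0)
c, N = 8, 16
Q = rng.normal(size=(c, N))
K = rng.normal(size=(c, N))
V = rng.normal(size=(c, N))
R = rng.normal(size=(N, N))
O = attention(Q, K, V, R)
```

Because each column of the attention matrix sums to one, passing a constant $V$ returns the same constant, which is a quick sanity check for the normalization axis.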

Embedding of patches
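
As an illustration of what a patch embedding does, the sketch below cuts an image into non-overlapping square patches, flattens each patch, and applies one shared linear map. The patch size, embedding width, random weights, and function name are assumptions made for the example; the notebook may build its embedding differently (for instance with convolutions).

```python
import numpy as np

def image_to_patches(img, p):
    """Split a (C, H, W) image into flattened p x p patches, shape (N, C*p*p)."""
    C, H, W = img.shape
    assert H % p == 0 and W % p == 0, "image size must be divisible by patch size"
    # Separate the patch grid index from the within-patch offset on each axis.
    x = img.reshape(C, H // p, p, W // p, p)
    # Reorder to (grid_row, grid_col, channel, row_in_patch, col_in_patch).
    x = x.transpose(1, 3, 0, 2, 4)
    return x.reshape((H // p) * (W // p), C * p * p)

rng = np.random.default_rng(0)
img = rng.normal(size=(3, 32, 32))         # one CIFAR10-sized image
patches = image_to_patches(img, p=4)       # 8 x 8 grid of patches, 48 values each

d = 64                                     # hypothetical embedding width
W_embed = rng.normal(size=(patches.shape[1], d)) / np.sqrt(patches.shape[1])
tokens = patches @ W_embed                 # one d-dimensional token per patch
```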

Main model

CutMix
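
For orientation, here is a minimal NumPy sketch of CutMix as it is commonly described: a rectangle from a randomly paired image in the batch is pasted into each image, and the labels are mixed in proportion to the pasted area. It is not the notebook's code. The function name, the `alpha` default, and the one-hot label format are assumptions.

```python
import numpy as np

def cutmix(images, labels, alpha=1.0, rng=None):
    """Apply one shared CutMix box to a batch.

    images: (B, C, H, W) array; labels: (B, num_classes) one-hot array.
    """
    rng = np.random.default_rng() if rng is None else rng
    B, C, H, W = images.shape
    lam = rng.beta(alpha, alpha)           # target share of the original image
    perm = rng.permutation(B)              # partner image for each sample

    # Box whose area is roughly (1 - lam) of the image, centered at random.
    cut = np.sqrt(1.0 - lam)
    h, w = int(H * cut), int(W * cut)
    cy, cx = rng.integers(H), rng.integers(W)
    y0, y1 = np.clip([cy - h // 2, cy + h // 2], 0, H)
    x0, x1 = np.clip([cx - w // 2, cx + w // 2], 0, W)

    mixed = images.copy()
    mixed[:, :, y0:y1, x0:x1] = images[perm][:, :, y0:y1, x0:x1]

    # Recompute lam from the clipped box so the labels match the pixels.
    lam = 1.0 - (y1 - y0) * (x1 - x0) / (H * W)
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed, mixed_labels

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 3, 32, 32))
labels = np.eye(10)[rng.integers(10, size=8)]
mixed, mixed_labels = cutmix(images, labels, rng=rng)
```

Recomputing the mixing weight after clipping matters because a box that hangs over the image border covers less area than the sampled value suggests.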

Training

Optimizer

Set up trainer

Start training

Model without attention

Only input-independent weights (relative position encoding) remain:
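
Assuming this variant is obtained from the attention formula in the Attention section by dropping the content term $\frac{1}{\sqrt{c}}K^{\intercal}Q$, the layer would reduce to a fixed, learned mixing of positions:

$$ O = V \mathrm{softmax}\left[R\right] $$

Under that assumption the mixing matrix $\mathrm{softmax}[R]$ is the same for every input image, and only $V$ depends on the input.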

Start training