CoAtNet

According to arXiv:2106.04803 [cs.CV]:

In terms of generalization capability: $$ \text{C-C-C-C} \approx \text{C-C-C-T} \geq \text{C-C-T-T} > \text{C-T-T-T} \gg \text{ViT}_{\mathrm{REL}} $$

For model capacity: $$ \text{C-C-T-T} \approx \text{C-T-T-T} > \text{ViT}_{\mathrm{REL}} > \text{C-C-C-T} > \text{C-C-C-C} $$

For transferability: $$ \text{C-C-T-T} > \text{C-T-T-T} $$

Figure: CoAtNet model architecture (coatnet.png)


Imports

Configuration

Data

Model

Utilities

Squeeze-and-Excitation, arXiv:1709.01507 [cs.CV]
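As a reminder of how the Squeeze-and-Excitation block works, here is a minimal NumPy sketch (not the implementation below; the weight names and the reduction ratio `r` are illustrative assumptions):

```python
import numpy as np

def squeeze_excite(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation on a feature map x of shape (C, H, W).

    Squeeze: global average pooling over the spatial dims.
    Excite: a two-layer bottleneck MLP with a sigmoid, producing
    one scale factor per channel, which rescales the input.
    """
    z = x.mean(axis=(1, 2))                    # squeeze: (C,)
    h = np.maximum(w1 @ z + b1, 0.0)           # reduction FC + ReLU: (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))   # expansion FC + sigmoid: (C,)
    return x * s[:, None, None]                # per-channel rescaling

# toy usage: 8 channels, reduction ratio r = 4
C, r = 8, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((C, 6, 6))
y = squeeze_excite(x,
                   rng.standard_normal((C // r, C)), np.zeros(C // r),
                   rng.standard_normal((C, C // r)), np.zeros(C))
assert y.shape == x.shape
```

The sigmoid keeps every scale factor in $(0, 1)$, so the block can only attenuate channels, acting as a lightweight channel-attention gate.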

MobileNetV2, arXiv:1801.04381 [cs.CV]

MnasNet, arXiv:1807.11626 [cs.CV]

Attention

$$ O = V \mathrm{softmax}\left[\frac{1}{\sqrt{c}}K^{\intercal}Q\right] $$
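In this column convention the tokens are the columns of $Q$, $K$, $V$ (each of shape $c \times n$), and the softmax normalizes each column of $K^{\intercal}Q$. A small NumPy sketch, which also checks agreement with the more common row-token form $\mathrm{softmax}(QK^{\intercal}/\sqrt{c})\,V$:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

c, n = 16, 10                      # channel dim, number of tokens
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((c, n)) for _ in range(3))

# O = V softmax(K^T Q / sqrt(c)): each column of the score matrix
# is normalized, so column q of O is a convex mix of the columns of V.
P = softmax(K.T @ Q / np.sqrt(c), axis=0)
O = V @ P

# Equivalent row-token convention (rows are tokens): transposing
# everything recovers the familiar softmax(Q K^T / sqrt(c)) V.
A = softmax(Q.T @ K / np.sqrt(c), axis=-1)
assert np.allclose(O.T, A @ V.T)
```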

Calculation of indices for relative position encoding
The attention map is $H W \times H W$, but with relative position encoding the bias depends only on the offsets $(i - i', j - j')$, so it is stored in a $(2H - 1) \times (2W - 1)$ table $w$ and gathered by index.

Flattening of indices: $i,j \rightarrow W i +j$

We want $P[W i + j, W i' + j'] = w[i - i', j - j']$

Since the elements of $w$ are indexed from $0$ while $i - i' \in [-(H-1),\, H-1]$ and $j - j' \in [-(W-1),\, W-1]$, we shift both offsets: $P[W i + j, W i' + j'] = w[i - i' + H - 1,\, j - j' + W - 1]$

Flattening $w$ as well (its rows have length $2W - 1$): $$ P[(W i + j) H W + W i' + j'] = w[(i - i' + H - 1) (2 W - 1) + j - j' + W - 1] $$
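The identity above can be verified numerically. A small NumPy check, assuming a flattened table `w` of length $(2H-1)(2W-1)$ with row length $2W - 1$:

```python
import numpy as np

H, W = 3, 4
rng = np.random.default_rng(0)
w = rng.standard_normal((2 * H - 1) * (2 * W - 1))  # flattened table of relative biases

# Build the index map: entry (W*i + j, W*i' + j') of P picks element
# (i - i' + H - 1, j - j' + W - 1) of the 2D table, i.e. flat index
# (i - i' + H - 1)*(2W - 1) + j - j' + W - 1 of w.
idx = np.empty((H * W, H * W), dtype=np.int64)
for i in range(H):
    for j in range(W):
        for ip in range(H):
            for jp in range(W):
                idx[W * i + j, W * ip + jp] = \
                    (i - ip + H - 1) * (2 * W - 1) + (j - jp + W - 1)
P = w[idx]                                          # (H*W, H*W) bias matrix

# Cross-check against indexing the unflattened table directly.
w2d = w.reshape(2 * H - 1, 2 * W - 1)
for i in range(H):
    for j in range(W):
        for ip in range(H):
            for jp in range(W):
                assert P[W * i + j, W * ip + jp] == w2d[i - ip + H - 1, j - jp + W - 1]
```

Since `idx` depends only on $H$ and $W$, it can be computed once and reused every forward pass; only `w` is learned.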

Transformer

Full model

Training

Optimizer

Setup trainer

Start training