Transformer for speech command recognition

Ideas taken from wav2vec2, arXiv:2006.11477 [cs.CL]

Configuration

Imports

Load torchaudio:

For audio output/recording in the notebook:

Configuration

Dataset

DataLoader

Model

Feature Extractor

Transformer Encoder

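Attention: $$ O=V\,\mathrm{softmax}\left(\frac{1}{\sqrt{c}}K^{\intercal}Q\right), $$ where $V,K,Q\in\mathbb{R}^{c\times n}$, $c$ is the number of channels, and $n$ is the number of sequence elements. Here $K^{\intercal}Q\in\mathbb{R}^{n\times n}$, and the softmax is applied to each column (that is, over the key index), so that $O\in\mathbb{R}^{c\times n}$.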
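The formula above can be written out directly in PyTorch. This is a minimal sketch, not necessarily the implementation used in this notebook: the function name `attention` and the sizes `c = 4`, `n = 6` are illustrative assumptions, and the tensors use the channels-first layout $(c, n)$ from the formula.

```python
import math
import torch

def attention(V, K, Q):
    """Compute O = V softmax(K^T Q / sqrt(c)) for V, K, Q of shape (c, n)."""
    c = K.shape[0]
    # scores[i, j] compares key i with query j; shape (n, n)
    scores = K.transpose(-2, -1) @ Q / math.sqrt(c)
    # Normalize over the key index i, so every column sums to 1
    weights = torch.softmax(scores, dim=-2)
    # Each output column is a weighted average of the columns of V; shape (c, n)
    return V @ weights

torch.manual_seed(0)
c, n = 4, 6  # illustrative sizes
V, K, Q = torch.randn(3, c, n).unbind(0)
O = attention(V, K, Q)
```

Note that the softmax runs over `dim=-2` here because the tensors are channels-first; in the more common sequence-first layout $(n, c)$ the same operation is `softmax(Q @ K.T / sqrt(c), dim=-1) @ V`.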

It is possible to use torch.nn.MultiheadAttention:
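A minimal usage sketch, assuming channels-first input of shape (batch, c, n) as in the formula above; the sizes and variable names are illustrative and not taken from this notebook. Unlike the bare formula, `torch.nn.MultiheadAttention` also applies learned linear projections to the queries, keys, and values, plus an output projection, and it expects the channel dimension last, hence the transposes.

```python
import torch

torch.manual_seed(0)
batch, c, n, heads = 2, 16, 10, 4  # illustrative sizes; c must be divisible by heads

x = torch.randn(batch, c, n)  # channels-first features

# batch_first=True makes the module expect (batch, n, c)
mha = torch.nn.MultiheadAttention(embed_dim=c, num_heads=heads, batch_first=True)

x_seq = x.transpose(1, 2)             # (batch, n, c)
out, attn = mha(x_seq, x_seq, x_seq)  # self-attention: query = key = value
out = out.transpose(1, 2)             # back to (batch, c, n)
```

By default the returned attention weights `attn` are averaged over the heads and have shape (batch, n, n), with each row (one query) summing to 1.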

Classification Model

Training

Training loop

Optimizer

Start Training

Testing

Does not work in JupyterLab: