---
license: cc-by-4.0
library_name: timm
pipeline_tag: audio-classification
---

# Model card for vit_base_patch16_1024_128.audiomae_as2m_ft_as20k

A Vision Transformer (ViT) for audio. Pretrained on AudioSet-2M using the self-supervised Masked Autoencoder (MAE) method, then fine-tuned on AudioSet-20k.

- This is a port of the AudioMAE ViT-B/16 weights for use with `timm`. The naming convention follows that of other `timm` ViT models.
- See the original repo here: https://github.com/facebookresearch/AudioMAE
- For the AudioSet-2M pre-trained checkpoint (without AudioSet-20k fine-tuning), see https://huggingface.co/gaunernst/vit_base_patch16_1024_128.audiomae_as2m


## Model Details
- **Model Type:** Audio classification / feature backbone
- **Papers:**
  - Masked Autoencoders that Listen: https://arxiv.org/abs/2207.06405
- **Pretrain Dataset:** AudioSet-2M
- **Original:** https://github.com/facebookresearch/AudioMAE

## Model Usage
### Audio Classification and Embeddings

```python
import timm
import torch
import torch.nn.functional as F
from torchaudio.compliance import kaldi

# NOTE: for timm<0.9.11, you also need to pass `global_pool='avg'`
# if only embeddings are needed, pass `num_classes=0`
model = timm.create_model("hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k", pretrained=True)
model = model.eval()

MEAN = -4.2677393
STD = 4.5689974

audio = torch.randn(1, 10 * 16_000)  # make sure input is 16kHz
melspec = kaldi.fbank(audio, htk_compat=True, window_type="hanning", num_mel_bins=128)  # shape (n_frames, 128)

# AudioMAE only accepts 1024-frame input
if melspec.shape[0] < 1024:
    melspec = F.pad(melspec, (0, 0, 0, 1024 - melspec.shape[0]))
else:
    melspec = melspec[:1024]
melspec = (melspec - MEAN) / (STD * 2)

melspec = melspec.view(1, 1, 1024, 128)  # add batch dim and channel dim
output = model(melspec)

# classification: top-5 AudioSet classes with probabilities (in %)
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

# embeddings: if the model was created with `num_classes=0`, `output` is the
# pooled feature of shape (1, 768) instead of class logits
```
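The snippet above uses random noise as a stand-in for audio. As a minimal sketch (not from the original repo), a real recording could be loaded and resampled to the expected 16 kHz before the same fbank/padding/normalization steps; the file path below is a placeholder.

```python
import torchaudio
from torchaudio.compliance import kaldi

# hypothetical input file; any format torchaudio can decode works
waveform, sr = torchaudio.load("path/to/clip.wav")  # shape (channels, n_samples)
waveform = waveform.mean(dim=0, keepdim=True)       # mix down to mono, shape (1, n_samples)
if sr != 16_000:
    # the model expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

# then reuse the fbank, padding, and normalization steps shown above
melspec = kaldi.fbank(waveform, htk_compat=True, window_type="hanning", num_mel_bins=128)
```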

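For the embedding path mentioned in the comments above, a hedged sketch is shown below: creating the model with `num_classes=0` drops the classifier head so the forward pass returns the pooled 768-dim feature. `melspec_a` and `melspec_b` are hypothetical inputs prepared with the preprocessing shown earlier.

```python
import timm
import torch
import torch.nn.functional as F

# num_classes=0 removes the classification head; the model acts as a feature backbone
backbone = timm.create_model(
    "hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k",
    pretrained=True,
    num_classes=0,
)
backbone = backbone.eval()

with torch.inference_mode():
    emb_a = backbone(melspec_a)  # shape (1, 768)
    emb_b = backbone(melspec_b)  # shape (1, 768)

# compare two clips via cosine similarity of their embeddings
similarity = F.cosine_similarity(emb_a, emb_b)  # tensor of shape (1,), values in [-1, 1]
```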
## Citation
```bibtex
@inproceedings{huang2022amae,
  title = {Masked Autoencoders that Listen},
  author = {Huang, Po-Yao and Xu, Hu and Li, Juncheng and Baevski, Alexei and Auli, Michael and Galuba, Wojciech and Metze, Florian and Feichtenhofer, Christoph},
  booktitle = {NeurIPS},
  year = {2022}
}
```
```bibtex
@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
```