|
---
license: cc-by-4.0
library_name: timm
pipeline_tag: audio-classification
---
|
|
|
# Model card for vit_base_patch16_1024_128.audiomae_as2m_ft_as20k |
|
|
|
A Vision Transformer (ViT) for audio. Pretrained on AudioSet-2M with the self-supervised Masked Autoencoder (MAE) method, then fine-tuned on AudioSet-20k.
|
|
|
- This is a port of the AudioMAE ViT-B/16 weights for use with `timm`. The naming convention follows that of other `timm` ViT models.
|
- See the original repo here: https://github.com/facebookresearch/AudioMAE |
|
- For the AudioSet-2M pre-trained checkpoint (without AudioSet-20k fine-tuning), see https://huggingface.co/gaunernst/vit_base_patch16_1024_128.audiomae_as2m
|
|
|
|
|
## Model Details |
|
- **Model Type:** Audio classification / feature backbone |
|
- **Papers:** |
|
- Masked Autoencoders that Listen: https://arxiv.org/abs/2207.06405 |
|
- **Pretrain Dataset:** AudioSet-2M |
|
- **Original:** https://github.com/facebookresearch/AudioMAE |
|
|
|
## Model Usage |
|
### Audio Classification and Embeddings |
|
|
|
```python
import timm
import torch
import torch.nn.functional as F
from torchaudio.compliance import kaldi

# NOTE: for timm<0.9.11, you also need to pass `global_pool='avg'`.
# If only embeddings are needed, pass `num_classes=0` (see the sketch below).
model = timm.create_model("hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k", pretrained=True)
model = model.eval()

# AudioSet mel-spectrogram statistics used for normalization
MEAN = -4.2677393
STD = 4.5689974

audio = torch.randn(1, 10 * 16_000)  # make sure the input is 16 kHz
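# For a real clip, load and resample to 16 kHz first, e.g. (a sketch that
# assumes a local file "clip.wav" and `import torchaudio`):
#   audio, sr = torchaudio.load("clip.wav")
#   audio = torchaudio.functional.resample(audio, sr, 16_000)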
|
melspec = kaldi.fbank(audio, htk_compat=True, window_type="hanning", num_mel_bins=128)  # shape (n_frames, 128)

# AudioMAE only accepts 1024-frame input: pad shorter clips, crop longer ones
if melspec.shape[0] < 1024:
    melspec = F.pad(melspec, (0, 0, 0, 1024 - melspec.shape[0]))
else:
    melspec = melspec[:1024]
melspec = (melspec - MEAN) / (STD * 2)  # normalize; AudioMAE scales by 2 * std

melspec = melspec.view(1, 1, 1024, 128)  # add batch dim and channel dim
output = model(melspec)  # logits, shape (1, 527) for the 527 AudioSet classes

# for classification
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

# for embeddings, create the model with `num_classes=0` instead;
# the output then has shape (1, 768)
```
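Following the `num_classes=0` note above, here is a minimal sketch of embedding extraction. It reuses the `melspec` prepared above; as before, timm<0.9.11 additionally needs `global_pool='avg'`:

```python
# create the model without its classifier head so the forward pass
# returns pooled features instead of class logits
backbone = timm.create_model(
    "hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k",
    pretrained=True,
    num_classes=0,
)
backbone = backbone.eval()

embeddings = backbone(melspec)  # shape (1, 768)
```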
|
|
|
## Citation |
|
```bibtex
@inproceedings{huang2022amae,
  title     = {Masked Autoencoders that Listen},
  author    = {Huang, Po-Yao and Xu, Hu and Li, Juncheng and Baevski, Alexei and Auli, Michael and Galuba, Wojciech and Metze, Florian and Feichtenhofer, Christoph},
  booktitle = {NeurIPS},
  year      = {2022}
}
```
|
```bibtex
@misc{rw2019timm,
  author       = {Ross Wightman},
  title        = {PyTorch Image Models},
  year         = {2019},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  doi          = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
```