|
---
license: cc-by-4.0
library_name: timm
pipeline_tag: audio-classification
---
|
|
|
# Model card for vit_base_patch16_1024_128.audiomae_as2m_ft_as20k |
|
|
|
A Vision Transformer (ViT) for audio. Pretrained on AudioSet-2M with the self-supervised Masked Autoencoder (MAE) method, then fine-tuned on AudioSet-20k.
|
|
|
- This is a port of the AudioMAE ViT-B/16 weights for use with `timm`. The naming convention follows that of other `timm` ViT models.
|
- See the original repo here: https://github.com/facebookresearch/AudioMAE |
|
- For the AudioSet-2M pre-trained checkpoint (without AudioSet-20k fine-tuning), see https://huggingface.co/gaunernst/vit_base_patch16_1024_128.audiomae_as2m
|
|
|
|
|
## Model Details |
|
- **Model Type:** Audio classification / feature backbone |
|
- **Papers:** |
|
- Masked Autoencoders that Listen: https://arxiv.org/abs/2207.06405 |
|
- **Pretrain Dataset:** AudioSet-2M |
|
- **Original:** https://github.com/facebookresearch/AudioMAE |
|
|
|
## Model Usage |
|
### Audio Classification and Embeddings |
|
|
|
```python
import timm
import torch
import torch.nn.functional as F
from torchaudio.compliance import kaldi

# NOTE: for timm<0.9.11, you also need to pass `global_pool='avg'`.
# If only embeddings are needed, pass `num_classes=0` (see the sketch below).
model = timm.create_model("hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k", pretrained=True)
model = model.eval()

# AudioSet mel-spectrogram statistics used for normalization
MEAN = -4.2677393
STD = 4.5689974

audio = torch.randn(1, 10 * 16_000)  # make sure the input is 16 kHz
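# For a real clip, load and resample to 16 kHz first, e.g. (a sketch that
# assumes a local file "clip.wav" and `import torchaudio`):
#   audio, sr = torchaudio.load("clip.wav")
#   audio = torchaudio.functional.resample(audio, sr, 16_000)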
|
melspec = kaldi.fbank(audio, htk_compat=True, window_type="hanning", num_mel_bins=128)  # shape (n_frames, 128)

# AudioMAE only accepts 1024-frame input: pad shorter clips, crop longer ones
if melspec.shape[0] < 1024:
    melspec = F.pad(melspec, (0, 0, 0, 1024 - melspec.shape[0]))
else:
    melspec = melspec[:1024]
melspec = (melspec - MEAN) / (STD * 2)  # normalize; AudioMAE scales by 2 * std

melspec = melspec.view(1, 1, 1024, 128)  # add batch dim and channel dim
output = model(melspec)  # logits, shape (1, 527) for the 527 AudioSet classes

# for classification
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

# for embeddings, create the model with `num_classes=0` instead;
# the output then has shape (1, 768)
```
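Following the `num_classes=0` note above, here is a minimal sketch of embedding extraction. It reuses the `melspec` prepared above; as before, timm<0.9.11 additionally needs `global_pool='avg'`:

```python
# create the model without its classifier head so the forward pass
# returns pooled features instead of class logits
backbone = timm.create_model(
    "hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k",
    pretrained=True,
    num_classes=0,
)
backbone = backbone.eval()

embeddings = backbone(melspec)  # shape (1, 768)
```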
|
|
|
## Citation |
|
```bibtex
@inproceedings{huang2022amae,
  title     = {Masked Autoencoders that Listen},
  author    = {Huang, Po-Yao and Xu, Hu and Li, Juncheng and Baevski, Alexei and Auli, Michael and Galuba, Wojciech and Metze, Florian and Feichtenhofer, Christoph},
  booktitle = {NeurIPS},
  year      = {2022}
}
```
|
```bibtex
@misc{rw2019timm,
  author       = {Ross Wightman},
  title        = {PyTorch Image Models},
  year         = {2019},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  doi          = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
```