gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k

Audio Classification


gaunernst committed on Nov 20, 2023

Commit 3058f72, 1 parent: 6746e7c

Update README.md
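
The updated NOTE in this commit says that `global_pool='avg'` only needs to be passed for timm<0.9.11. One way a caller might handle both cases is sketched below. The helper names `parse_version` and `create_model_kwargs` are hypothetical and not part of timm or of this README; the version threshold is taken from the NOTE itself.

```python
import re


def parse_version(version: str) -> tuple:
    """Return the leading numeric components of a version string as ints.

    Hypothetical helper: stops at the first component that does not start
    with a digit, so "0.9.11.dev0" becomes (0, 9, 11).
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def create_model_kwargs(timm_version: str) -> dict:
    """Build keyword arguments for timm.create_model (hypothetical helper).

    Per the README note, timm<0.9.11 also needs global_pool='avg'.
    """
    kwargs = {"pretrained": True}
    if parse_version(timm_version) < (0, 9, 11):
        kwargs["global_pool"] = "avg"
    return kwargs


# Assumed usage, with timm installed:
#   import timm
#   model = timm.create_model(
#       "hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k",
#       **create_model_kwargs(timm.__version__),
#   )
```

This keeps the README call unchanged for newer timm releases and adds the extra argument only where the NOTE says it is needed.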

Files changed (1)

README.md +5 -4


@@ -21,6 +21,7 @@ A Vision Transformer (ViT) for audio. Pretrained on AudioSet-2M with Self-Superv
 ## Model Usage
 ### Audio Classification and Embeddings
 ```python
 from urllib.request import urlopen
 import timm
@@ -32,14 +33,14 @@ img = Image.open(urlopen(
     'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
 ))
-# NOTE: `global_pool='avg'` is required
+# NOTE: for timm<0.9.11, you also need to pass `global_pool='avg'`
 # if only embeddings are needed, pass `num_classes=0`
-model = timm.create_model("hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k", pretrained=True, global_pool='avg')
+model = timm.create_model("hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k", pretrained=True)
 model = model.eval()
-# TODO: torchaudio.compliance.kaldi.fbank
+# TODO: HF preprocessor (AST)
 audio = torch.randn(1, 10 * 16_000)
-melspec = fbank(
+melspec = kaldi.fbank(
     audio,
     htk_compat=True,
     sample_frequency=16_000,