Update Readme

README.md
extra_gated_fields:
  Company/university: text
  Website: text
---

ConvNeXt-Tiny-AT is an audio tagging CNN model, trained on AudioSet (balanced + unbalanced subsets). It reached 0.471 mAP on the AudioSet test set.

The model expects 10-second audio clips sampled at 32 kHz as input.
It outputs logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
Two methods are also provided to get scene embeddings (a single vector per file) and frame-level embeddings; see below.
The scene embedding is obtained from the frame-level embeddings: mean pooling is applied over the frequency dimension, followed by mean pooling combined with max pooling over the time dimension (see the sketch after the frame-level example below).

# Install

This code is based on our repo: https://github.com/topel/audioset-convnext-inf

Note that the checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1

```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```
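
The Zenodo checkpoint above can also be fetched programmatically. This is a hedged sketch, not from the repo: whether the `.pth` file holds a raw state dict or a wrapped checkpoint is not verified here, and `from_pretrained` below remains the simpler path.

```python
import torch

# Download the Zenodo checkpoint into the local torch hub cache (sketch only;
# the layout of the loaded object is an assumption).
CKPT_URL = "https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1"
checkpoint = torch.hub.load_state_dict_from_url(CKPT_URL, map_location="cpu")
```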

# Usage

Below is an example of how to instantiate our model `convnext_tiny_471mAP.pth`.

```python
# 1. visit hf.co/topel/ConvNeXt-Tiny-AT and accept user conditions
# 2. visit hf.co/settings/tokens to create an access token
# 3. instantiate pretrained model

import os

import numpy as np
import torch
import torchaudio

from audioset_convnext_inf.pytorch.convnext import ConvNeXt

model = ConvNeXt.from_pretrained(
    "topel/ConvNeXt-Tiny-AT",
    map_location="cpu",
    use_auth_token="ACCESS_TOKEN_GOES_HERE",
)

print(
    "# params:",
    sum(param.numel() for param in model.parameters() if param.requires_grad),
)

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

if "cuda" in str(device):
    model = model.to(device)
```

Output:
```
# params: 28222767
```

## Inference: get logits and probabilities

```python
sample_rate = 32000
audio_target_length = 10 * sample_rate  # 10 s

AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FPATH = os.path.join("/path/to/audio", AUDIO_FNAME)

waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("ERROR: the sampling rate is not 32 kHz:", sample_rate_)

waveform = waveform.to(device)

print("\nInference on " + AUDIO_FNAME + "\n")

with torch.no_grad():
    model.eval()
    output = model(waveform)

logits = output["clipwise_logits"]
print("logits size:", logits.size())

probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())

threshold = 0.25
sample_labels = np.where(probs[0].clone().detach().cpu() > threshold)[0]
print("Predicted labels using activity threshold 0.25:\n")
print(sample_labels)
```

Output:
```
logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])
Predicted labels using activity threshold 0.25:

[ 0 137 138 139 151 506]
```
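
To map these indices to human-readable tags, one option is Google's official AudioSet label CSV. The snippet below is a hypothetical helper, not part of the repo; it assumes `pandas` and the public label map URL.

```python
import pandas as pd

# Hypothetical helper: look up display names for the predicted class indices
# in Google's official AudioSet label map.
LABELS_URL = "http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv"
class_map = pd.read_csv(LABELS_URL)

# `sample_labels` comes from the inference snippet above.
print(class_map.loc[sample_labels, "display_name"].tolist())
```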

## Get audio scene embeddings

```python
with torch.no_grad():
    model.eval()
    output = model.forward_scene_embeddings(waveform)

print("\nScene embedding, shape:", output.size())
```

Output:
```
Scene embedding, shape: torch.Size([1, 768])
```
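
Scene embeddings can serve as fixed clip-level features. Below is a hypothetical usage sketch; `other_waveform` stands for a second 10-second clip loaded the same way as above.

```python
import torch.nn.functional as F

# Hypothetical sketch: compare two clips via their scene embeddings.
with torch.no_grad():
    emb_a = model.forward_scene_embeddings(waveform)
    emb_b = model.forward_scene_embeddings(other_waveform)  # second clip (assumed)

similarity = F.cosine_similarity(emb_a, emb_b)  # shape (1,)
print("cosine similarity:", similarity.item())
```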

## Get frame-level embeddings

```python
with torch.no_grad():
    model.eval()
    output = model.forward_frame_embeddings(waveform)

print("\nFrame-level embeddings, shape:", output.size())
```

Output:
```
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```
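
As described in the introduction, the scene embedding is derived from these frame-level embeddings. The following is a minimal sketch of that pooling, assuming the layout above is `(batch, channels, time, frequency)`; summing the mean- and max-pooled time vectors is an assumption consistent with the 768-dimensional scene embedding.

```python
import torch

# Minimal sketch of the pooling described in the introduction (assumed layout
# and summation; `forward_scene_embeddings` is the supported way to get this).
frame_embs = torch.randn(1, 768, 31, 7)  # stand-in for forward_frame_embeddings output

x = frame_embs.mean(dim=3)             # mean pooling over the frequency dim -> (1, 768, 31)
scene = x.mean(dim=2) + x.amax(dim=2)  # mean + max pooling over the time dim -> (1, 768)
print(scene.shape)  # torch.Size([1, 768])
```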

## Citation

```bibtex
@inproceedings{Pellegrini2023,
  Title = {{Adapting a ConvNeXt model to audio classification on AudioSet}},
  Author = {{Pellegrini}, Thomas and {Khalfaoui-Hassani}, Ismail and {Labb\'e}, Etienne and {Masquelier}, Timoth\'ee},
  Booktitle = {Proc. Interspeech 2023},
  Address = {Dublin},
  Month = {August},
  Year = {2023}
}
```