---
tags:
- audio embeddings
- convnext-audio
- audioset
---

**ConvNeXt-Tiny-AT** is an audio tagging CNN model, trained on **AudioSet** (balanced + unbalanced subsets). It reached 0.471 mAP on the test set.

The model expects as input audio files of 10 seconds duration, sampled at 32 kHz.
It provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).

It also provides audio embeddings: the scene embedding is obtained from the frame-level embeddings, on which mean pooling is applied.
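
To get a clip into that expected format before calling the model, here is a minimal preprocessing sketch. It uses torchaudio, and the file name is purely illustrative:

```python
import torch
import torchaudio

SAMPLE_RATE = 32000
TARGET_LEN = 10 * SAMPLE_RATE  # 10 s at 32 kHz

# Illustrative path; any audio file readable by torchaudio works
waveform, sr = torchaudio.load("example.wav")  # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
if sr != SAMPLE_RATE:
    waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
# Zero-pad short clips and crop long ones to exactly 10 s
if waveform.shape[1] < TARGET_LEN:
    waveform = torch.nn.functional.pad(waveform, (0, TARGET_LEN - waveform.shape[1]))
else:
    waveform = waveform[:, :TARGET_LEN]
```
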
This code is based on our repo: https://github.com/topel/audioset-convnext-inf
```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```

Below is an example of how to instantiate our model, convnext_tiny_471mAP.pth. The import path of the `ConvNeXt` class and its `from_pretrained` helper are assumed to be the ones shipped with the audioset-convnext-inf package installed above; check them against the repo.

```python
import os
import numpy as np
import torch

# Assumed import path from the audioset-convnext-inf package
from audioset_convnext_inf.pytorch.convnext import ConvNeXt

model = ConvNeXt.from_pretrained("topel/ConvNeXt-Tiny-AT")
model.eval()
```

## Inference: get logits and probabilities

```python
sample_rate = 32000
audio_target_length = 10 * sample_rate  # 10 s
```
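
From here, a minimal sketch of the forward pass, with `waveform` prepared as in the preprocessing snippet above. It assumes a PANNs-style output dictionary with keys `clipwise_logits` and `clipwise_output`, and a `forward_frame_embeddings` helper on the model; these names are assumptions here and should be checked against the repo:

```python
with torch.no_grad():
    output = model(waveform)  # waveform: (1, audio_target_length)

# Assumed PANNs-style output keys
logits = output["clipwise_logits"]  # (1, 527) raw scores
probs = output["clipwise_output"]   # (1, 527) sigmoid probabilities

# Top-5 most probable of the 527 AudioSet tags
top5 = torch.topk(probs, k=5, dim=-1).indices
print("Top-5 tag indices:", top5)

# Frame-level embeddings (assumed helper name)
frame_emb = model.forward_frame_embeddings(waveform)
print("Frame-level embeddings, shape:", frame_emb.shape)
```
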
Output:

```
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```
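
Since the scene embedding is the frame-level embedding map after mean pooling (as noted above), it can be recovered from this tensor directly; a one-line sketch:

```python
# Mean-pool the (1, 768, 31, 7) map over its two frame axes -> (1, 768)
scene_emb = frame_emb.mean(dim=(2, 3))
```
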

# Zenodo

The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1

A second checkpoint, convnext_tiny_465mAP_BL_AC_70kit.pth, is available there as well. It is useful for audio captioning on the AudioCaps dataset without training-data biases: it was trained the same way as the current model, for audio tagging on AudioSet, but the files belonging to AudioCaps were removed from the AudioSet development set.

# Citation

Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173. doi: 10.21437/Interspeech.2023-1564
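
For LaTeX users, a BibTeX entry assembled from the reference above (the entry key is arbitrary):

```bibtex
@inproceedings{pellegrini2023convnext,
  author    = {Pellegrini, T. and Khalfaoui-Hassani, I. and Labb{\'e}, E. and Masquelier, T.},
  title     = {Adapting a {ConvNeXt} Model to Audio Classification on {AudioSet}},
  booktitle = {Proc. INTERSPEECH 2023},
  pages     = {4169--4173},
  year      = {2023},
  doi       = {10.21437/Interspeech.2023-1564}
}
```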