---
license: cc-by-4.0
language:
- en
library_name: NeMo
tags:
- Self-supervised Learning
- Conformer
- NeMo
- speech
- audio
---

# Model Overview
## Description:
NEST is a framework for speech self-supervised learning. A pretrained NEST model can be used as a frozen speech feature extractor or as weight initialization for downstream speech processing tasks. The NEST-L model has about 115M parameters and was trained on roughly 100K hours of English speech. <br>
This model is ready for commercial/non-commercial use. <br>

### License:
Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. By downloading the public release of the model, you accept the terms and conditions of that license.

## Reference:
[1] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106) <br>
[2] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) <br>
[3] [Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition](https://arxiv.org/abs/2312.17279) <br>
[4] [Less is More: Accurate Speech Recognition & Translation without Web-Scale Data](https://arxiv.org/abs/2406.19674) <br>
[5] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656) <br>
[6] [Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling](https://arxiv.org/abs/2307.07057) <br>

## Model Architecture:

**Architecture Type:** NEST [1] <br>

**Network Architecture:**
- Encoder: FastConformer (18 layers)
- Decoder: Linear classifier
- Masking: Random block masking
- Augmentor: Speaker/noise augmentation
- Loss: Cross-entropy on masked positions (see the sketch after this list) <br>
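
As a rough illustration of this objective (a minimal sketch, not the NeMo implementation; the masking ratio, block size, sequence length, and vocabulary size below are hypothetical), the loss is a cross-entropy computed only over masked frame positions:

```python
import torch
import torch.nn.functional as F

def random_block_mask(num_frames: int, mask_ratio: float = 0.2, block_size: int = 40) -> torch.Tensor:
    """Mask contiguous blocks of frames; returns a boolean mask of shape [num_frames]."""
    mask = torch.zeros(num_frames, dtype=torch.bool)
    num_blocks = max(1, int(num_frames * mask_ratio / block_size))
    starts = torch.randint(0, max(1, num_frames - block_size), (num_blocks,))
    for s in starts:
        mask[s:s + block_size] = True
    return mask

# logits: per-frame predictions over V quantized target tokens; targets: per-frame token ids.
T, V = 500, 1024                      # hypothetical sequence length and vocabulary size
logits = torch.randn(T, V)
targets = torch.randint(0, V, (T,))
mask = random_block_mask(T)
loss = F.cross_entropy(logits[mask], targets[mask])  # loss only on masked positions
```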

## Input:
**Input Type(s):** Audio <br>
**Input Format(s):** wav files <br>
**Input Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Input:** 16000 Hz mono-channel audio <br>
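
For example, an arbitrary recording can be converted to the expected format with torchaudio (an illustrative choice, not a NeMo requirement; file paths are hypothetical):

```python
import torchaudio
import torchaudio.functional as AF

# Convert an arbitrary recording to 16 kHz mono before feeding it to the model.
wav, sr = torchaudio.load("input.wav")        # wav: [channels, samples]
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)       # downmix to mono
if sr != 16000:
    wav = AF.resample(wav, orig_freq=sr, new_freq=16000)
torchaudio.save("input_16k_mono.wav", wav, 16000)
```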

## Output:
**Output Type(s):** Audio features <br>
**Output Format:** Audio embeddings <br>
**Output Parameters:** Feature sequence (2D) <br>
**Other Properties Related to Output:** Audio feature sequence of shape [D, T] <br>


## Model Version(s):
`ssl_en_nest_large_v1.0` <br>


## How to Use the Model:
The model is available in the NVIDIA NeMo Framework [2] and can be used either as weight initialization for downstream tasks or as a frozen feature extractor.
### Loading the whole model:
```python
from nemo.collections.asr.models import EncDecDenoiseMaskedTokenPredModel
nest_model = EncDecDenoiseMaskedTokenPredModel.from_pretrained(model_name="nvidia/ssl_en_nest_large_v1.0")
```
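Once loaded, the model can produce embeddings directly. A minimal sketch, assuming the standard NeMo preprocessor/encoder interface (the audio file is a hypothetical 16 kHz mono recording):

```python
import torch
import soundfile as sf

nest_model.eval()
audio, sr = sf.read("speech_16khz_mono.wav", dtype="float32")  # 16 kHz mono, as required
with torch.no_grad():
    signal = torch.tensor(audio).unsqueeze(0)       # [1, num_samples]
    length = torch.tensor([signal.shape[1]])
    # Mel features, then the FastConformer encoder.
    feats, feat_len = nest_model.preprocessor(input_signal=signal, length=length)
    emb, emb_len = nest_model.encoder(audio_signal=feats, length=feat_len)
print(emb.shape)  # [1, D, T] audio feature sequence
```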
### Using NEST encoder as weight initialization for downstream tasks:
```bash
# Using ASR as an example.
# Optionally add: --config-path=<path to dir of configs> --config-name=<name of config without .yaml>
python <NeMo Root>/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
    ++init_from_pretrained.name="nvidia/ssl_en_nest_large_v1.0" \
    ++init_from_pretrained.include=["encoder"] \
    model.train_ds.manifest_filepath=<path to train manifest> \
    model.validation_ds.manifest_filepath=<path to val/test manifest> \
    model.tokenizer.dir=<path to directory of tokenizer (not full path to the vocab file!)> \
    model.tokenizer.type=<either bpe or wpe> \
    trainer.devices=-1 \
    trainer.accelerator="gpu" \
    trainer.strategy="ddp" \
    trainer.max_epochs=100 \
    model.optim.name="adamw" \
    model.optim.lr=0.001 \
    model.optim.betas=[0.9,0.999] \
    model.optim.weight_decay=0.0001 \
    model.optim.sched.warmup_steps=2000 \
    exp_manager.create_wandb_logger=True \
    exp_manager.wandb_logger_kwargs.name="<Name of experiment>" \
    exp_manager.wandb_logger_kwargs.project="<Name of project>"
```
More details can be found at [maybe_init_from_pretrained_checkpoint()](https://github.com/NVIDIA/NeMo/blob/main/nemo/core/classes/modelPT.py#L1236).
### Using NEST as a frozen feature extractor:
NEST can also be used as a frozen feature extractor for downstream tasks. For example, in the case of speaker verification, embeddings can be extracted from different layers of the NEST model, and a learned weighted combination of those embeddings can be used as input to the speaker verification model.
Please refer to this example [script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_pretraining/downstream/speech_classification_mfa_train.py) and [config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/ssl/nest/multi_layer_feat/nest_titanet_small.yaml) for details.
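
The weighted combination itself is simple. A generic sketch of the idea (illustrative only; it mirrors the approach in the script above rather than reproducing its code, and the shapes are hypothetical):

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learn one scalar weight per encoder layer and sum the layer features."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: [num_layers, B, D, T] features stacked across encoder layers
        w = torch.softmax(self.layer_weights, dim=0)
        return (w[:, None, None, None] * layer_feats).sum(dim=0)  # [B, D, T]

# Usage: stack features from all 18 encoder layers, then feed the combination downstream.
combine = WeightedLayerSum(num_layers=18)
combined = combine(torch.randn(18, 4, 512, 100))
```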

### Extracting audio features from NEST

NEST supports extracting audio features from multiple layers of its encoder:
```bash
python <NeMo Root>/scripts/ssl/extract_features.py \
    --model_path="nvidia/ssl_en_nest_large_v1.0" \
    --input=<path to input manifest, or a dir containing audios, or path to audio> \
    --output=<output directory to store features and manifest> \
    --layers="all" \
    --batch_size=8 \
    --workers=8
```
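If `--input` is a manifest, each line is a JSON object in the usual NeMo manifest layout. A minimal sketch of building one (paths and durations are hypothetical):

```python
import json

entries = [
    {"audio_filepath": "/data/audio/sample1.wav", "duration": 3.2},
    {"audio_filepath": "/data/audio/sample2.wav", "duration": 5.7},
]
with open("input_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")  # one JSON object per line
```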

## Training
The model was trained for two hundred epochs using the [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) [2], with this example [script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_pretraining/masked_token_pred_pretrain.py) and [config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/ssl/nest/nest_fast-conformer.yaml).
## Training Datasets
- [LibriLight](https://github.com/facebookresearch/libri-light)
  - Data Collection Method: Human
  - Labeling Method: Human
- [Voxpopuli](https://github.com/facebookresearch/voxpopuli)
  - Data Collection Method: Human
  - Labeling Method: Human
- NeMo ASR Set 3.0
  - Data Collection Method: Hybrid: Automated, Human
  - Labeling Method: Hybrid: Automated, Human
<br>

## Inference:
**Engine:** NVIDIA NeMo <br>
**Test Hardware:** <br>
* A6000 <br>
* A100 <br>


## Performance

For performance on more tasks, please refer to the NEST paper [1].

### Multilingual Speech Recognition (ASR) with Punctuation and Capitalization

We fine-tuned the NEST model on 14k hours of multilingual (En, De, Es, Fr) ASR data using the hybrid CTC-RNNT loss [3], and evaluated the model's **word error rate (WER) with punctuation and capitalization** on the MCV16.1 test set. Please refer to the NEST paper [1] for more results and details on the model and training setup.

Model | En-MCV16.1-test | De-MCV16.1-test | Es-MCV16.1-test | Fr-MCV16.1-test
:----:|:---------------:|:---------------:|:---------------:|:---------------:
ssl_en_nest_xlarge | 14.43 | 8.07 | 8.70 | 16.18


### Speech-to-text Translation (AST)

We use the `ssl_en_nest_xlarge` model to initialize the Canary [4] model for speech-to-text translation, and evaluate the model's **BLEU score** on the FLEURS test sets. Please refer to the NEST paper [1] for more results and details on the model and training setup.

Model | En->De | En->Es | En->Fr
:----:|:-----:|:-----:|:-----:
ssl_en_nest_xlarge | 29.50 | 22.61 | 39.27


### Speaker Diarization (SD)

We use the `ssl_en_nest_large_v1.0` model to initialize the Sortformer [5] model for speaker diarization. We evaluate the model's **diarization error rate (DER)** on the DIHARD and CALLHOME-part2 test sets. Please refer to the Sortformer paper [5] for more results and details on the model and training setup.

Model | DIHARD (<=4 speakers, collar=0.0) | CALLHOME-part2 (2 speakers, collar=0.25) | CALLHOME-part2 (3 speakers, collar=0.25) | CALLHOME-part2 (4 speakers, collar=0.25)
:----:|:-----:|:-----:|:-----:|:-----:
Sortformer w/ NEST | 14.60 | 6.08 | 9.57 | 15.40


### Speech Intent Classification and Slot Filling (SLU)

We use the `ssl_en_nest_large_v1.0` model to initialize the SLU model [6] for speech intent classification and slot filling. We evaluate the model's **intent classification accuracy** and **SLURP F1 score** on the SLURP test set. Please refer to the NEST paper [1] for more results and details on the model and training setup.

Model | Intent Acc | SLURP F1
:----:|:---------:|:-------:
ssl_en_nest_large_v1.0 | 89.79 | 79.61
ssl_en_nest_xlarge_v1.0 | 89.04 | 80.31

## Software Integration:

**Runtime Engine(s):**
* NeMo 2.0 <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
* NVIDIA Ampere <br>
* NVIDIA Blackwell <br>
* NVIDIA Jetson <br>
* NVIDIA Hopper <br>
* NVIDIA Lovelace <br>
* NVIDIA Pascal <br>
* NVIDIA Turing <br>
* NVIDIA Volta <br>

**Supported Operating System(s):** <br>
* Linux <br>
* Windows <br>


## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy subcards below.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).


## Bias

Field | Response
:---------------------------------------------------------------------------------------------------|:---------------
Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None
Measures taken to mitigate against unwanted bias: | None


## Explainability

Field | Response
:------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------
Intended Application & Domain: | Model initialization or feature extractor for downstream speech processing tasks
Model Type: | Transformer
Intended Users: | Researchers and developers in speech processing
Output: | Audio embeddings
Describe how the model works: | The speech signal is processed by the model to produce audio embeddings
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable
Technical Limitations: | This model was trained on English speech data and may not generalize well to other languages. Although the model was trained with audio lengths from 1 second to 64 seconds, it may not perform well in streaming situations.
Verified to have met prescribed NVIDIA quality standards: | Yes
Performance Metrics: | Accuracy, F1, WER, DER
Potential Known Risks: | Speech features might not be effective for unseen languages and non-speech signals
Licensing: | [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)

## Privacy

Field | Response
:----------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------
Generatable or reverse engineerable personal data? | No
Personal data used to create this model? | No
Was consent obtained for any personal data used? | Not Applicable
How often is dataset reviewed? | Before Release
Is a mechanism in place to honor data subject right of access or deletion of personal data? | Not Applicable
If personal data was collected for the development of the model, was it collected directly by NVIDIA? | Not Applicable
If personal data was collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? | Not Applicable
If personal data was collected for the development of this AI model, was it minimized to only what was required? | Not Applicable
Is there provenance for all datasets used in training? | Yes
Does data labeling (annotation, metadata) comply with privacy laws? | Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data.


## Safety

Field | Response
:---------------------------------------------------|:----------------------------------
Model Application(s): | Model initialization or feature extractor for downstream speech processing tasks
Describe the life critical impact (if present). | Not Applicable
Use Case Restrictions: | Abide by [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)
Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions are enforced on dataset access during training, and dataset license constraints are adhered to.