SR SSL FlowMatching 16kHz 430M
Model Overview
Description
This is a generative speech restoration model based on flow matching. The model is pre-trained on the publicly available Libri-Light dataset using a self-supervised learning technique. It can be finetuned on various speech restoration tasks, such as speech denoising, bandwidth extension, and codec artifact removal, for human or machine listeners.
This model is for research and development only.
License/Terms of Use
Use of this model is covered by the CC-BY-NC-SA-4.0 license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-NC-SA-4.0 license.
References
[1] Generative Speech Foundation Model Pretraining for High-Quality Speech Extraction and Restoration, 2024.
Model Architecture
Architecture Type: Conditional Flow Matching
Network Architecture: Transformer
Input
Input Type(s): Audio
Input Format(s): .wav files
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: 16000 Hz Mono-channel Audio
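Since the model expects 16000 Hz mono-channel .wav input, it can help to check files before passing them on. The following is a minimal sketch using only the Python standard library; the helper name `check_wav` is hypothetical and not part of NeMo, and the `wave` module reads PCM-encoded WAV files only.

```python
import wave

EXPECTED_RATE = 16000    # Hz, from the input properties listed above
EXPECTED_CHANNELS = 1    # mono


def check_wav(path):
    """Return a list of mismatches against the expected input format.

    An empty list means the file is 16000 Hz mono. This is a hypothetical
    helper for illustration, not a NeMo API. The standard `wave` module
    supports PCM WAV only and raises wave.Error on other encodings.
    """
    problems = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"file has {channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before use.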
Output
Output Type(s): Audio
Output Format: .wav files
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: 16000 Hz Mono-channel Audio
Software Integration
Runtime Engine(s):
- NeMo-2.0.0
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Turing
- NVIDIA Volta
Preferred Operating System(s)
- Linux
- Windows
Model Version(s)
sr_ssl_flowmatching_16k_430m_v1.0
Training, Testing, and Evaluation Datasets
Training Dataset
Link: Libri-Light
Data Collection Method by dataset: Human
Labeling Method by dataset: Not Applicable
Properties (Quantity, Dataset Descriptions, Sensor(s)):
Approximately 60k hours of English speech data
Testing Dataset
Link: Not Applicable
Evaluation Dataset
Link: Not Applicable
Inference
Engine: NeMo 2.0
Test Hardware: NVIDIA H100
How to use this model
The model is available for use in the NVIDIA NeMo toolkit and can be fine-tuned on various speech tasks.
Load the model
from nemo.collections.audio.models import AudioToAudioModel
model = AudioToAudioModel.from_pretrained('nvidia/sr_ssl_flowmatching_16k_430m')
Change sampler configuration
model.sampler.num_steps = 20 # default is 50 steps
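To give some intuition for the trade-off behind the step count, the sketch below assumes that the sampler integrates an ordinary differential equation in fixed steps, as flow matching samplers typically do. It is a toy fixed-step Euler integrator on a scalar problem with a known solution, not NeMo's sampler; the names `euler_sample` and `toy_velocity` are hypothetical, and the real sampler may use a different solver or time range.

```python
import math


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    Toy illustration only; this is not the NeMo sampler implementation.
    """
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x


def toy_velocity(x, t):
    # Toy velocity field with a known solution: dx/dt = x gives x(1) = e * x(0).
    return x


if __name__ == "__main__":
    # Print the integration error for a few step counts.
    for steps in (5, 20, 50):
        approx = euler_sample(toy_velocity, 1.0, steps)
        print(steps, abs(approx - math.e))
```

In this toy setting, fewer steps mean fewer velocity evaluations but a larger integration error. Whether a lower step count is acceptable for the actual model depends on the task and should be checked empirically.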
Finetuning
For finetuning, use init_from_nemo_model to provide a path to a local NeMo model, or init_from_pretrained_model to download a pretrained NeMo model.
For example, use the following in the finetuning configuration:
init_from_pretrained_model: sr_ssl_flowmatching_16k_430m
An example of a finetuning configuration can be found in NeMo.
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.