SR SSL FlowMatching 16kHz 430M

Model Overview

Description

This is a generative speech restoration model based on flow matching. The model is pre-trained on the publicly available Libri-Light dataset using a self-supervised learning technique. The model can be finetuned on various speech restoration tasks, such as speech denoising, bandwidth extension, and codec artifact removal, for human or machine listeners.

This model is for research and development only.

License/Terms of Use

Use of this model is governed by the CC-BY-NC-SA-4.0 license. By downloading the publicly released version of the model, you accept the terms and conditions of the CC-BY-NC-SA-4.0 license.

References

[1] Generative Speech Foundation Model Pretraining for High-Quality Speech Extraction and Restoration, 2024.

Model Architecture

Architecture Type: Conditional Flow Matching
Network Architecture: Transformer

Input

Input Type(s): Audio
Input Format(s): .wav files
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: 16000 Hz Mono-channel Audio
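
The input requirements above (16000 Hz, mono, .wav) can be checked before a file is passed to the model. The following is a minimal sketch using only the Python standard library; the helper names `wav_format` and `matches_model_input` are hypothetical and are not part of NeMo. It assumes PCM-encoded .wav files, since the standard `wave` module does not read other encodings.

```python
import wave

# Input properties stated in this model card.
EXPECTED_RATE_HZ = 16000
EXPECTED_CHANNELS = 1


def wav_format(path):
    """Return (sample_rate_hz, num_channels) read from a PCM .wav header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()


def matches_model_input(path):
    """Return True if the file is 16000 Hz mono, as this model expects."""
    rate, channels = wav_format(path)
    return rate == EXPECTED_RATE_HZ and channels == EXPECTED_CHANNELS
```

A file that fails this check would need to be resampled or downmixed with an audio tool of your choice before use.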

Output

Output Type(s): Audio
Output Format: .wav files
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: 16000 Hz Mono-channel Audio

Software Integration

Runtime Engine(s):

NeMo-2.0.0

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Jetson
NVIDIA Hopper
NVIDIA Lovelace
NVIDIA Turing
NVIDIA Volta

Preferred Operating System(s)

Linux
Windows

Model Version(s)

sr_ssl_flowmatching_16k_430m_v1.0

Training, Testing, and Evaluation Datasets

Training Dataset

Link: Libri-Light

Data Collection Method by dataset: Human

Labeling Method by dataset: Not Applicable

Properties (Quantity, Dataset Descriptions, Sensor(s)): Approximately 60k hours of English speech data

Testing Dataset

Link: Not Applicable

Evaluation Dataset

Link: Not Applicable

Inference

Engine: NeMo 2.0

Test Hardware: NVIDIA H100

How to use this model

The model is available for use in the NVIDIA NeMo toolkit and can be used for finetuning on various speech tasks.

Load the model

from nemo.collections.audio.models import AudioToAudioModel
model = AudioToAudioModel.from_pretrained('nvidia/sr_ssl_flowmatching_16k_430m')

Change sampler configuration

model.sampler.num_steps = 20 # default is 50 steps
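
To illustrate why the number of sampler steps matters, recall that sampling from a flow matching model amounts to numerically integrating an ordinary differential equation from t = 0 to t = 1. The sketch below uses a fixed-step Euler integrator on a toy velocity field with a known exact solution. It is an illustration only: `euler_sample` and `toy_velocity` are hypothetical names, the velocity field is not the learned model, and the integrator is assumed here rather than taken from the NeMo sampler.

```python
import math


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        # One Euler update: move along the current velocity for a time dt.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t):
    """Toy field dx/dt = -x, whose exact solution at t=1 is x0 * exp(-1)."""
    return [-xi for xi in x]


if __name__ == "__main__":
    exact = math.exp(-1.0)
    for steps in (5, 20, 50):
        approx = euler_sample(toy_velocity, [1.0], steps)[0]
        print(steps, abs(approx - exact))
```

In this toy setting the integration error shrinks as the step count grows, while the number of velocity evaluations grows linearly with it. Lowering `num_steps` on the real sampler makes the same kind of trade: fewer network evaluations per utterance in exchange for a coarser integration.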

Finetuning

For finetuning, use init_from_nemo_model to provide a path to a local NeMo model, or init_from_pretrained_model to download a pretrained NeMo model. For example, use the following in the finetuning configuration:

init_from_pretrained_model: sr_ssl_flowmatching_16k_430m

An example of a finetuning configuration can be found in NeMo.
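
Since the two initialization keys described above are alternatives, a configuration can be sanity-checked before launching a finetuning run. The following is a minimal sketch in plain Python; `select_init_source` is a hypothetical helper, not a NeMo function, and treating a configuration with both keys set as an error is an assumption made for this example.

```python
# The two initialization keys named in this model card.
INIT_KEYS = ("init_from_nemo_model", "init_from_pretrained_model")


def select_init_source(cfg):
    """Return (key, value) for the init key that is set, or None if neither is.

    Raises ValueError if both keys are set (an assumption of this sketch).
    """
    chosen = [(key, cfg[key]) for key in INIT_KEYS if cfg.get(key) is not None]
    if len(chosen) > 1:
        raise ValueError("Set only one of: " + ", ".join(INIT_KEYS))
    return chosen[0] if chosen else None
```

For the configuration line shown above, the helper would return the `init_from_pretrained_model` key together with the model name.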

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this model in accordance with our terms of service, developers should work with their internal model team to ensure that it meets the requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.