metadata

license: cc-by-4.0
datasets:
  - benjamin-paine/hey-buddy
  - benjamin-paine/freesound-laion-640k-commercial-16khz-full
  - benjamin-paine/free-music-archive-commercial-16khz-full
  - benjamin-paine/mit-impulse-response-survey-16khz
language:
  - en
tags:
  - wake word
  - keyword spotter
  - wakeword

Hey Buddy! is a library for training wake word models (a.k.a audio keyword spotters) and deploying them to the browser for real-time use on CPU or GPU.

Using a wake-word as a gating mechanism for voice-enabled web applications carries numerous benefits, including reduced power consumption, improved privacy, and enhanced performance in noisy environments over speech-to-text systems.

Designed for usability and flexibility, Hey Buddy! features the following:

Available with a simple pip install heybuddy
Use a single command to train a model, no downloads or data gathering required
Models are resilient to noisy environments and hurried speech
Verified for commercial use - uses open data and libraries with commercially-viable licenses (see below for specifics)
Process an arbitrary number of Wake Word models at once, with plenty of frame budget left over
Voice-command-oriented JavaScript module provides a recording of the audio from the beginning of the wake word to the end of speech, ready to be dispatched for downstream processing
Callback interface for more advanced integrations

Demo

A live demo space is available here. As this runs entirely on-device, this will not consume any HuggingFace credits.

Installation

The Python library only works on Linux at the moment, due to piper-phonemize only working on Linux.

The easiest way to install and use Hey Buddy is with Anaconda/Miniconda and the maintained environment file like so:

wget https://raw.githubusercontent.com/painebenjamin/hey-buddy/refs/heads/main/environment.yml
conda env create -f environment.yml
conda activate heybuddy
pip install heybuddy[gpu]

If you already have a working CUDA environment, you can just skip to pip install heybuddy. You will need to install piper-phonemize separately in this case, see the rhasspy/piper-phonemizer releases page on GitHub to find the latest wheel for your python version - for example, the environment file uses python 3.10, so it installs https://github.com/rhasspy/piper-phonemize/releases/download/v1.1.0/piper_phonemize-1.1.0-cp310-cp310-manylinux_2_28_x86_64.whl.

Training Models

Training a model is as simple as executing heybuddy train:

heybuddy train "hello world"

This will:

Generate 100,000 training speech samples of the wake phrase using Piper TTS
Generate 100,000 adversarial training speech samples of phonetically-similar phrases using Piper TTS
Augment the wake word text with common follow-up words for conversational or command-oriented speech
Augment generated speech samples using sound effects and music in varying ratios of signal-to-noise
Reverberate these speech samples using varying impule responses (IRs)
Undergo any combination of additional distortions including white noise, pitch shifting, and more.
Extract speech embeddings from both sets of samples, storing as memory-mapped numpy arrays for quick access during training
Repeat the above process is repeated to generate testing samples and validation* samples
Download the precalculated training and validation datasets if not already downloaded
Train the model in an automated fashion in three stages
Save the model's checkpoint for conversion to ONNX or resuming training at a later time.

* Validation is slightly different from testing in that the positive samples are not augmented with noise, and thus serve as a 'best-case' speech validation that should be close to 100% accurate. Validation, additionally, does not include adversarial samples.

Finally, a convenient script is available for converting the .pt checkpoint to a .onnx model for the web:

heybuddy convert checkpoints/hello_world_final.pt

You can now serve this model on the web. See the JavaScript API documentation below.

Example Generated Audio

When generating new datasets, an example .wav file will be generated in your current working directory for you to listen to. Here are some examples of these:

Text	TTS Output	Augmented TTS Output	Adversarial (Sound-Alike) TTS Output	Augmented Adversarial TTS Output
Buddy
Hey Buddy
Hi Buddy
Yo Buddy
Sup Buddy
Okay Buddy

Precalculated Datasets

Training and validation datasets are automatically downloaded when using the command-line interface with default options. See here for more information about these datasets and how you can make your own.

If you have space limitations, reference the table below for command-line flags that can help reduce size.

Flag	Size	Default
`--training-full-default-dataset`	72 GB	Yes
`--training-large-default-dataset`	46 GB	No
`--training-medium-default-dataset`	25 GB	No
`--training-no-default-dataset`	0	No

Note when using no default dataset, you will need to provide your own dataset with --training-dataset <path>. See the link above for how to create your own.

Augmentation Datasets

These are the default augmentation datasets. These were selected for their commercial availability and lack of copyleft restrictions so your trained wake-word models are your own. If you substitute another dataset, make sure you have the required license to use that data for AI training.

benjamin-paine/mit-impulse-response-survey-16khz (CC-BY)
benjamin-paine/freesound-laion-640k-commercial-16khz-full (CC0, CC-BY, CC-Sampling)
benjamin-paine/free-music-archive-commercial-16khz-full (CC0, CC-BY, CC-Sampling+, FMA Sound Recording Common Law, Free Art License, Public Domain)

When using the command-line, pass --augmentation-dataset-streaming to stream datasets instead of downloading them. This is not recommended for performance.

To disable using these datasets, pass --augmentation-no-default-background-dataset and --augmentation-no-default-impulse-dataset. You should pass your own datasets in the format <username>/<repo>, for loading via load_dataset

CLI Options

Usage: heybuddy train [OPTIONS] PHRASE                         
                                                                                   
  Trains a wake word detection model.                                              
                                                                                                                                                                       
Options:                                                                           
  --additional-phrase TEXT        Additional phrases to use for training.  
  --wandb-entity TEXT             W&B entity to use for logging.           
  --layer-dim INTEGER             Dimension of the linear layers to use for   
                                  the model.  [default: 96]                   
  --num-layers INTEGER            The number of perceptron blocks.  [default:
                                  2]                                               
  --num-heads INTEGER             The number of attention heads to use when        
                                  using the transformer model.  [default: 1]       
  --steps INTEGER                 Number of optimization steps to take.      
                                  [default: 12500]                          
  --stages INTEGER                Number of training stages.  [default: 3]  

  --learning-rate FLOAT           Learning rate for the optimizer.  [default:
                                  0.001]                                           
  --high-loss-threshold FLOAT     Threshold for high loss values (e.g. with  
                                  the default 0.001, a value is high-loss if
                                  it's supposed to be 0 and is higher than  
                                  0.001 or supposed to be 1 and is lower than
                                  0.999)  [default: 0.0001]       
  --target-false-positive-rate FLOAT                                               
                                  Target false positive rate for the model.
                                  [default: 0.5]                              
  --dynamic-negative-weight / --no-dynamic-negative-weight                         
                                  Dynamically adjust the negative weight based
                                  on target false positive rate at each       
                                  validation step (instead of in-between
                                  stages.)  [default: dynamic-negative-weight]
  --negative-weight FLOAT         Negative weight for the loss function.     
                                  [default: 1.0]
  --training-full-default-dataset                                                  
                                  Use the full precalculated default training
                                  set.  [default: full]
  --training-large-default-dataset                                                 
                                  Use the large precalculated default training
                                  set.  [default: full]
  --training-medium-default-dataset                                                
                                  Use the medium precalculated default        
                                  training set.  [default: full]
  --training-no-default-dataset   Do not use a precalculated default training 
                                  set.  [default: full]                       
  --training-dataset FILE         Use a custom precalculated training set.
  --augment-phrase-prob FLOAT     Probability of augmenting the phrase.  
                                  [default: 0.75]                            
  --augment-phrase-default-words / --augment-phrase-no-default-words
                                  Use the default words for augmentation.
                                  [default: augment-phrase-default-words]
  --augment-phrase-word TEXT      Custom words to use for augmentation.
  --augmentation-default-background-dataset / --augmentation-no-default-background-dataset
                                  Use the default background dataset for
                                  augmentation.  [default: augmentation-   
                                  default-background-dataset]              
  --augmentation-background-dataset TEXT                                           
                                  Use a custom background dataset for
                                  augmentation.                               
  --augmentation-default-impulse-dataset / --augmentation-no-default-impulse-dataset
                                  Use the default impulse dataset for      
                                  augmentation.  [default: augmentation-
                                  default-impulse-dataset]                    
  --augmentation-impulse-dataset TEXT                                              
                                  Use a custom impulse dataset for                 
                                  augmentation.                     
  --augmentation-dataset-streaming / --augmentation-dataset-no-streaming      
                                  Stream the augmentation datasets, instead of
                                  downloading first.  [default: augmentation-
                                  dataset-no-streaming]
  --augmentation-seven-band-prob FLOAT                                             
                                  Probability of applying the seven band           
                                  equalization augmentation.  [default: 0.25]
  --augmentation-seven-band-gain-db FLOAT                                   
                                  Gain in decibels for the seven band       
                                  equalization augmentation.  [default: 6.0]

  --augmentation-tanh-distortion-prob FLOAT                                 
                                  Probability of applying the tanh distortion
                                  augmentation.  [default: 0.25]          
  --augmentation-tanh-distortion-min FLOAT                                  
                                  Minimum value for the tanh distortion
                                  augmentation.  [default: 0.0001]
  --augmentation-tanh-distortion-max FLOAT                             
                                  Maximum value for the tanh distortion
                                  augmentation.  [default: 0.1]               
  --augmentation-pitch-shift-prob FLOAT                                            
                                  Probability of applying the pitch shift     
                                  augmentation.  [default: 0.25]              
  --augmentation-pitch-shift-semitones INTEGER
                                  Number of semitones to shift the pitch for  
                                  the pitch shift augmentation.  [default: 3]
  --augmentation-band-stop-prob FLOAT
                                  Probability of applying the band stop filter
                                  augmentation.  [default: 0.25]  
  --augmentation-colored-noise-prob FLOAT 
                                  Probability of applying the colored noise   
                                  augmentation.  [default: 0.25]              
  --augmentation-colored-noise-min-snr-db FLOAT
                                  Minimum signal-to-noise ratio for the       
                                  colored noise augmentation.  [default: 10.0]
  --augmentation-colored-noise-max-snr-db FLOAT
                                  Maximum signal-to-noise ratio for the       
                                  colored noise augmentation.  [default: 30.0]
  --augmentation-colored-noise-min-f-decay FLOAT           
                                  Minimum frequency decay for the colored
                                  noise augmentation.  [default: -1.0]       
  --augmentation-colored-noise-max-f-decay FLOAT                   
                                  Maximum frequency decay for the colored
                                  noise augmentation.  [default: 2.0]
  --augmentation-background-noise-prob FLOAT                           
                                  Probability of applying the background noise
                                  augmentation.  [default: 0.75]
  --augmentation-background-noise-min-snr-db FLOAT                         
                                  Minimum signal-to-noise ratio for the    
                                  background noise augmentation.  [default:
                                  -10.0]
  --augmentation-background-noise-max-snr-db FLOAT                            
                                  Maximum signal-to-noise ratio for the    
                                  background noise augmentation.  [default:
                                  15.0]
  --augmentation-gain-prob FLOAT  Probability of applying the gain            
                                  augmentation.  [default: 1.0]             
  --augmentation-reverb-prob FLOAT                                                 
                                  Probability of applying the reverb
                                  augmentation.  [default: 0.75]              
  --logging-steps INTEGER         How often to log step details.  [default: 1]
  --validation-steps INTEGER      How often to validate the model.  [default:
                                  250]
  --checkpoint-steps INTEGER      How often to save the model.  [default:    
                                  5000]                                            
  --positive-samples INTEGER      Number of positive samples to use for    
                                  training. Will synthetically generate more
                                  when needed.  [default: 100000]       
  --adversarial-samples INTEGER   Number of adversarial samples to use for
                                  training. Will synthetically generate more
                                  when needed.  [default: 100000]
  --adversarial-phrases INTEGER   Number of adversarial phrases to use for
                                  training. Will synthetically generate more
                                  when needed.  [default: 250]
  --adversarial-phrase-custom TEXT
                                  Custom adversarial phrases to use for
                                  training.
  --positive-batch-size INTEGER   The number of positive samples to include in
                                  each batch during training.  [default: 50]
  --negative-batch-size INTEGER   The number of negative samples to include in
                                  each batch during training.  [default: 1000]
  --adversarial-batch-size INTEGER
                                  The number of adversarial samples to include
                                  in each batch during training.  [default:
                                  50]
  --num-batch-threads INTEGER     The number of threads to spawn for creating
                                  training batches.  [default: 12]
  --validation-positive-batch-size INTEGER
                                  The number of positive samples to include in
                                  each batch during validation.  [default: 50]
  --validation-negative-batch-size INTEGER
                                  The number of negative samples to include in
                                  each batch during validation.  [default:
                                  1000]
  --validation-samples INTEGER    The number of samples to use for validation.
                                  Will synthetically generate more when
                                  needed.  [default: 25000]
  --validation-num-batch-threads INTEGER
                                  The number of threads to spawn for creating
                                  validation batches.  [default: 1]
  --validation-default-dataset / --validation-no-default-dataset
                                  Use the default validation dataset.
                                  [default: validation-default-dataset]
  --validation-dataset FILE       Use a custom precalculated validation set.
  --testing-positive-samples INTEGER
                                  The number of positive samples to use for
                                  testing. Will synthetically generate more
                                  when needed.  [default: 25000]
  --testing-adversarial-samples INTEGER
                                  The number of adversarial samples to use for
                                  testing. Will synthetically generate more
                                  when needed.  [default: 25000]
  --testing-positive-batch-size INTEGER
                                  The number of positive samples to include in
                                  each batch during testing. Default matches
                                  the size used during training.
  --testing-adversarial-batch-size INTEGER
                                  The number of adversarial samples to include
                                  in each batch during testing. Default
                                  matches the size used during training.
  --testing-num-batch-threads INTEGER
                                  The number of threads to spawn for creating
                                  testing batches.  [default: 1]
  --resume / --no-resume          Resume training from the last checkpoint.
                                  [default: no-resume]
  --debug / --no-debug            Enable debug logging.  [default: no-debug]
  --help                          Show this message and exit.

Evaluating Models

A number of evaluation metrics are recorded during training. The best way to access this data is to use Weights & Biases, then pass your entity (username or team name) to the training script via --wandb-entity <name>.

Training metrics from Weights & Biases. Click to enlarge.

Inference

JavaScript API

Hey Buddy! is distributed as two JavaScript files that should be made available through any implementing web application:

hey-buddy.min.js (90 kB) which provides the HeyBuddy API, and
hey-buddy-worklet.js (1 kB) which implements the Audio Worklet interface which will be imported by the browser using the AudioWorklet API.

You need to make the ONNX runtime available prior to importing the HeyBuddy JS API. The easiest way to do this is to use a JS CDN, like so:

<!-- ONNX Runtime -->
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web@1.19.0/dist/ort.min.js"></script>
<!-- Hey Buddy API -->
<script src="/path/to/my/website/hey-buddy.min.js"></script>
<!-- Your Code -->
<script>...</script>

You will also need to host the onnx file created by heybuddy convert (see above.)

Your implementation code could look something like this:

const options = {
    record: true, // disable when not needed to save memory
    modelPath: ["/models/my-model.onnx", "/models/my-other-model.onnx"], // Also accepts single strings
    // workletUrl: /path/to/my/website/hey-buddy-worklet.js // Only needed if not at the same path as the library code
};

const heyBuddy = new HeyBuddy(options);

// do something with a recording
heyBuddy.onRecording(async (audio) => {
    // audio is a Float32Array of samples
});

// do something every frame
heyBuddy.onProcessed((result) => {
    /**
     * this is triggered every frame with:
     * {
     *    listening: bool,
     *    recording: bool,
     *    speech: {
     *        probability: 0.0 <= float <= 1.0,
     *        active: bool
     *    },
     *    wakeWords: {
     *       $modelNameWithoutExtension: {
     *           probability: 0.0 <= float <= 1.0,
     *           active: bool
     *       }
     *    }
     * }
     */
});

Python API

To use the torch model directly, you can do the following.

from heybuddy import WakeWord

audio = "/path/to/audio.wav" # OR flac OR mp3 OR numpy array/tensor (int16 or float32) OR list of the same
model = WakeWord.from_file("/path/to/model.pt")
model.to("cuda") # optional
predictions = model.predict(
    audio,
    threshold=0.5, # you can experiment with different thresholds
    return_scores=False, # when true, return the prediction scores instead of bools. default false.
)
first_audio_has_wakeword = predictions[0] # bool by default

A threaded API is also available for ease-of-use and to minimize performance impact.

from heybuddy import WakeWordModelThread

audio = "/path/to/audio.wav" # OR flac OR mp3 OR numpy array/tensor (int16 or float32) OR list of the same
thread = WakeWordModelThread(
    "/path/to/model.pt", # or onnx
    device_id=None, # set to GPU index to use CUDA for torch or onnx, otherwise runs on CPU
    threshold=0.5, # you can experiment with different thresholds
    return_scores=False, # when true, return the prediction scores instead of bools. default false.
)

thread.put(audio)
predictions = thread.get( # same arguments as queue.Queue.get()
    block=True,
    timeout=None
)
first_audio_has_wakeword = predictions[0] # bool by default

Errata

These are some potentially useful JavaScript snippets for working with raw audio samples.

/**
 * Play audio samples using the Web Audio API.
 * @param {Float32Array} audioSamples - The audio samples to play.
 * @param {number} sampleRate - The sample rate of the audio samples.
 */
function playAudioSamples(audioSamples, sampleRate = 16000) {
    // Create an AudioContext
    const audioContext = new (window.AudioContext || window.webkitAudioContext)();

    // Create an AudioBuffer
    const audioBuffer = audioContext.createBuffer(
        1, // number of channels
        audioSamples.length, // length of the buffer in samples
        sampleRate // sample rate (samples per second)
    );

    // Fill the AudioBuffer with the Float32Array of audio samples
    audioBuffer.getChannelData(0).set(audioSamples);

    // Create a BufferSource node
    const source = audioContext.createBufferSource();
    source.buffer = audioBuffer;

    // Connect the source to the AudioContext's destination (the speakers)
    source.connect(audioContext.destination);

    // Start playback
    source.start();
};

/**
 * Turns floating-point audio samples to a Wave blob.
 * @param {Float32Array} audioSamples - The audio samples to play.
 * @param {number} sampleRate - The sample rate of the audio samples.
 * @param {number} numChannels - The number of channels in the audio. Defaults to 1 (mono).
 * @return {Blob} A blob of type `audio/wav`
 */
function samplesToBlob(audioSamples, sampleRate = 16000, numChannels = 1) {
    // Helper to write a string to the DataView
    const writeString = (view, offset, string) => {
        for (let i = 0; i < string.length; i++) {
            view.setUint8(offset + i, string.charCodeAt(i));
        }
    };  

    // Helper to convert Float32Array to Int16Array (16-bit PCM)
    const floatTo16BitPCM = (output, offset, input) => {
        for (let i = 0; i < input.length; i++, offset += 2) {
            let s = Math.max(-1, Math.min(1, input[i])); // Clamping to [-1, 1]
            output.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7FFF, true); // Convert to 16-bit PCM
        }
    };  

    const byteRate = sampleRate * numChannels * 2; // 16-bit PCM = 2 bytes per sample

    // Calculate sizes
    const blockAlign = numChannels * 2; // 2 bytes per sample for 16-bit audio
    const wavHeaderSize = 44; 
    const dataLength = audioSamples.length * numChannels * 2; // 16-bit PCM data length
    const buffer = new ArrayBuffer(wavHeaderSize + dataLength);
    const view = new DataView(buffer);

    // Write WAV file headers
    writeString(view, 0, 'RIFF'); // ChunkID
    view.setUint32(4, 36 + dataLength, true); // ChunkSize
    writeString(view, 8, 'WAVE'); // Format
    writeString(view, 12, 'fmt '); // Subchunk1ID
    view.setUint32(16, 16, true); // Subchunk1Size (PCM = 16)
    view.setUint16(20, 1, true); // AudioFormat (PCM = 1)
    view.setUint16(22, numChannels, true); // NumChannels
    view.setUint32(24, sampleRate, true); // SampleRate
    view.setUint32(28, byteRate, true); // ByteRate
    view.setUint16(32, blockAlign, true); // BlockAlign
    view.setUint16(34, 16, true); // BitsPerSample (16-bit PCM)
    writeString(view, 36, 'data'); // Subchunk2ID
    view.setUint32(40, dataLength, true); // Subchunk2Size

    // Convert the Float32Array audio samples to 16-bit PCM and write them to the DataView
    floatTo16BitPCM(view, wavHeaderSize, audioSamples);

    // Create and return the Blob
    return new Blob([view], { type: 'audio/wav' }); 
}

/**
 * Renders a blob to an audio element with controls.
 * Use `appendChild(result)` to add to the document or a node.
 * @param {Blob} audioBlob - A blob with a valid audio type.
 * @see samplesToBlob
 */
function blobToAudio(audioBlob) {
    // Create data URL
    const url = URL.createObjectURL(audioBlob);

    // Create and configure audio element
    const audio = document.createElement("audio");
    audio.controls = true;
    audio.src = url;
    return audio;
}

/**
 * Downloads a blob as a file.
 * @param {Blob} blob - A blob with a type that can be converted to an object URL.
 * @param {string} filename - The file name after downloading.
 */
function downloadBlob(blob, filename) {
    // Create data URL
    const url = URL.createObjectURL(blob);

    // Create and configure link element
    const link = document.createElement("a");
    link.href = url;
    link.setAttribute("download", filename);
    link.style.display = "none";

    // Add the link to the page, click, then remove
    document.body.appendChild(link);
    window.requestAnimationFrame(() => {
        link.dispatchEvent(new MouseEvent("click"));
        document.body.removeChild(link);
    });
}

License

HeyBuddy source code and pretrained models are released under the Apache License 2.0.
Pre-calculated training and validation datasets are released under CC-BY 4.0. See the dataset card for individual licensing information.

Citations and Acknowledgements

David Scripka, OpenWakeWord (Apache 2.0 License)
Lin, Kilgour, Roblek, Sharifi et. Al, Speech Embeddings (Apache 2.0 License)
Rhasspy and the Open Home Foundation, Piper TTS (MIT License)
Silero Team, SileroVAD (MIT License)
Siebert, Adelman and Git, npy-apppend-array

@misc{lin2020trainingkeywordspotterslimited,
  title={Training Keyword Spotters with Limited and Synthesized Speech Data}, 
  author={James Lin and Kevin Kilgour and Dominik Roblek and Matthew Sharifi},
  year={2020},
  eprint={2002.01322},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2002.01322}, 
}

@inproceedings{50492,
  title={Improving Automatic Speech Recognition with Neural Embeddings},
  author={Christopher Li and Diamantino A. Caseiro and Leonid Velikovich and Pat Rondon and Petar S. Aleksic and Xavier L Velez},
  year={2021},
  address={111 8th AveNew York, NY 10011}
}

@misc{Silero VAD,
  author = {Silero Team},
  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  commit = {insert_some_commit_here},
  email = {hello@silero.ai}
}

@inproceedings{fma_dataset,
  title = {{FMA}: A Dataset for Music Analysis},
  author = {Defferrard, Micha\"el and Benzi, Kirell and Vandergheynst, Pierre and Bresson, Xavier},
  booktitle = {18th International Society for Music Information Retrieval Conference (ISMIR)},
  year = {2017},
  archiveprefix = {arXiv},
  eprint = {1612.01840},
  url = {https://arxiv.org/abs/1612.01840},
}

@inproceedings{fma_challenge,
  title = {Learning to Recognize Musical Genre from Audio},
  subtitle = {Challenge Overview},
  author = {Defferrard, Micha\"el and Mohanty, Sharada P. and Carroll, Sean F. and Salath\'e, Marcel},
  booktitle = {The 2018 Web Conference Companion},
  year = {2018},
  publisher = {ACM Press},
  isbn = {9781450356404},
  doi = {10.1145/3184558.3192310},
  archiveprefix = {arXiv},
  eprint = {1803.05337},
  url = {https://arxiv.org/abs/1803.05337},
}

@article{
  doi:10.1073/pnas.1612524113,
  author = {James Traer and Josh H. McDermott},
  title = {Statistics of natural reverberation enable perceptual separation of sound and space},
  journal = {Proceedings of the National Academy of Sciences},
  volume = {113},
  number = {48},
  pages = {E7856-E7865},
  year = {2016},
  doi = {10.1073/pnas.1612524113},
  URL = {https://www.pnas.org/doi/abs/10.1073/pnas.1612524113},
  eprint = {https://www.pnas.org/doi/pdf/10.1073/pnas.1612524113},
  abstract = {Sounds produced in the world reflect off surrounding surfaces on their way to our ears. Known as reverberation, these reflections distort sound but provide information about the world around us. We asked whether reverberation exhibits statistical regularities that listeners use to separate its effects from those of a sound’s source. We conducted a large-scale statistical analysis of real-world acoustics, revealing strong regularities of reverberation in natural scenes. We found that human listeners can estimate the contributions of the source and the environment from reverberant sound, but that they depend critically on whether environmental acoustics conform to the observed statistical regularities. The results suggest a separation process constrained by knowledge of environmental acoustics that is internalized over development or evolution. In everyday listening, sound reaches our ears directly from a source as well as indirectly via reflections known as reverberation. Reverberation profoundly distorts the sound from a source, yet humans can both identify sound sources and distinguish environments from the resulting sound, via mechanisms that remain unclear. The core computational challenge is that the acoustic signatures of the source and environment are combined in a single signal received by the ear. Here we ask whether our recognition of sound sources and spaces reflects an ability to separate their effects and whether any such separation is enabled by statistical regularities of real-world reverberation. To first determine whether such statistical regularities exist, we measured impulse responses (IRs) of 271 spaces sampled from the distribution encountered by humans during daily life. The sampled spaces were diverse, but their IRs were tightly constrained, exhibiting exponential decay at frequency-dependent rates: Mid frequencies reverberated longest whereas higher and lower frequencies decayed more rapidly, presumably due to absorptive properties of materials and air. To test whether humans leverage these regularities, we manipulated IR decay characteristics in simulated reverberant audio. Listeners could discriminate sound sources and environments from these signals, but their abilities degraded when reverberation characteristics deviated from those of real-world environments. Subjectively, atypical IRs were mistaken for sound sources. The results suggest the brain separates sound into contributions from the source and the environment, constrained by a prior on natural reverberation. This separation process may contribute to robust recognition while providing information about spaces around us.}
}

@article{Pratap2020MLSAL,
  title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.03411}
}

@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}

@misc{wang2024globe,
  title={GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech}, 
  author={Wenbin Wang and Yang Song and Sanjay Jha},
  year={2024},
  eprint={2406.14875},
  archivePrefix={arXiv},
}

@article{Instruction Speech 2024,
  title={Instruction Speech},
  author={JanAI},
  year=2024,
  month=June},
  url={https://huggingface.co/datasets/jan-hq/instruction-speech}
}

@inproceedings{wang-etal-2021-voxpopuli,
  title = "{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation",
  author = "Wang, Changhan  and
    Riviere, Morgane  and
    Lee, Ann  and
    Wu, Anne  and
    Talnikar, Chaitanya  and
    Haziza, Daniel  and
    Williamson, Mary  and
    Pino, Juan  and
    Dupoux, Emmanuel",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
  month = aug,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-long.80",
  pages = "993--1003",
}

@article{fleurs2022arxiv,
  title = {FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  journal={arXiv preprint arXiv:2205.12446},
  url = {https://arxiv.org/abs/2205.12446},
  year = {2022},
}

@misc{vansegbroeck2019dipcodinnerparty,
  title={DiPCo -- Dinner Party Corpus}, 
  author={Maarten Van Segbroeck and Ahmed Zaid and Ksenia Kutsenko and Cirenia Huerta and Tinh Nguyen and Xuewen Luo and Björn Hoffmeister and Jan Trmal and Maurizio Omologo and Roland Maas},
  year={2019},
  eprint={1909.13447},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/1909.13447}, 
}

@software{siebert2023npy_append_array,
  author = {Michael Siebert and Joshua Adelman and Yoav Git},
  title = {xor2k/npy-append-array: 0.9.16},
  version = {0.9.16},
  doi = {10.5281/zenodo.13820984},
  url = {https://github.com/xor2k/npy-append-array/tree/0.9.16},
  year = {2023},
  month = {feb},
  day = {24},
  abstract = {Create Numpy .npy files by appending on the growth axis},
  license = {MIT},
  orcid = {0000-0002-1369-6321},
  type = {software},
}