Source of audio used to train Whisper

#15
by mahelona - opened

Aloha,

What are the sources for the audio used in training Whisper? In particular, our team are interested to know where the 1381 hours of te reo Māori and 338 hours of ʻōlelo Hawaiʻi are taken from. Were these just scraped from YouTube?

Thank you,
Keoni.

Hey @mahelona ! These are secrets only OpenAI know... All the knowledge about the training data that's in the public domain can be found in the Whisper paper: https://arxiv.org/pdf/2212.04356.pdf

The trained checkpoints were publicly released (and are thus hosted on the HF Hub); however, the dataset remains behind closed doors. I would also very much like to know more details about the training data, but the situation is unlikely to change.

thanks for the link, that helps
