Source of audio used to train Whisper
#15, opened by mahelona
Aloha,
What are the sources for the audio used in training Whisper? In particular, our team are interested to know where the 1381 hours of te reo Māori and 338 hours of ʻōlelo Hawaiʻi are taken from. Were these just scraped from YouTube?
Thank you,
Keoni.
Hey @mahelona! These are secrets only OpenAI knows... Everything that is publicly known about the training data can be found in the Whisper paper: https://arxiv.org/pdf/2212.04356.pdf
The trained checkpoints were publicly released (and are thus hosted on the HF Hub); the dataset, however, remains behind closed doors. I would also very much like to know more details about the training data! But that situation is unlikely to change.
thanks for the link, that helps