language:
  - ms
  - en

Malaysian Distil Whisper Large V3

Distil Whisper Large V3 trained on the following Malaysian datasets:

  1. IMDA STT, https://huggingface.co/datasets/mesolitica/IMDA-STT
  2. Pseudolabel Malaysian YouTube videos, https://huggingface.co/datasets/mesolitica/pseudolabel-malaysian-youtube-whisper-large-v3
  3. Malay Conversational Speech Corpus, https://huggingface.co/datasets/malaysia-ai/malay-conversational-speech-corpus
  4. Haqkiem TTS Dataset, this is private, but you can request access from https://www.linkedin.com/in/haqkiem-daim/
  5. Pseudolabel Nusantara audiobooks, https://huggingface.co/datasets/mesolitica/nusantara-audiobook

We follow the exact distillation process from https://github.com/huggingface/distil-whisper with minor changes. The script is at https://github.com/mesolitica/malaya-speech/tree/malaysian-speech/session/distill-whisper
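
For readers unfamiliar with distillation, here is a rough sketch of the kind of per-token objective such a process typically optimizes: a weighted sum of cross-entropy against the pseudo-label and KL divergence between teacher and student distributions. This is not taken from the training script linked above. The function names and the default values of `ce_weight`, `kl_weight` and `temperature` are illustrative assumptions only.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single token position."""
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    return sum(math.exp(tp) * (tp - sp) for tp, sp in zip(t_logp, s_logp))


def cross_entropy(student_logits, target_id):
    """Cross-entropy of the student against a (pseudo-)label token id."""
    return -log_softmax(student_logits)[target_id]


def distillation_loss(teacher_logits, student_logits, target_id,
                      ce_weight=0.8, kl_weight=1.0, temperature=2.0):
    """Weighted sum of CE and temperature-scaled KL.

    The weights and temperature are assumed values for illustration,
    not settings confirmed for this model.
    """
    ce = cross_entropy(student_logits, target_id)
    # Scaling by temperature**2 keeps gradient magnitudes comparable
    # across temperatures, a common convention in distillation.
    kl = kl_divergence(teacher_logits, student_logits, temperature) * temperature ** 2
    return ce_weight * ce + kl_weight * kl
```

When the student matches the teacher exactly, the KL term vanishes and only the cross-entropy term remains. In real training the loss is averaged over all token positions in a batch.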

Training logs are on Weights & Biases at https://wandb.ai/huseinzol05/distil-whisper?workspace=user-huseinzol05