Update forced decoder ids
#79, opened by sanchit-gandhi (HF staff)
The forced decoder ids for large-v3 currently set the default task to translate:
>>> from transformers import WhisperTokenizer, GenerationConfig
>>> tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3")
>>> generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v3")
>>> generation_config.forced_decoder_ids
[[1, None], [2, 50359]]
>>> tokenizer.decode(generation_config.forced_decoder_ids[1][1])
'<|translate|>'
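Whereas for large-v2 and the other multilingual models, the default task is set to transcribe. The forced id is the same (50359) in both configs, but it decodes to a different token under each tokenizer: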
>>> from transformers import WhisperTokenizer, GenerationConfig
>>> tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")
>>> generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v2")
>>> generation_config.forced_decoder_ids
[[1, None], [2, 50359]]
>>> tokenizer.decode(generation_config.forced_decoder_ids[1][1])
'<|transcribe|>'
This PR updates the forced decoder ids for large-v3 to be consistent with the other multilingual Whisper models (transcribe).
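For reference, here is a minimal sketch of the kind of fix this amounts to: look the task token up by name instead of reusing a hard-coded id. It deliberately avoids loading anything from the Hub. The helper names `forced_task` and `set_task` and the vocabulary excerpt are hypothetical. The mapping 50359 -> `<|translate|>` comes from the output above, while 50360 for `<|transcribe|>` in large-v3 is an assumption that should be checked against the real tokenizer.

```python
from typing import Dict, List, Optional

ForcedIds = List[List[Optional[int]]]

# Hypothetical excerpt of the large-v3 vocabulary.
# 50359 -> <|translate|> matches the decode output shown in this PR;
# 50360 -> <|transcribe|> is an assumption, verify it with the tokenizer.
V3_TOKEN_TO_ID: Dict[str, int] = {
    "<|translate|>": 50359,
    "<|transcribe|>": 50360,
}


def forced_task(forced_decoder_ids: ForcedIds, token_to_id: Dict[str, int]) -> Optional[str]:
    """Return the token forced at decoder position 2, or None if nothing is forced there."""
    id_to_token = {idx: tok for tok, idx in token_to_id.items()}
    for position, token_id in forced_decoder_ids:
        if position == 2 and token_id is not None:
            return id_to_token.get(token_id)
    return None


def set_task(forced_decoder_ids: ForcedIds, token_to_id: Dict[str, int], task: str = "transcribe") -> ForcedIds:
    """Return a copy of the forced ids with position 2 set to the named task token."""
    task_id = token_to_id[f"<|{task}|>"]
    return [[pos, task_id if pos == 2 else tok] for pos, tok in forced_decoder_ids]


current = [[1, None], [2, 50359]]  # value reported for large-v3 above
print(forced_task(current, V3_TOKEN_TO_ID))  # <|translate|>

updated = set_task(current, V3_TOKEN_TO_ID, task="transcribe")
print(forced_task(updated, V3_TOKEN_TO_ID))  # <|transcribe|>
```

With a real checkpoint, the id would come from the tokenizer itself (for example via `tokenizer.convert_tokens_to_ids("<|transcribe|>")`) rather than from a hand-written dictionary, so the config cannot drift when the vocabulary changes.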
sanchit-gandhi changed pull request status to merged