Request for open sourcing evaluation code --- at least for librispeech

#38 opened by sovitrath

I am going through the new Voxtral model by Mistral AI. I am working on something where comparing benchmarks against the OpenAI Whisper models is essential.

In the tech report (https://arxiv.org/pdf/2507.13264), Voxtral Mini is reported to achieve 1.86 WER on the LibriSpeech test-clean set. However, I am unable to reproduce this, and the evaluation code is not open source either. Without normalizing the text, the WER was essentially 100%. With text normalization using the Hugging Face Whisper processor's normalizer (not ideal, but I could not find any other way), the WER is now 33%.
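For reference, this is roughly what I mean by normalizing with the Whisper normalizer (a minimal sketch; the openai/whisper-large-v3 checkpoint and the example strings are only illustrative, not the official Mistral evaluation setup):

from transformers import WhisperProcessor

# The Whisper processor is used here only for its English text normalizer.
whisper_processor = WhisperProcessor.from_pretrained('openai/whisper-large-v3')

reference = 'HE HOPED THERE WOULD BE STEW FOR DINNER'    # hypothetical LibriSpeech-style reference
prediction = 'He hoped there would be stew for dinner.'  # hypothetical model output

# Both normalize to the same lowercase, punctuation-free string, so the WER
# between them is 0 after normalization.
print(whisper_processor.tokenizer._normalize(reference))
print(whisper_processor.tokenizer._normalize(prediction))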

In contrast, I was able to reproduce the WER numbers reported for the OpenAI Whisper models using the Hugging Face evaluation code from the OpenAI Whisper model pages.

I am linking the notebook below for suggestions.
If anyone from Mistral AI can help, that would be great.

Link to notebook => https://colab.research.google.com/drive/1Ob9DwqBvGDobLjiOT3zLJV8IL-Xohu9r?authuser=1

Hey @sovitrath - I can't access the notebook, but to confirm: are you using the TranscriptionRequest API? https://huggingface.co/mistralai/Voxtral-Mini-3B-2507#transcription

The results you're getting suggest you might be using the ChatCompletionRequest API, which is not recommended for ASR; it is intended for chat only.
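For reference, a transcription request through the OpenAI-compatible client looks roughly like this (a minimal sketch; the localhost URL and the temp.wav path are assumptions for a locally served model, see the model card link above for the full recommended snippet):

from openai import OpenAI

# Assumes a local server (e.g. vLLM) exposing the OpenAI-compatible API.
client = OpenAI(api_key='EMPTY', base_url='http://localhost:8000/v1')

with open('temp.wav', 'rb') as f:
    response = client.audio.transcriptions.create(
        model='mistralai/Voxtral-Mini-3B-2507',
        file=f,
        language='en',
        temperature=0.0,
    )
print(response.text)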

Hello @sanchit-gandhi
I am running the model locally and using the apply_transcription_request function. Basically, the logic is the following.

import torch
import torchaudio

# `processor`, `model`, `repo_id`, and `device` refer to the Voxtral processor/model
# loaded earlier in the notebook; `whisper_processor` is an OpenAI Whisper processor
# used only for its English text normalizer.

def map_to_pred(batch):
    audio = batch['audio']

    # Write the audio array to a temporary wav file so it can be passed by path.
    path = 'temp.wav'
    torchaudio.save(path, torch.tensor([audio['array']]), audio['sampling_rate'])

    # Build the transcription prompt and audio features for Voxtral.
    input_features = processor.apply_transcrition_request(  # sic: spelled this way in the transformers version used here
        language='en',
        audio=path,
        sampling_rate=audio['sampling_rate'],
        model_id=repo_id,
    ).to(device, dtype=torch.bfloat16)

    # Normalize the reference with the Whisper tokenizer's English normalizer.
    batch['reference'] = whisper_processor.tokenizer._normalize(batch['text'])
    # Without normalization: batch['reference'] = batch['text']

    with torch.no_grad():
        predicted_ids = model.generate(**input_features)

    # Decode only the newly generated tokens, dropping the prompt tokens.
    transcription = processor.batch_decode(
        predicted_ids[:, input_features.input_ids.shape[1]:],
        skip_special_tokens=True
    )
    # Normalize the prediction the same way (transcription is a single-element list).
    batch['prediction'] = whisper_processor.tokenizer._normalize(transcription[0])

    return batch
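The function is then mapped over LibriSpeech test-clean and WER is computed, roughly like this (following the usual Hugging Face evaluation snippet; the dataset id and metric are the standard ones, the exact arguments may differ slightly from my notebook):

from datasets import load_dataset
from evaluate import load

# Standard LibriSpeech test-clean split and word error rate metric.
librispeech_test_clean = load_dataset('librispeech_asr', 'clean', split='test')
result = librispeech_test_clean.map(map_to_pred)

wer = load('wer')
print(100 * wer.compute(references=result['reference'], predictions=result['prediction']))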

The notebook I linked above is public. Not sure why it is not accessible. Please let me know if there is any other way to send the code.

Hey there,

Here is a reproducer using the Transformers implementation, yielding 1.89% WER on the LibriSpeech test-clean set 🤗
As noted in the inference snippets and the docs, you need to set the max_new_tokens generation parameter 😉.

Indeed, in Transformers, the convention is to default this parameter to 20.
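Concretely, the only change needed in the snippet above is to pass a larger generation budget, e.g. (a minimal sketch; 500 is just an illustrative value, not necessarily the one used for the 1.89% figure):

with torch.no_grad():
    predicted_ids = model.generate(**input_features, max_new_tokens=500)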

Thank you @eustlb
This helps a lot.
