Fine-tuned Japanese Whisper model for speech recognition using whisper-base

This model is openai/whisper-base fine-tuned on Japanese speech using the Common Voice, JVS and JSUT datasets. When using this model, make sure that your speech input is sampled at 16 kHz.
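
As a quick way to check the 16 kHz requirement before transcription, the sketch below reads the sample rate from a WAV header using only the Python standard library. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical, chosen for illustration, and the approach assumes uncompressed PCM WAV input; other formats need an audio library instead.

```python
import wave

EXPECTED_RATE = 16_000  # sample rate (Hz) the model expects


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file's sample rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

If a file does need resampling, loading it with `librosa.load(path, sr=16_000)`, as in the usage example in the next section, resamples it while loading.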

Usage

The model can be used directly as follows.

from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
import librosa
import torch

LANG_ID = "ja"
MODEL_ID = "Ivydata/whisper-base-japanese"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="ja", task="transcribe"
)
model.config.suppress_tokens = []

# Preprocess the dataset: load each audio file as a float array.
# librosa.load resamples to 16 kHz, the rate the model expects.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    # upper() only changes cased characters (e.g. Latin letters); kana and kanji are unaffected.
    batch["sentence"] = batch["sentence"].upper()
    batch["sampling_rate"] = sampling_rate
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
sample = test_dataset[0]
input_features = processor(sample["speech"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
# ['<|startoftranscript|><|ja|><|transcribe|><|notimestamps|>木村さんに電話を貸してもらいました。<|endoftext|>']

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
# ['木村さんに電話を貸してもらいました。']

Test Result

In the table below I report the Character Error Rate (CER) of each model on the TEDxJP-10K dataset (lower is better).

Model	CER
Ivydata/whisper-small-japanese	27.25%
Ivydata/wav2vec2-large-xlsr-53-japanese	27.87%
jonatasgrosman/wav2vec2-large-xlsr-53-japanese	34.18%
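
For readers unfamiliar with the metric, here is a minimal sketch of how CER is commonly computed: the character-level edit distance between the reference and the model output, divided by the reference length. The function names are hypothetical, and this card does not state which text normalization (punctuation, whitespace, and so on) was applied for the numbers above, so this sketch is not guaranteed to reproduce them.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in characters."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because the distance counts insertions as well as substitutions and deletions, CER can exceed 100% when the hypothesis is much longer than the reference.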
