metadata
license: apache-2.0
datasets:
- common_voice
language:
- ja
tags:
- audio
Fine-tuned Japanese Whisper model for speech recognition using whisper-base
This model is openai/whisper-base fine-tuned on Japanese using the Common Voice, JVS, and JSUT datasets. When using this model, make sure that your speech input is sampled at 16 kHz.
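One way to check the sampling-rate requirement before transcription is to read the rate from the file header. The sketch below is a minimal illustration that assumes the input is an uncompressed PCM WAV file and uses only the Python standard library; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not part of this model or of transformers.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate (Hz) this model card asks for


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file's sample rate differs from the expected one."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling` returns True, the audio should be resampled first. The usage example in this card does this implicitly by calling `librosa.load` with `sr=16_000`.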
Usage
The model can be used directly as follows.
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
import librosa
import torch
LANG_ID = "ja"
MODEL_ID = "Ivydata/whisper-base-japanese"
SAMPLES = 10
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="ja", task="transcribe"
)
model.config.suppress_tokens = []
# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    batch["sampling_rate"] = sampling_rate
    return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
sample = test_dataset[0]
input_features = processor(sample["speech"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
# ['<|startoftranscript|><|ja|><|transcribe|><|notimestamps|>木村さんに電話を貸してもらいました。<|endoftext|>']
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
# ['木村さんに電話を貸してもらいました。']
Test Result
In the table below I report the Character Error Rate (CER) of the model tested on the TEDxJP-10K dataset.
Model | CER |
---|---|
Ivydata/whisper-small-japanese | 27.25% |
Ivydata/wav2vec2-large-xlsr-53-japanese | 27.87% |
jonatasgrosman/wav2vec2-large-xlsr-53-japanese | 34.18% |
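
For reference, CER is commonly defined as the character-level edit distance between the reference and the hypothesis, divided by the number of characters in the reference. The sketch below is a minimal pure-Python illustration of that definition. The function names `edit_distance` and `cer` are hypothetical, and this card does not state the text normalization or tooling used to produce the numbers above, so the sketch is not guaranteed to reproduce them.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because the distance is normalized by the reference length only, a hypothesis with many inserted characters can yield a CER above 100%.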