Kosuke-Szk's picture
Update README.md
4bf6025
|
raw
history blame
2.5 kB
metadata
license: apache-2.0
datasets:
  - common_voice
language:
  - ja
tags:
  - audio

Fine-tuned Japanese Whisper model for speech recognition using whisper-base

Fine-tuned openai/whisper-base on Japanese using Common Voice, JVS and JSUT. When using this model, make sure that your speech input is sampled at 16kHz.

Usage

The model can be used directly as follows.

from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
import librosa
import torch

LANG_ID = "ja"
MODEL_ID = "Ivydata/whisper-base-japanese"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="ja", task="transcribe"
)
model.config.suppress_tokens = []

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    batch["sampling_rate"] = sampling_rate
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
sample = test_dataset[0]
input_features = processor(sample["speech"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
# ['<|startoftranscript|><|ja|><|transcribe|><|notimestamps|>ζœ¨ζ‘γ•γ‚“γ«ι›»θ©±γ‚’θ²Έγ—γ¦γ‚‚γ‚‰γ„γΎγ—γŸγ€‚<|endoftext|>']

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
# ['ζœ¨ζ‘γ•γ‚“γ«ι›»θ©±γ‚’θ²Έγ—γ¦γ‚‚γ‚‰γ„γΎγ—γŸγ€‚']

Test Result

In the table below I report the Character Error Rate (CER) of the model tested on TEDxJP-10K dataset.

Model CER
Ivydata/whisper-small-japanese 27.25%
Ivydata/wav2vec2-large-xlsr-53-japanese 27.87%
jonatasgrosman/wav2vec2-large-xlsr-53-japanese 34.18%