metadata
license: apache-2.0
datasets:
  - Ayoub-Laachir/Darija_Dataset
language:
  - ary
metrics:
  - wer
  - cer
base_model:
  - openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition

Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)

Model Overview

Model Name: Whisper Large V3 (Fine-tuned for Moroccan Darija)
Author: Ayoub Laachir
License: apache-2.0
Repository: Ayoub-Laachir/MaghrebVoice
Dataset: Ayoub-Laachir/Darija_Dataset

Description

This model is a fine-tuned version of OpenAI’s Whisper Large V3, adapted to recognize and transcribe Moroccan Darija, the Arabic dialect spoken in Morocco, which is influenced by Amazigh (Berber), French, and Spanish. The project aims to improve technological accessibility for millions of Moroccans and to serve as a blueprint for similar work on other underrepresented languages.

Technologies Used

  • Whisper Large V3: OpenAI’s state-of-the-art speech recognition model
  • PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation): An efficient fine-tuning technique
  • Google Colab: Cloud environment for training the model
  • Hugging Face: Hosting the dataset and final model
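
To give a sense of why LoRA is efficient, the sketch below counts trainable parameters for one weight matrix and shows how a low-rank update is folded back into the base weight. The layer size (1280 x 1280), rank (8), and alpha value are illustrative assumptions, not the settings used to train this model, and the helper names are hypothetical.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_out * d_in            # fine-tuning W directly
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, alpha, rank):
    """Fold a LoRA update into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Assumed sizes, for illustration only.
full, lora = lora_param_counts(d_out=1280, d_in=1280, rank=8)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.2%}")

# Tiny 2 x 2 example with a rank-1 update.
W = [[1, 0], [0, 1]]
B = [[1], [2]]      # 2 x 1
A = [[3, 4]]        # 1 x 2
merged = merge_lora(W, A, B, alpha=2, rank=1)
```

Only the small matrices A and B are trained; after training they can be merged into the base weight so that inference has no extra cost.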

Dataset Preparation

The dataset preparation involved several steps:

  1. Cleaning: Correcting bad transcriptions and standardizing word spellings.
  2. Audio Processing: Converting all samples to a 16 kHz sample rate.
  3. Dataset Split: Creating a training set of 3,312 samples and a test set of 150 samples.
  4. Format Conversion: Transforming the dataset into the parquet file format.
  5. Uploading: Publishing the prepared dataset to the Hugging Face Hub.
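
The split in step 3 can be sketched as follows. The card does not say how the samples were assigned to each set, so the seeded random shuffle and the `split_dataset` helper are assumptions made for illustration; only the set sizes (3,312 and 150) come from the text.

```python
import random


def split_dataset(samples, test_size, seed=0):
    """Shuffle a copy of `samples` and hold out `test_size` items for testing."""
    if not 0 <= test_size <= len(samples):
        raise ValueError("test_size must be between 0 and len(samples)")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    return shuffled[test_size:], shuffled[:test_size]


# Stand-in sample IDs; a real dataset would hold audio/transcript pairs.
sample_ids = list(range(3312 + 150))
train_ids, test_ids = split_dataset(sample_ids, test_size=150)
print(len(train_ids), len(test_ids))
```

Fixing the seed keeps the test set stable across runs, which matters when comparing WER and CER between training runs.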

Training Process

The model was fine-tuned using the following parameters:

  • Per device train batch size: 8
  • Gradient accumulation steps: 1
  • Learning rate: 1e-4 (0.0001)
  • Warmup steps: 200
  • Number of train epochs: 2
  • Logging and evaluation: every 50 steps
  • Weight decay: 0.01

Training progress showed a steady decrease in both training and validation loss over 8000 steps.
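
As an illustration of how the learning rate and warmup settings interact, here is a minimal sketch of a linear warmup followed by linear decay. The card does not state which scheduler was used; linear decay is assumed here because it is the default in the Hugging Face Trainer. The `total_steps` value reuses the step count mentioned above, and the function name is hypothetical.

```python
def linear_warmup_decay(step, base_lr=1e-4, warmup_steps=200, total_steps=8000):
    """Learning rate at `step`: linear ramp to base_lr, then linear decay to 0."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Learning rate at a few points in training (assumed schedule).
for s in (0, 100, 200, 4100, 8000):
    print(s, linear_warmup_decay(s))
```

The warmup phase avoids large updates while the freshly initialized LoRA weights are still far from a good solution.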

Testing and Evaluation

The model was evaluated with the following metrics:

  • Word Error Rate (WER): 3.1467%
  • Character Error Rate (CER): 2.3893%

These metrics demonstrate the model's ability to accurately transcribe Moroccan Darija speech.
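
For readers unfamiliar with these metrics, the sketch below computes WER and CER from the Levenshtein edit distance between a reference and a hypothesis. It is a plain-Python illustration, not the evaluation code used for this model; any text normalization applied before scoring is not described in this card and is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Placeholder strings; real evaluation would use Darija transcripts.
print(wer("the cat sat", "the cat sit"))
print(cer("abcd", "abxd"))
```

Both scores are ratios, so a WER of 3.1467% corresponds to roughly three word-level errors per hundred reference words.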

The fine-tuned model handles Darija-specific words and sentence structure better and is more accurate overall.

Audio Transcription Script with PEFT Layers

This script transcribes audio files with the fine-tuned Whisper Large V3 model for Moroccan Darija. It loads the LoRA adapter weights on top of the base model using PEFT (Parameter-Efficient Fine-Tuning), merges them into the base weights, and runs the merged model through a Transformers pipeline.

Required Libraries

Before running the script, make sure the following libraries are installed. In a notebook environment such as Google Colab, where a leading ! runs a shell command, you can install them with:

!pip install --upgrade pip
!pip install --upgrade transformers accelerate librosa soundfile pydub
!pip install peft==0.3.0  # Install PEFT library

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import librosa
import soundfile as sf
from pydub import AudioSegment
from peft import PeftModel, PeftConfig  # Import PEFT classes

# Set the device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Configuration for the base Whisper model
base_model_name = "openai/whisper-large-v3"  # Base model for Whisper
processor = AutoProcessor.from_pretrained(base_model_name)  # Load the processor

# Load your fine-tuned model configuration
model_name = "Ayoub-Laachir/MaghrebVoice_OnlyLoRaLayers"  # Fine-tuned model with LoRA layers
peft_config = PeftConfig.from_pretrained(model_name)  # Load PEFT configuration

# Load the base model
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(base_model_name, torch_dtype=torch_dtype).to(device)  # Load the base model in the same dtype the pipeline uses

# Load the PEFT model
model = PeftModel.from_pretrained(base_model, model_name).to(device)  # Load the PEFT model

# Merge the LoRA weights with the base model
model = model.merge_and_unload()  # Combine the LoRA weights into the base model

# Configuration for transcription
config = {
    "language": "arabic",  # Language for transcription
    "task": "transcribe",  # Task type
    "chunk_length_s": 30,  # Length of each audio chunk in seconds
    "stride_length_s": 5,  # Stride kept on each side of a chunk for overlap, in seconds
}

# Initialize the automatic speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,  # Use the merged model
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=config["chunk_length_s"],
    stride_length_s=config["stride_length_s"],
)

# Convert audio to 16kHz sampling rate
def convert_audio_to_16khz(input_path, output_path):
    audio, sr = librosa.load(input_path, sr=None)  # Load the audio file
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # Resample to 16kHz
    sf.write(output_path, audio_16k, 16000)  # Save the converted audio

# Format time in HH:MM:SS.milliseconds
def format_time(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:06.3f}"

# Transcribe audio file
def transcribe_audio(audio_path):
    try:
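        result = pipe(
            audio_path,
            return_timestamps=True,  # Return a timestamp pair for each chunk
            generate_kwargs={"language": config["language"], "task": config["task"]},  # Apply the language and task set above
        )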
        return result["chunks"]  # Return transcription chunks
    except Exception as e:
        print(f"Error transcribing audio: {e}")
        return None

# Main function to execute the transcription process
def main():
    # Specify input and output audio paths (update paths as needed)
    input_audio_path = "/path/to/your/input/audio.mp3"  # Replace with your input audio path
    output_audio_path = "/path/to/your/output/audio_16khz.wav"  # Replace with your output audio path

    # Convert audio to 16kHz
    convert_audio_to_16khz(input_audio_path, output_audio_path)

    # Transcribe the converted audio
    transcription_chunks = transcribe_audio(output_audio_path)

    if transcription_chunks:
        print("WEBVTT\n")  # Print header for WEBVTT format
        for chunk in transcription_chunks:
            start, end = chunk["timestamp"]
            if end is None:  # Whisper can leave the final end timestamp unset
                end = start
            start_time = format_time(start)  # Format start time
            end_time = format_time(end)      # Format end time
            text = chunk["text"]                              # Get the transcribed text
            print(f"{start_time} --> {end_time}")           # Print time range
            print(f"{text}\n")                               # Print transcribed text
    else:
        print("Transcription failed.")

if __name__ == "__main__":
    main()
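
The script above prints the cues to the console. If a subtitle file is needed instead, the cues can be assembled into a single WebVTT string first, as in the sketch below. The `chunks_to_webvtt` helper and its `audio_duration` parameter are hypothetical; the input is assumed to have the same shape as the pipeline's `chunks` output (a list of dicts with `timestamp` and `text` keys).

```python
def format_vtt_time(seconds):
    """Format seconds as HH:MM:SS.mmm, the timestamp form WebVTT expects."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = seconds % 60
    return f"{hours:02d}:{minutes:02d}:{secs:06.3f}"


def chunks_to_webvtt(chunks, audio_duration=None):
    """Build a WebVTT document from pipeline-style chunks."""
    lines = ["WEBVTT", ""]
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if start is None:
            continue  # no usable timing for this chunk
        if end is None:
            # The final chunk may have no end time; fall back to the audio
            # duration when the caller provides it, else to the start time.
            end = audio_duration if audio_duration is not None else start
        lines.append(f"{format_vtt_time(start)} --> {format_vtt_time(end)}")
        lines.append(chunk["text"].strip())
        lines.append("")
    return "\n".join(lines)


# Stand-in chunks shaped like the pipeline output.
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " hello"},
    {"timestamp": (2.5, None), "text": " world"},
]
vtt_text = chunks_to_webvtt(example_chunks, audio_duration=4.0)
print(vtt_text)
```

The returned string can be written to a `.vtt` file with UTF-8 encoding so that Arabic-script text is preserved.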

Challenges and Future Improvements

Challenges Encountered

  • Diverse spellings of words in Moroccan Darija
  • Cleaning and standardizing the dataset

Future Improvements

  • Expand the dataset to include more Darija accents and expressions
  • Further fine-tune the model for specific Moroccan regional dialects
  • Explore integration into practical applications like voice assistants and transcription services

Conclusion

This project marks a significant step towards making AI more accessible for Moroccan Arabic speakers. The success of this fine-tuned model highlights the potential for adapting advanced AI technologies to underrepresented languages, serving as a model for similar initiatives in North Africa.