๐Ÿ—ฃ๏ธ Kiswahili Sahihi ASR โ€” Swahili Audio Transcription

This model provides high-quality, long-form Kiswahili speech transcription from multiple audio formats (e.g., .mp3, .wav, .m4a, .aac, .ogg, .flac, .amr) using a simple pipeline: convert the file to 16 kHz mono WAV, split it into chunks, and transcribe each chunk.
It's optimized for speed, accuracy, and real-world usability, even on modest hardware.


🚀 Key Features

  • ✅ Supports multiple audio formats via FFmpeg + Pydub
  • 🧠 Built on 🤗 Transformers
  • 🪶 Automatically converts audio to 16 kHz mono
  • ⏳ Transcribes long recordings using fixed-length chunking (default: 60 s per chunk)
  • 🖥️ Works on both CPU and GPU
  • 🌍 Focused on Kiswahili language transcription

🦊 Example using the model



# ============================================
# 🪄 Full Swahili Audio Transcription Script
# ============================================
# 📦 Install (the "!" prefix runs shell commands from a notebook cell, e.g. Colab)
!pip install transformers pydub
!pip install "datasets<4.0.0"
!pip install torchvision==0.21.0 torchaudio==2.6.0 jiwer evaluate
# Quote version specifiers so the shell does not treat ">" as a redirect
!pip install -U soundfile librosa "accelerate>=0.26.0" tensorboard bitsandbytes
!apt-get -y install ffmpeg

import torch
import librosa
import numpy as np
from pydub import AudioSegment
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import os

# =============================
# 1. 🔸 Model Setup
# =============================
model_id = "keystats/kiswahili_sahihi_asr"
processor = AutoProcessor.from_pretrained(model_id)

# Use the GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Use float32 to avoid half precision mismatch issues
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device, dtype=torch.float32)

# =============================
# 2. 🔸 Convert any format to WAV
# =============================
def convert_to_wav(input_path, output_path="converted.wav"):
    try:
        audio = AudioSegment.from_file(input_path)
        audio = audio.set_frame_rate(16000).set_channels(1)
        audio.export(output_path, format="wav")
        return output_path
    except Exception as e:
        raise RuntimeError(f"❌ Could not convert file. Check if FFmpeg is installed and file is supported. Error: {e}") from e

# 👇 Just change this path to your audio file
audio_path = "path/to/your_swahili_audio.mp3"  # placeholder: replace with a real file path
wav_path = convert_to_wav(audio_path)

# =============================
# 3. 🔸 Load audio and chunk
# =============================
audio_input, sr = librosa.load(wav_path, sr=16000, mono=True)

chunk_length_s = 60  # seconds
chunk_size = chunk_length_s * sr
num_chunks = int(np.ceil(len(audio_input) / chunk_size))

print(f"🔹 Total length: {len(audio_input)/sr:.2f} sec | Splitting into {num_chunks} chunks...")

# =============================
# 4. 🔸 Transcribe each chunk
# =============================
full_transcription = []

for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(audio_input))
    chunk = audio_input[start:end]

    # Move the features to whichever device the model is on (CPU or GPU)
    inputs = processor(
        chunk,
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    ).to(model.device, dtype=torch.float32)

    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_length=20000)

    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    full_transcription.append(text.strip())

# =============================
# 5. 🔸 Combine final transcript
# =============================
final_text = " ".join(full_transcription)
print("📝 Final Transcription:")
print(final_text)
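
Because each chunk covers a fixed 60-second window, the per-chunk texts collected in `full_transcription` can also be labelled with approximate start times instead of being joined into one string. The following is a minimal sketch of that idea; `format_timestamped` is a hypothetical helper, and the timestamps are chunk offsets only, not word-level alignments produced by the model.

```python
def format_timestamped(chunk_texts, chunk_length_s=60):
    """Prefix each chunk's text with its start offset as [HH:MM:SS].

    Hypothetical helper: assumes chunk i starts at i * chunk_length_s seconds,
    which matches the fixed-length chunking used in the script above.
    """
    lines = []
    for i, text in enumerate(chunk_texts):
        start = i * chunk_length_s
        hours, remainder = divmod(start, 3600)
        minutes, seconds = divmod(remainder, 60)
        lines.append(f"[{hours:02d}:{minutes:02d}:{seconds:02d}] {text}")
    return "\n".join(lines)


# Stand-in for the `full_transcription` list built by the script
example_chunks = [
    "Karibu kwenye mfumo wetu wa Kiswahili Sahihi.",
    "Habari yako, karibu tena kesho kwa mahojiano mengine.",
]
print(format_timestamped(example_chunks))
```

This keeps the transcript navigable for long recordings without changing how the model is called.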

🧪 Example Output

| 🎧 Input Audio | 📝 Transcription Output |
|---|---|
| mashairi_sauti.mp3 | “Karibu kwenye mfumo wetu wa Kiswahili Sahihi.” |
| mazungumzo_flac.flac | “Habari yako, karibu tena kesho kwa mahojiano mengine.” |

๐Ÿ› ๏ธ Tips for Best Results

  • Use clear audio without background noise.
  • Long recordings are automatically split into 60-second chunks.
  • Works with .mp3, .wav, .m4a, .aac, .ogg, .flac, .amr and more.
  • Ensure audio is sampled at 16 kHz and mono (automatically handled).
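
Most conversion failures come from a missing FFmpeg binary or a wrong file path, so a quick check before transcription can save time. The sketch below is one possible pre-flight check using only the standard library; `preflight` and `KNOWN_EXTENSIONS` are hypothetical names, and the extension set simply mirrors the formats listed in this README rather than everything FFmpeg supports.

```python
import os
import shutil

# Extensions named in this README; FFmpeg can decode many more.
KNOWN_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac", ".ogg", ".flac", ".amr"}


def preflight(path, ffmpeg_found=None):
    """Return a list of human-readable problems; an empty list means none were found.

    Hypothetical helper. `ffmpeg_found` can be passed explicitly (e.g. in tests);
    by default the function looks for an `ffmpeg` executable on PATH.
    """
    if ffmpeg_found is None:
        ffmpeg_found = shutil.which("ffmpeg") is not None

    problems = []
    if not ffmpeg_found:
        problems.append("ffmpeg not found on PATH")
    if not os.path.isfile(path):
        problems.append(f"file not found: {path!r}")
    ext = os.path.splitext(path)[1].lower()
    if ext not in KNOWN_EXTENSIONS:
        problems.append(f"extension {ext!r} is not in the README's format list")
    return problems


# Example: report problems for a path that does not exist
for problem in preflight("no_such_recording.mp3"):
    print("Problem:", problem)
```

An unlisted extension is reported as a warning-style message only; the file may still convert if FFmpeg recognises its contents.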

🌟 Acknowledgements


📢 Contribute

  • 🧪 Share more Swahili audio samples
  • 🧑‍💻 Report issues or improvements
  • 🌍 Help expand coverage for different accents and dialects

🧭 Citation


@model{kiswahili_sahihi_asr,
  author    = {Jackson Kahungu},
  title     = {Kiswahili Sahihi ASR: Swahili Audio Transcription},
  year      = {2025},
  publisher = {Hugging Face}
}


✨ Final Note

“If you like the model, leave a like ❤🧡❤”
This model may not be perfect, but it provides a strong baseline for building future Swahili transcription systems.
Together, we can make Swahili voice technology accessible to everyone. ✨ 🎊 KISWAHILI KITUKUZWE (may Kiswahili be exalted) 🎉
