Estimin3n - Open-source multimodal Kazakh audio/text-to-text LLM

The model accepts audio (16 kHz, mono) and/or text as input
and generates text responses or transcriptions.

Backbone: Gemma 3n. Inputs are formatted with AutoProcessor.apply_chat_template. This repository contains the full weights (safetensors) and the processor/tokenizer configs.

Hardware support is provided in partnership with ait🍅maton.

Terms of Use: Terms

Benchmark Results

Audio

Word error rate (WER) on the KSC2 test set and other benchmarks:

Text

Installation

pip install -U transformers accelerate soundfile librosa safetensors
# Optional for 4‑bit loading
pip install -U bitsandbytes

A recent transformers release is required: Gemma 3n support (Gemma3nForConditionalGeneration) was added in 4.53.0, so use transformers >= 4.53.

Quickstart (Python)

Replace your-username/Estimin3n with your model repo id.

Audio transcription (Kazakh)

import librosa
import soundfile as sf
import torch
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

repo_id = "your-username/Estimin3n"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

model = Gemma3nForConditionalGeneration.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype=dtype, device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)

# Load the audio, downmix to mono, and resample to the 16 kHz the model expects.
audio, sr = sf.read("sample.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)
audio = audio.astype("float32")
if sr != 16000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

messages = [
    {"role": "system", "content": [{"type": "text", "text": (
        "You are an expert assistant that accurately transcribes Kazakh speech. "
        "Output clean Kazakh text only."
    )}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": audio},
        {"type": "text", "text": "Transcribe this audio."}
    ]},
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(device)

if "input_features" in inputs:
    inputs["input_features"] = inputs["input_features"].to(dtype)
for k in ["input_ids", "attention_mask"]:
    if k in inputs:
        inputs[k] = inputs[k].long()

with torch.no_grad():
    out = model.generate(
        **inputs, max_new_tokens=256, do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )
gen = out[0][inputs["input_ids"].shape[1]:]
print(processor.tokenizer.decode(gen, skip_special_tokens=True).strip())
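
Since the model expects 16 kHz mono audio, it can help to check a file's format before transcribing it. The following is a minimal sketch using only the Python standard library; the helper names wav_format and is_model_ready are hypothetical and not part of this repository, and the wave module only reads uncompressed PCM WAV files.

```python
import wave


def wav_format(path):
    """Return (sample_rate, channels) read from a PCM WAV file header."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate(), wf.getnchannels()


def is_model_ready(path, target_sr=16000):
    """True if the file is already mono at the target sample rate (16 kHz assumed)."""
    sample_rate, channels = wav_format(path)
    return sample_rate == target_sr and channels == 1
```

If is_model_ready("sample.wav") returns False, the audio should be downmixed and/or resampled before it is passed to the processor.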

Text → text (no audio)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "your-username/Estimin3n"
tok = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Қазақ тілінде қысқаша сәлемдесу мәтінін жазыңыз."
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Loading options

bfloat16/float16 on a GPU with ≥24 GB VRAM is recommended for best quality.
4‑bit loading (bitsandbytes) reduces VRAM use to ~8–12 GB at some quality cost.

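from transformers import BitsAndBytesConfig, Gemma3nForConditionalGeneration

# Passing load_in_4bit directly to from_pretrained is deprecated;
# wrap the quantization settings in a BitsAndBytesConfig instead.
model = Gemma3nForConditionalGeneration.from_pretrained(
    "your-username/Estimin3n",
    trust_remote_code=True,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)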

Training summary

Base: unsloth/gemma-3n-E4B-it
Method: SFT (TRL SFTTrainer) with LoRA
LoRA: r=8, lora_alpha=16, lora_dropout=0, targets in attention/MLP and audio submodules
Example SFT params: per_device_train_batch_size=1, gradient_accumulation_steps=2, learning_rate=5e-5, num_train_epochs>=1, lr_scheduler_type="cosine", weight_decay=0.01, remove_unused_columns=False, dataset_text_field="messages", max_seq_length=2048
Data: custom Kazakh audio corpus (FLAC + transcripts) converted to HF DatasetDict
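
Two quantities follow directly from the numbers above: the LoRA scaling factor and the effective batch size. The sketch below computes them in plain Python. It assumes the standard LoRA scaling of lora_alpha / r (that is, rank-stabilized LoRA is not used) and a single training device; neither detail is stated in this summary, and the helper names are hypothetical.

```python
# Values copied from the training summary above.
LORA = {"r": 8, "lora_alpha": 16, "lora_dropout": 0.0}
SFT = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 2,
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    "max_seq_length": 2048,
}


def lora_scaling(lora):
    """Standard LoRA scaling applied to the low-rank update: alpha / r."""
    return lora["lora_alpha"] / lora["r"]


def effective_batch_size(sft, num_devices=1):
    """Sequences per optimizer step; num_devices=1 is an assumption."""
    return (
        sft["per_device_train_batch_size"]
        * sft["gradient_accumulation_steps"]
        * num_devices
    )


print(lora_scaling(LORA))         # 2.0
print(effective_batch_size(SFT))  # 2
```

With these settings the low-rank update is scaled by 2.0, and each optimizer step sees 2 sequences of up to 2048 tokens per device.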

Evaluation

Companion scripts provide evaluation pipelines:

WER/CER on your dataset (logs to TSV with detailed samples)
KazMMLU multiple‑choice benchmark (Kazakh/Russian subsets)
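
For reference, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words; CER is the same ratio over characters. The following is a minimal standard-library sketch of both metrics, not the companion scripts' implementation. It assumes whitespace tokenization for WER and counts every character, including spaces, for CER; text normalization (case, punctuation) is left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, wer("a b c d", "a x c") is 0.5: one substitution and one deletion against four reference words.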

Public benchmark scores will be added in upcoming releases.

Limitations & Safety

The model can generate inaccurate or biased outputs; human review is advised.
For audio inputs, ensure user privacy and consent.
Quality depends on recording conditions (noise, accent, speed).

Acknowledgements

Google Gemma 3N; Hugging Face transformers / datasets / trl
Unsloth (efficient fine‑tuning & quantization)
KazMMLU community
ait🍅maton - https://huggingface.co/aitomaton

Citations

If this project or its results are useful, please cite the KSC2 dataset paper and Gemma 3n:

@inproceedings{mussakhojayeva22_interspeech,
  title     = {KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus},
  author    = {Saida Mussakhojayeva and Yerbolat Khassanov and Huseyin {Atakan Varol}},
  year      = {2022},
  booktitle = {Interspeech 2022},
  pages     = {1367--1371},
  doi       = {10.21437/Interspeech.2022-421},
  issn      = {2958-1796},
}

@article{gemma_3n_2025,
    title={Gemma 3n},
    url={https://ai.google.dev/gemma/docs/gemma-3n},
    publisher={Google DeepMind},
    author={Gemma Team},
    year={2025}
}
