Estimin3n - Open-source multimodal Kazakh audio/text-to-text LLM
- takes audio (16 kHz, mono) and/or text as input,
- generates text responses or transcriptions.
Backbone: Gemma 3N. Input formatting is done via AutoProcessor.apply_chat_template. This repository contains full weights (safetensors) and processor/tokenizer configs.
Hardware support is provided in partnership with ait🍅maton
Terms of Use: Terms
Benchmark Results
Audio
Word error rate (WER) on the KSC2 test set and other benchmarks:
Text
Installation
pip install -U transformers accelerate soundfile librosa safetensors
# Optional for 4‑bit loading
pip install -U bitsandbytes
A recent transformers release is required: Gemma 3n support (Gemma3nForConditionalGeneration) was added in 4.53.0.
Quickstart (Python)
Replace your-username/Estimin3n with your model repo id.
Audio transcription (Kazakh)
import librosa
import soundfile as sf
import torch
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

repo_id = "your-username/Estimin3n"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

model = Gemma3nForConditionalGeneration.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype=dtype, device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)

# Load the recording, downmix to mono, and resample to the expected 16 kHz.
audio, sr = sf.read("sample.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)
audio = audio.astype("float32")
if sr != 16000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

messages = [
    {"role": "system", "content": [{"type": "text", "text": (
        "You are an expert assistant that accurately transcribes Kazakh speech. "
        "Output clean Kazakh text only."
    )}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": audio},
        {"type": "text", "text": "Transcribe this audio."}
    ]},
]

# The processor applies the chat template and extracts the audio features.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

# Match the audio features to the model dtype; keep token tensors as int64.
if "input_features" in inputs:
    inputs["input_features"] = inputs["input_features"].to(dtype)
for k in ["input_ids", "attention_mask"]:
    if k in inputs:
        inputs[k] = inputs[k].long()

with torch.no_grad():
    out = model.generate(
        **inputs, max_new_tokens=256, do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Drop the prompt tokens and decode only the newly generated text.
gen = out[0][inputs["input_ids"].shape[1]:]
print(processor.tokenizer.decode(gen, skip_special_tokens=True).strip())
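When transcribing many files, the message structure from the example above can be factored into a small helper. This is a minimal sketch that reuses the same system prompt and content layout; the name build_transcription_messages is hypothetical and not part of transformers.

```python
from typing import Any

# Default system prompt copied from the quickstart example.
SYSTEM_PROMPT = (
    "You are an expert assistant that accurately transcribes Kazakh speech. "
    "Output clean Kazakh text only."
)


def build_transcription_messages(
    audio: Any,
    instruction: str = "Transcribe this audio.",
    system_prompt: str = SYSTEM_PROMPT,
) -> list[dict[str, Any]]:
    """Return chat messages in the quickstart layout (hypothetical helper)."""
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": [
            {"type": "audio", "audio": audio},      # 16 kHz mono float32 samples
            {"type": "text", "text": instruction},  # what to do with the audio
        ]},
    ]
```

The returned list can be passed to processor.apply_chat_template in place of the inline messages literal.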
Text → text (no audio)
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "your-username/Estimin3n"
tok = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

# Prompt: "Write a short greeting text in Kazakh."
prompt = "Қазақ тілінде қысқаша сәлемдесу мәтінін жазыңыз."
inputs = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)

# Decode only the tokens generated after the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Loading options
- bfloat16/float16 on ≥24GB VRAM is recommended for best quality.
- 4‑bit (bitsandbytes) reduces VRAM to ~8–12GB at some quality cost.
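from transformers import BitsAndBytesConfig, Gemma3nForConditionalGeneration

# Passing load_in_4bit directly to from_pretrained is deprecated; use a quantization config.
model = Gemma3nForConditionalGeneration.from_pretrained(
    "your-username/Estimin3n",
    trust_remote_code=True,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)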
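As a rough cross-check of the VRAM figures above, weight memory scales with the number of bits per parameter. The sketch below is back-of-the-envelope arithmetic, assuming roughly 8 billion stored parameters for the E4B variant (an assumption, not stated in this card). It ignores activations, the KV cache, and framework overhead, and not every module is necessarily quantized in 4‑bit mode, so real usage is higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Assumed parameter count; replace with the real value for your checkpoint.
ASSUMED_PARAMS = 8e9

for bits in (16, 4):
    gib = weight_memory_gib(ASSUMED_PARAMS, bits)
    print(f"{bits}-bit weights: ~{gib:.1f} GiB")
```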
Training summary
- Base: unsloth/gemma-3n-E4B-it
- Method: SFT (TRL SFTTrainer) with LoRA
- LoRA: r=8, lora_alpha=16, lora_dropout=0, targets in attention/MLP and audio submodules
- Example SFT params: per_device_train_batch_size=1, gradient_accumulation_steps=2, lr=5e-5, epochs>=1, cosine, weight_decay=0.01, remove_unused_columns=False, dataset_text_field="messages", max_seq_length=2048
- Data: custom Kazakh audio corpus (FLAC + transcripts) converted to HF DatasetDict
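Two derived quantities follow from the numbers in this summary: the effective batch size and the LoRA scaling factor. The sketch below assumes standard LoRA scaling (lora_alpha / r, not the rank-stabilized variant) and a single GPU; neither is stated in this card, and the dictionary names are illustrative only.

```python
# Hyperparameters copied from the training summary.
lora = {"r": 8, "lora_alpha": 16, "lora_dropout": 0.0}
sft = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 2,
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
}

num_devices = 1  # assumption: single-GPU training

# Number of samples contributing to each optimizer step.
effective_batch_size = (
    sft["per_device_train_batch_size"]
    * sft["gradient_accumulation_steps"]
    * num_devices
)

# Standard LoRA multiplies the low-rank update by lora_alpha / r.
lora_scale = lora["lora_alpha"] / lora["r"]

print(f"effective batch size: {effective_batch_size}")
print(f"LoRA scaling factor: {lora_scale}")
```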
Evaluation
Companion scripts provide evaluation pipelines:
- WER/CER on your dataset (logs to TSV with detailed samples)
- KazMMLU multiple‑choice benchmark (Kazakh/Russian subsets)
Public benchmark scores will be added in upcoming releases.
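For reference, WER and CER are edit distances normalized by the reference length, computed over words and characters respectively. The following is a minimal pure-Python sketch of those definitions, not the companion scripts' implementation. It applies no text normalization (casing, punctuation), and CER here counts whitespace as characters, which is one of several conventions.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference item
                curr[j - 1] + 1,         # insertion of a hypothesis item
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: char-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Corpus-level scores are usually computed by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.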
Limitations & Safety
- The model can generate inaccurate or biased outputs; human review is advised.
- For audio inputs, ensure user privacy and consent.
- Quality depends on recording conditions (noise, accent, speed).
Acknowledgements
- Google Gemma 3N; Hugging Face transformers/datasets/trl
- Unsloth (efficient fine‑tuning & quantization)
- KazMMLU community
- ait🍅maton - https://huggingface.co/aitomaton
Citations
If this project or its results are useful, please cite the KSC2 dataset paper and Gemma 3n:
@inproceedings{mussakhojayeva22_interspeech,
title = {KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus},
author = {Saida Mussakhojayeva and Yerbolat Khassanov and Huseyin {Atakan Varol}},
year = {2022},
booktitle = {Interspeech 2022},
pages = {1367--1371},
doi = {10.21437/Interspeech.2022-421},
issn = {2958-1796},
}
@article{gemma_3n_2025,
title={Gemma 3n},
url={https://ai.google.dev/gemma/docs/gemma-3n},
publisher={Google DeepMind},
author={Gemma Team},
year={2025}
}


