Distilled TunBERT

A distilled, efficient version of TunBERT for Tunisian Arabic. The model is faster and smaller than its teacher, and fully reproducible thanks to an open Tunisian corpus and a transparent distillation pipeline.


Model Details

Model Description

  • Developed by: Hamza Bouajila
  • Model type: Distilled BERT (student: distilbert-base-uncased)
  • Teacher model: TunBERT (frozen)
  • Language(s): Tunisian Arabic (Darija)
  • License: MIT (to be confirmed)
  • Finetuned from: distilbert-base-uncased
  • Status: Research prototype (not production-ready)

Model Sources

  • Repository: [GitHub Link]
  • Model weights: Hugging Face Hub (hamzabouajila/distilled_tunbert)
  • Paper (draft): Coming soon (arXiv)

Uses

Direct Use

  • Text classification in Tunisian Arabic (e.g., sentiment analysis, topic classification).
  • Research on knowledge distillation for low-resource languages.
  • Educational use in model efficiency, open corpus training, and reproducibility.

Downstream Use

  • Fine-tuning for NLP tasks in Tunisian Arabic: NER, sentiment, intent detection, etc. (see the sketch after this list).
  • Embedding-based applications (with caution — embeddings not aligned to teacher).
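
As referenced above, a minimal fine-tuning sketch for sentiment classification is given below. It assumes the checkpoint loads as a standard DistilBERT encoder; the dataset files, label count, and hyperparameters are placeholders, not the setup used for this model.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "hamzabouajila/distilled_tunbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# A fresh classification head is added on top of the distilled encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Hypothetical CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="distilled_tunbert_sentiment",
                         per_device_train_batch_size=16,
                         num_train_epochs=3)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"])
trainer.train()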

Out-of-Scope Use

  • Not suitable for semantic search or cross-model embedding alignment.
  • Not recommended for critical applications (e.g., healthcare, law) without further evaluation.

Bias, Risks, and Limitations

  • Bias: Model inherits cultural/linguistic biases present in the Tunisian corpus.

  • Limitations:

    • Embeddings show near-zero similarity with teacher (cosine ≈ 0.02) due to tokenizer mismatch and lack of hidden-state loss.
    • Teacher (TunBERT) itself may have limitations (training data not public).
  • Risk: Misuse in contexts requiring semantic alignment (e.g., search, embeddings).

Recommendations

  • Use for classification/logit-based tasks, not for embedding similarity.
  • Consider retraining with hidden-state alignment if embeddings are needed (see the sketch below).
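
To make the second recommendation concrete, the sketch below shows one possible alignment term that could be added to the logit loss. Because the tokenizer mismatch makes token-level alignment awkward, it aligns mean-pooled sentence embeddings; the 768-dimensional projection is an assumption about both models' hidden sizes.

import torch
import torch.nn.functional as F

# Hypothetical projection from the student's hidden size to the teacher's (both assumed 768).
projection = torch.nn.Linear(768, 768)

def mean_pool(hidden_states, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def embedding_alignment_loss(student_hidden, student_mask, teacher_hidden, teacher_mask):
    """Push the student's pooled sentence embedding toward the teacher's via a cosine loss."""
    s = projection(mean_pool(student_hidden, student_mask))
    t = mean_pool(teacher_hidden, teacher_mask)
    target = torch.ones(s.size(0), device=s.device)  # target cosine similarity of 1
    return F.cosine_embedding_loss(s, t, target)

This term would be added, with some weighting, to the existing KL loss during distillation.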

How to Get Started with the Model

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hamzabouajila/distilled_tunbert")
model = AutoModel.from_pretrained("hamzabouajila/distilled_tunbert")

text = "نحب النموذج هذا يخدم بسرعه"  # "I like this model, it works fast"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # outputs.last_hidden_state has shape (1, seq_len, 768)

Training Details

Training Data

  • Source: Curated open Tunisian Arabic corpus (public release).
  • Transparency: Fully documented and reproducible.

Training Procedure

  • Teacher: TunBERT (frozen)
  • Student: distilbert-base-uncased (English) + Tunisian tokenizer
  • Loss: KL-divergence on logits (no hidden-state loss); see the sketch below
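
The logit-only objective described above can be written as a temperature-scaled KL divergence. The sketch below is a generic formulation; the temperature value is an assumption, since the card only specifies KL on logits.

import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL divergence between teacher and student output distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperature settings.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (temperature ** 2)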

Training Hyperparameters

  • Precision: fp16 mixed precision
  • Optimizer: AdamW
  • Batch size / Epochs: [More Information Needed]
  • Learning rate: [More Information Needed]

Speeds, Sizes, Times

  • Parameters: 66M (vs 109M for teacher)
  • Avg inference: 0.058s (vs 0.106s → 1.83× faster)
  • Model size: 1.65× smaller (see the measurement sketch below)
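
The figures above can be checked with a simple measurement like the sketch below; hardware, batch size, and repetition count are assumptions, so absolute timings will differ.

import time
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "hamzabouajila/distilled_tunbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

# Parameter count (roughly 66M for the student).
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.1f}M")

# Average latency over repeated single-sentence forward passes.
inputs = tokenizer("نحب النموذج هذا يخدم بسرعه", return_tensors="pt")
with torch.no_grad():
    model(**inputs)  # warm-up
    start = time.perf_counter()
    for _ in range(50):
        model(**inputs)
    avg_time = (time.perf_counter() - start) / 50
print(f"Avg inference time: {avg_time:.3f}s")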

Evaluation

Testing Data, Factors & Metrics

  • Benchmark task: Tunisian Sentiment Analysis Corpus (TSAC)
  • Metrics: Perplexity, inference speed, parameter count, embedding cosine similarity (see the measurement sketch below)
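
The embedding cosine similarity can be measured between mean-pooled sentence embeddings of the teacher and the student, as in the sketch below. The teacher checkpoint id (tunis-ai/TunBERT) and the pooling choice are assumptions about the evaluation setup, and the teacher must load through AutoModel for this to run.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

student_id = "hamzabouajila/distilled_tunbert"
teacher_id = "tunis-ai/TunBERT"  # assumed teacher checkpoint

def sentence_embedding(model_id, text):
    """Mean-pooled last-layer embedding for a single sentence."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

text = "نحب النموذج هذا يخدم بسرعه"  # "I like this model, it works fast"
similarity = F.cosine_similarity(sentence_embedding(student_id, text),
                                 sentence_embedding(teacher_id, text))
print(f"Teacher-student cosine similarity: {similarity.item():.2f}")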

Results

Metric               | Original TunBERT | Distilled TunBERT | Notes
Perplexity           | 34838.7          | 4.26              | Strong LM performance; teacher likely uninitialized
Inference Time (s)   | 0.106            | 0.058             | 1.83× faster
Parameters           | 109M             | 66M               | 1.65× smaller
Embedding Similarity | —                | 0.02              | Measured against the teacher; near-zero due to tokenizer mismatch
Training Data        | Unknown          | Open corpus       | Fully reproducible

Summary

The distilled model is faster, lighter, and trained on open data. It performs competitively on classification tasks, but its embeddings should not be used for similarity-based applications.


Environmental Impact

  • Hardware: NVIDIA V100 (to be confirmed)
  • Training hours: [More Information Needed]
  • Cloud provider: [More Information Needed]
  • Carbon emitted: [More Information Needed] (can be estimated with the ML CO₂ Impact calculator)

Technical Specifications

Model Architecture and Objective

  • Architecture: DistilBERT
  • Objective: Knowledge Distillation (logit alignment only)

Compute Infrastructure

  • Hardware: [e.g., 1× NVIDIA V100 GPU]
  • Software: PyTorch + 🤗 Transformers

Citation

BibTeX:

@misc{bouajila2025distilledtunbert,
  title={Distilled TunBERT: Efficient Tunisian Arabic BERT via Knowledge Distillation},
  author={Bouajila Hamza},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/hamzabouajila/distilled_tunbert}}
}

Model Card Authors

  • Hamza Bouajila

Model Card Contact

  • [More Information Needed]