Distilled TunBERT
A distilled, efficient version of TunBERT for Tunisian Arabic. The model is faster, smaller, and fully reproducible thanks to an open Tunisian corpus and a transparent distillation pipeline.
Model Details
Model Description
- Developed by: Hamza Bouajila
- Model type: Distilled BERT (student: distilbert-base-uncased)
- Teacher model: TunBERT (frozen)
- Language(s): Tunisian Arabic (Darija)
- License: MIT (specify if different)
- Finetuned from: distilbert-base-uncased
- Status: Research prototype (not production-ready)
Model Sources
- Repository: [GitHub Link]
- Model weights: https://huggingface.co/hamzabouajila/distilled_tunbert
- Paper (draft): Coming soon (arXiv)
Uses
Direct Use
- Text classification in Tunisian Arabic (e.g., sentiment analysis, topic classification).
- Research on knowledge distillation for low-resource languages.
- Educational use in model efficiency, open corpus training, and reproducibility.
Downstream Use
- Fine-tuning for NLP tasks in Tunisian Arabic (NER, sentiment, intent detection, etc.); a fine-tuning sketch follows this list.
- Embedding-based applications (with caution — embeddings not aligned to teacher).
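As a hedged illustration of the fine-tuning use above, the sketch below attaches a freshly initialized classification head to the distilled encoder. The dataset, label count, and hyperparameters are placeholders, not values from the released pipeline.

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "hamzabouajila/distilled_tunbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 is a placeholder for a binary sentiment task; the head is newly initialized.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# `train_ds` / `eval_ds` are hypothetical datasets with "text" and "label" columns.
# train_ds = train_ds.map(tokenize, batched=True)
# eval_ds = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="distilled-tunbert-sentiment",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=2e-5)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()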
Out-of-Scope Use
- Not suitable for semantic search or cross-model embedding alignment.
- Not recommended for critical applications (e.g., healthcare, law) without further evaluation.
Bias, Risks, and Limitations
Bias: Model inherits cultural/linguistic biases present in the Tunisian corpus.
Limitations:
- Embeddings show near-zero similarity with the teacher (cosine ≈ 0.02) due to tokenizer mismatch and the lack of a hidden-state loss.
- Teacher (TunBERT) itself may have limitations (its training data is not public).
Risk: Misuse in contexts requiring semantic alignment (e.g., search, embeddings).
Recommendations
- Use for classification/logit-based tasks, not for embedding similarity.
- Consider retraining with hidden-state alignment if aligned embeddings are needed (see the sketch below).
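If aligned embeddings are needed, one possible retraining direction (not part of this release) is to add an embedding-alignment term on top of the logit loss. The sketch below is a minimal illustration, assuming 768-dimensional pooled vectors from both models; the projection layer and loss weight are hypothetical.

import torch.nn as nn
import torch.nn.functional as F

# Hypothetical bridge between teacher and student sentence vectors
# (both assumed 768-dimensional here).
proj = nn.Linear(768, 768)

def alignment_loss(student_cls, teacher_cls, alpha=1.0):
    projected = proj(teacher_cls.detach())             # teacher stays frozen
    cos = F.cosine_similarity(student_cls, projected, dim=-1)
    return alpha * (1.0 - cos).mean()                  # push cosine similarity toward 1

# total_loss = logit_kl_loss + alignment_loss(student_cls, teacher_cls)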
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hamzabouajila/distilled_tunbert")
model = AutoModel.from_pretrained("hamzabouajila/distilled_tunbert")

text = "نحب النموذج هذا يخدم بسرعه"  # "I like this model, it works fast"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) token representations
Training Details
Training Data
- Source: Curated open Tunisian Arabic corpus (public release).
- Transparency: Fully documented and reproducible.
Training Procedure
- Teacher: TunBERT (frozen)
- Student: distilbert-base-uncased (English) + Tunisian tokenizer
- Loss: KL-divergence on logits (no hidden-state loss)
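The logit-only objective corresponds to the standard temperature-scaled KL formulation sketched below; the temperature value is illustrative, and this is a summary of the loss rather than the exact released training code.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with the temperature, then match them with KL divergence.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (temperature ** 2)  # usual T^2 scaling to keep gradient magnitudes comparable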
Training Hyperparameters
- Precision: fp16 mixed precision
- Optimizer: AdamW
- Batch size / Epochs: [More Information Needed]
- Learning rate: [More Information Needed]
Speeds, Sizes, Times
- Parameters: 66M (vs 109M for teacher)
- Avg inference: 0.058 s vs 0.106 s for the teacher (1.83× faster; see the timing sketch below)
- Model size: 1.65× smaller
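A rough way to reproduce the latency figure is the timing loop below, run identically for teacher and student on the same hardware; the batch size, sequence content, and repetition count are placeholders.

import time
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hamzabouajila/distilled_tunbert")
model = AutoModel.from_pretrained("hamzabouajila/distilled_tunbert").eval()

# Placeholder batch of 8 identical Tunisian sentences.
inputs = tokenizer(["نحب النموذج هذا يخدم بسرعه"] * 8, return_tensors="pt", padding=True)

with torch.no_grad():
    model(**inputs)                                    # warm-up pass
    start = time.perf_counter()
    for _ in range(100):
        model(**inputs)
    avg = (time.perf_counter() - start) / 100

print(f"average inference time per batch: {avg:.3f}s")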
Evaluation
Testing Data, Factors & Metrics
- Benchmark task: Tunisian Sentiment Analysis Corpus (TSAC)
- Metrics: Perplexity, inference speed, parameter count, embedding cosine similarity
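The embedding-similarity number was presumably obtained along the lines of the sketch below: mean-pool each model's last hidden state for the same sentence and compare with cosine similarity. Loading the teacher from tunis-ai/TunBERT via AutoModel is an assumption; the exact evaluation script may differ.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def sentence_embedding(model_id, text):
    tok = AutoTokenizer.from_pretrained(model_id)
    mdl = AutoModel.from_pretrained(model_id).eval()
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = mdl(**enc).last_hidden_state          # (1, seq_len, dim)
    return hidden.mean(dim=1)                          # mean-pool over tokens

text = "نحب النموذج هذا يخدم بسرعه"
student = sentence_embedding("hamzabouajila/distilled_tunbert", text)
teacher = sentence_embedding("tunis-ai/TunBERT", text)    # assumed to load with AutoModel
print(F.cosine_similarity(student, teacher).item())       # reported value is ≈ 0.02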
Results
| Metric | Original TunBERT | Distilled TunBERT | Notes |
|---|---|---|---|
| Perplexity | 34838.7 | 4.26 | Student shows strong LM performance; the teacher's very high value suggests its LM head was uninitialized. |
| Inference Time (s) | 0.106 | 0.058 | 1.83× faster |
| Parameters | 109M | 66M | 1.65× smaller |
| Embedding Similarity | — | 0.02 | Near-zero due to tokenizer mismatch |
| Training Data | Unknown | Open corpus | Fully reproducible |
Summary
The distilled model is faster, lighter, and trained on open data. It performs competitively on classification tasks, but its embeddings should not be used for similarity-based applications.
Environmental Impact
- Hardware: NVIDIA V100 (specify if different)
- Training hours: [More Information Needed]
- Cloud provider: [More Information Needed]
- Carbon emitted: Estimated via ML CO₂ Impact Calculator
Technical Specifications
Model Architecture and Objective
- Architecture: DistilBERT
- Objective: Knowledge Distillation (logit alignment only)
Compute Infrastructure
- Hardware: [e.g., 1× NVIDIA V100 GPU]
- Software: PyTorch + 🤗 Transformers
Citation
BibTeX:
@misc{bouajila2025distilledtunbert,
title={Distilled TunBERT: Efficient Tunisian Arabic BERT via Knowledge Distillation},
author={Bouajila, Hamza},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/hamzabouajila/distilled_tunbert}}
}
Model Card Authors
- Hamza Bouajila
Model Card Contact
- Email: bouajilahamza@outlook.com
- LinkedIn: https://www.linkedin.com/in/hamzabouajila
Model Tree
- Base model: tunis-ai/TunBERT