Distilled TunBERT

A distilled, efficient version of TunBERT for Tunisian Arabic. The model is faster and smaller than its teacher, and fully reproducible thanks to an open Tunisian corpus and a transparent distillation pipeline.


Model Details

Model Description

  • Developed by: Hamza Bouajila
  • Model type: Distilled BERT (student: distilbert-base-uncased)
  • Teacher model: TunBERT (frozen)
  • Language(s): Tunisian Arabic (Darija)
  • License: MIT (to be confirmed)
  • Finetuned from: distilbert-base-uncased
  • Status: Research prototype (not production-ready)

Model Sources

  • Repository: [GitHub Link]
  • Model weights: Hugging Face Hub (hamzabouajila/distilled_tunbert)
  • Paper (draft): Coming soon (arXiv)

Uses

Direct Use

  • Text classification in Tunisian Arabic (e.g., sentiment analysis, topic classification).
  • Research on knowledge distillation for low-resource languages.
  • Educational use in model efficiency, open corpus training, and reproducibility.

Downstream Use

  • Fine-tuning for NLP tasks in Tunisian Arabic: NER, sentiment, intent detection, etc. (see the sketch after this list).
  • Embedding-based applications (with caution — embeddings not aligned to teacher).
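
As referenced above, a minimal fine-tuning sketch for sentiment classification is given below. It assumes the checkpoint loads as a standard DistilBERT encoder; the dataset files, label count, and hyperparameters are placeholders, not the setup used for this model.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "hamzabouajila/distilled_tunbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# A fresh classification head is added on top of the distilled encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Hypothetical CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="distilled_tunbert_sentiment",
                         per_device_train_batch_size=16,
                         num_train_epochs=3)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"])
trainer.train()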

Out-of-Scope Use

  • Not suitable for semantic search or cross-model embedding alignment.
  • Not recommended for critical applications (e.g., healthcare, law) without further evaluation.

Bias, Risks, and Limitations

  • Bias: Model inherits cultural/linguistic biases present in the Tunisian corpus.

  • Limitations:

    • Embeddings show near-zero similarity with teacher (cosine ≈ 0.02) due to tokenizer mismatch and lack of hidden-state loss.
    • Teacher (TunBERT) itself may have limitations (training data not public).
  • Risk: Misuse in contexts requiring semantic alignment (e.g., search, embeddings).

Recommendations

  • Use for classification/logit-based tasks, not for embedding similarity.
  • Consider retraining with hidden-state alignment if embeddings are needed (see the sketch below).
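
To make the second recommendation concrete, the sketch below shows one possible alignment term that could be added to the logit loss. Because the tokenizer mismatch makes token-level alignment awkward, it aligns mean-pooled sentence embeddings; the 768-dimensional projection is an assumption about both models' hidden sizes.

import torch
import torch.nn.functional as F

# Hypothetical projection from the student's hidden size to the teacher's (both assumed 768).
projection = torch.nn.Linear(768, 768)

def mean_pool(hidden_states, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def embedding_alignment_loss(student_hidden, student_mask, teacher_hidden, teacher_mask):
    """Push the student's pooled sentence embedding toward the teacher's via a cosine loss."""
    s = projection(mean_pool(student_hidden, student_mask))
    t = mean_pool(teacher_hidden, teacher_mask)
    target = torch.ones(s.size(0), device=s.device)  # target cosine similarity of 1
    return F.cosine_embedding_loss(s, t, target)

This term would be added, with some weighting, to the existing KL loss during distillation.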

How to Get Started with the Model

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hamzabouajila/distilled_tunbert")
model = AutoModel.from_pretrained("hamzabouajila/distilled_tunbert")

text = "نحب النموذج هذا يخدم بسرعه"  # "I like this model, it works fast"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # outputs.last_hidden_state has shape (1, seq_len, 768)

Training Details

Training Data

  • Source: Curated open Tunisian Arabic corpus (public release).
  • Transparency: Fully documented and reproducible.

Training Procedure

  • Teacher: TunBERT (frozen)
  • Student: distilbert-base-uncased (English) + Tunisian tokenizer
  • Loss: KL-divergence on logits (no hidden-state loss); see the sketch below
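
The logit-only objective described above can be written as a temperature-scaled KL divergence. The sketch below is a generic formulation; the temperature value is an assumption, since the card only specifies KL on logits.

import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL divergence between teacher and student output distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperature settings.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (temperature ** 2)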

Training Hyperparameters

  • Precision: fp16 mixed precision
  • Optimizer: AdamW
  • Batch size / Epochs: [More Information Needed]
  • Learning rate: [More Information Needed]

Speeds, Sizes, Times

  • Parameters: 66M (vs 109M for teacher)
  • Avg inference: 0.058s (vs 0.106s → 1.83× faster)
  • Model size: 1.65× smaller (see the measurement sketch below)
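
The figures above can be checked with a simple measurement like the sketch below; hardware, batch size, and repetition count are assumptions, so absolute timings will differ.

import time
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "hamzabouajila/distilled_tunbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

# Parameter count (roughly 66M for the student).
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.1f}M")

# Average latency over repeated single-sentence forward passes.
inputs = tokenizer("نحب النموذج هذا يخدم بسرعه", return_tensors="pt")
with torch.no_grad():
    model(**inputs)  # warm-up
    start = time.perf_counter()
    for _ in range(50):
        model(**inputs)
    avg_time = (time.perf_counter() - start) / 50
print(f"Avg inference time: {avg_time:.3f}s")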

Evaluation

Testing Data, Factors & Metrics

  • Benchmark task: Tunisian Sentiment Analysis Corpus (TSAC)
  • Metrics: Perplexity, inference speed, parameter count, embedding cosine similarity (see the measurement sketch below)
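
The embedding cosine similarity can be measured between mean-pooled sentence embeddings of the teacher and the student, as in the sketch below. The teacher checkpoint id (tunis-ai/TunBERT) and the pooling choice are assumptions about the evaluation setup, and the teacher must load through AutoModel for this to run.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

student_id = "hamzabouajila/distilled_tunbert"
teacher_id = "tunis-ai/TunBERT"  # assumed teacher checkpoint

def sentence_embedding(model_id, text):
    """Mean-pooled last-layer embedding for a single sentence."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

text = "نحب النموذج هذا يخدم بسرعه"  # "I like this model, it works fast"
similarity = F.cosine_similarity(sentence_embedding(student_id, text),
                                 sentence_embedding(teacher_id, text))
print(f"Teacher-student cosine similarity: {similarity.item():.2f}")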

Results

Metric               | Original TunBERT | Distilled TunBERT | Notes
Perplexity           | 34838.7          | 4.26              | Strong LM performance; teacher likely uninitialized
Inference Time (s)   | 0.106            | 0.058             | 1.83× faster
Parameters           | 109M             | 66M               | 1.65× smaller
Embedding Similarity | —                | 0.02              | Measured against the teacher; near-zero due to tokenizer mismatch
Training Data        | Unknown          | Open corpus       | Fully reproducible

Summary

The distilled model is faster, lighter, and trained on open data. It performs competitively on classification tasks, but its embeddings should not be used for similarity-based applications.


Environmental Impact

  • Hardware: NVIDIA V100 (to be confirmed)
  • Training hours: [More Information Needed]
  • Cloud provider: [More Information Needed]
  • Carbon emitted: [More Information Needed] (can be estimated with the ML CO₂ Impact calculator)

Technical Specifications

Model Architecture and Objective

  • Architecture: DistilBERT
  • Objective: Knowledge Distillation (logit alignment only)

Compute Infrastructure

  • Hardware: [e.g., 1× NVIDIA V100 GPU]
  • Software: PyTorch + 🤗 Transformers

Citation

BibTeX:

@misc{bouajila2025distilledtunbert,
  title={Distilled TunBERT: Efficient Tunisian Arabic BERT via Knowledge Distillation},
  author={Bouajila Hamza},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/hamzabouajila/distilled_tunbert}}
}

Model Card Authors

  • Hamza Bouajila

Model Card Contact

  • [More Information Needed]