 - arabizi
 - morocco
 - bert
+---
+# Darija Toxicity Classifier 🇲🇦
+A transformer-based NLP model for detecting toxic content in Moroccan Darija and Arabizi.
+This model is designed for the linguistic complexity of the Moroccan dialect, including Arabizi (Arabic written in Latin characters, with digits standing in for some Arabic letters), such as:
+* `3` → ع
+* `7` → ح
+* `9` → ق
+It also supports code-switched text mixing Darija, Arabic, French, English, and Tamazight.
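The digit substitutions listed above can be written down as a small lookup table. This is only an illustration of the convention: the names `ARABIZI_DIGITS` and `arabizi_digits_in` are hypothetical, only the three pairs shown in this card are included, and the card does not list transliteration among the model's preprocessing steps.

```python
# Illustrative lookup of the digit-for-letter pairs named in this model card.
# Only these three pairs come from the card; anything else would be an assumption.
ARABIZI_DIGITS = {"3": "ع", "7": "ح", "9": "ق"}


def arabizi_digits_in(text):
    """Return (digit, Arabic letter) pairs for each mapped digit found in text."""
    return [(ch, ARABIZI_DIGITS[ch]) for ch in text if ch in ARABIZI_DIGITS]


print(arabizi_digits_in("3likom"))  # [('3', 'ع')]
print(arabizi_digits_in("sa7a"))    # [('7', 'ح')]
```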
+---
+## 📌 Model Overview
+| Property | Value |
+|----------|-------|
+| **Model ID** | `0khacha/darija-toxicity-classifier` |
+| **Architecture** | BERT, fine-tuned from `SI2M-Lab/DarijaBERT-arabizi` |
+| **Task** | Binary Sequence Classification (Safe / Toxic) |
+| **Framework** | Hugging Face Transformers |
+| **Training Data** | 16,000+ labeled Moroccan Darija/Arabizi samples |
+---
+## 🚀 Quick Inference (Transformers)
+```python
+from transformers import pipeline
+classifier = pipeline(
+    "text-classification",
+    model="0khacha/darija-toxicity-classifier"
+)
+result = classifier("salam khouya")
+print(result)
+# Example output (score is illustrative): [{'label': 'SAFE', 'score': 0.9845}]
+```
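For several messages at once, a text-classification pipeline can be called with a list of strings and returns one result per string. The helper below is a minimal sketch of that pattern; `classify_batch` is a hypothetical name, and it accepts any callable with the same list-in, list-out behavior, so it can be tried with a stub before downloading the model.

```python
def classify_batch(classifier, texts):
    """Run a text-classification callable over several texts at once.

    `classifier` is assumed to behave like a transformers pipeline:
    given a list of strings, it returns one {'label', 'score'} dict per string.
    """
    texts = list(texts)
    results = classifier(texts)
    return [
        {"text": text, "label": res["label"], "score": res["score"]}
        for text, res in zip(texts, results)
    ]


# With the real model (needs transformers installed and network access):
#   from transformers import pipeline
#   clf = pipeline("text-classification", model="0khacha/darija-toxicity-classifier")
#   rows = classify_batch(clf, ["salam khouya", "bghit nakol"])
```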
+---
+## 🧠 What Makes This Model Special?
+### 🌍 Dialect-Aware
+Built specifically for Moroccan linguistic patterns, not generic Arabic.
+### 🔢 Arabizi Handling
+Understands numeric character substitutions like:
+* `in3al`
+* `sa7a`
+* `3likom`
+### 🧹 Custom Preprocessing
+The training text was normalized with these steps:
+* Lowercasing
+* Removing dashes and underscores inserted between letters (`w-a-l-o` → `walo`)
+* Joining characters separated by spaces (`n 3 a l` → `n3al`)
+* Collapsing elongated letters (`heeeey` → `hey`)
+* Whitespace normalization
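The normalization steps above can be approximated with a few regular expressions. This is a minimal sketch, not the author's training code: the function name `normalize`, the patterns, the step order, and the thresholds (three or more spaced characters, three or more repeated letters) are all assumptions, and the patterns only cover Latin letters and digits.

```python
import re

# Single characters joined by dashes/underscores, e.g. "w-a-l-o".
_SEP_RUN = re.compile(r"(?<![a-z0-9])[a-z0-9](?:[-_][a-z0-9])+(?![a-z0-9])")
# Three or more single characters separated by one space, e.g. "n 3 a l".
_SPACED_RUN = re.compile(r"(?<![a-z0-9])[a-z0-9](?: [a-z0-9]){2,}(?![a-z0-9])")
# A letter repeated three or more times, e.g. the "eeee" in "heeeey".
_ELONGATION = re.compile(r"([a-z])\1{2,}")


def normalize(text: str) -> str:
    """Approximate the normalization steps listed in the model card."""
    text = text.lower()
    text = _SEP_RUN.sub(lambda m: re.sub(r"[-_]", "", m.group()), text)
    text = _SPACED_RUN.sub(lambda m: m.group().replace(" ", ""), text)
    text = _ELONGATION.sub(r"\1", text)   # collapse the run to one letter
    return re.sub(r"\s+", " ", text).strip()


print(normalize("W-A-L-O"))        # walo
print(normalize("n 3 a l"))        # n3al
print(normalize("Heeeey   sa7a"))  # hey sa7a
```

Ordinary hyphenated words such as `t-shirt` are left alone because only runs of single characters are joined. Hyphenated digit pairs such as `3-0` would also be merged by this sketch, which may or may not match the original preprocessing.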
+---
+## 📊 Performance
+| Metric | Score |
+|--------|-------|
+| **Accuracy** | ~94% |
+| **F1-Score** | ~93% |
+| **Inference Latency (GPU)** | ~50 ms |
+> **Note:** Performance may vary depending on hardware and deployment setup.
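For readers who want to reproduce metrics of this kind on their own labeled data, the sketch below shows how accuracy and F1 are defined for a binary task. It assumes F1 is computed for the toxic class and that the label strings are `SAFE` and `TOXIC`; the card does not state either, and the labels in the example are invented toy data, not the model's evaluation set.

```python
def accuracy_and_f1(y_true, y_pred, positive="TOXIC"):
    """Accuracy and F1 for the positive class, from two lists of labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy labels, invented for illustration only (not the model's evaluation data).
y_true = ["TOXIC", "SAFE", "TOXIC", "SAFE"]
y_pred = ["TOXIC", "SAFE", "SAFE", "SAFE"]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} f1={f1:.2f}")  # accuracy=0.75 f1=0.67
```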
+---
+## 📖 Example Predictions
+### Example 1: Safe Content
+**Input:**
+```python
+"bghit nakol"
+```
+**Output:**
+```text
+Safe (98.45%)
+```
+### Example 2: Toxic Content
+**Input:**
+```python
+"rak stupid"
+```
+**Output:**
+```text
+Toxic
+```
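The display form used in these examples can be derived from the pipeline's raw result dictionary. The helper below is a small sketch; `format_prediction` is a hypothetical name, and the `TOXIC` label string is an assumption (only `SAFE` appears in the sample output of this card).

```python
def format_prediction(pred):
    """Turn {'label': 'SAFE', 'score': 0.9845} into 'Safe (98.45%)'."""
    label = pred["label"].capitalize()
    return f"{label} ({pred['score']:.2%})"


print(format_prediction({"label": "SAFE", "score": 0.9845}))  # Safe (98.45%)
```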
+---
+## ⚠️ Limitations
+* May struggle with extremely rare slang
+* Accuracy may drop on context-dependent toxicity such as sarcasm
+* Not intended for legal use, or for automated moderation without human review
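One way to keep a human in the loop, given the limitations above, is to send every toxic prediction and every low-confidence prediction to a review queue. The sketch below assumes the `TOXIC` label string and uses an arbitrary 0.90 confidence threshold; `needs_human_review` and `min_confidence` are hypothetical names, and the threshold would need tuning on real data.

```python
def needs_human_review(pred, min_confidence=0.90):
    """Return True when a prediction should be checked by a person.

    Toxic predictions are always reviewed; safe predictions are reviewed
    only when the model's score falls below `min_confidence`.
    """
    is_toxic = pred["label"].upper() == "TOXIC"
    return is_toxic or pred["score"] < min_confidence


print(needs_human_review({"label": "SAFE", "score": 0.9845}))  # False
print(needs_human_review({"label": "SAFE", "score": 0.60}))    # True
print(needs_human_review({"label": "TOXIC", "score": 0.99}))   # True
```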
+---
+## 🔒 Dataset & Privacy
+The training dataset is not publicly available for privacy and ethical reasons.
+For research collaboration: 📩 [mohamedkhacha99@gmail.com](mailto:mohamedkhacha99@gmail.com)
+---
+## 📜 License
+MIT License
+---
+## 🙏 Acknowledgments
+* **DarijaBERT team** at SI2M-Lab
+* **Hugging Face** Transformers ecosystem
+* **PyTorch**
+* The **Moroccan NLP community**
+---
+## 📚 Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{darija-toxicity-classifier,
+  author = {Khacha, Mohamed},
+  title = {Darija Toxicity Classifier},
+  year = {2024},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/0khacha/darija-toxicity-classifier}
+}
+```
+---
+## 🤝 Contributing
+Contributions, issues, and feature requests are welcome!
+Feel free to check the [discussions page](https://huggingface.co/0khacha/darija-toxicity-classifier/discussions).
+---
+**Made with ❤️ for the Moroccan NLP community**