0khacha committed
Commit 91c6912 · verified · 1 Parent(s): 6164ac2

update README.md

Files changed (1)
  1. README.md +162 -1
README.md CHANGED
@@ -15,4 +15,165 @@ tags:
15
  - arabizi
16
  - morocco
17
  - bert
18
- ---
18
+ ---
19
+
20
+
21
+ # Darija Toxicity Classifier 🇲🇦
22
+
23
+ A transformer-based NLP model for detecting toxic content in Moroccan Darija and Arabizi.
24
+
25
+ This model is designed to handle the linguistic complexity of Moroccan Darija, including Arabizi (Arabic written in Latin characters, with digits standing in for some Arabic letters), such as:
26
+ * `3` → ع
27
+ * `7` → ح
28
+ * `9` → ق
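
As a rough illustration of the digit conventions listed above, the sketch below finds digits that are used as letters in Arabizi text. It covers only the three digits named in this README. The function name `arabizi_digits` and the "digit touching a Latin letter" heuristic are assumptions made for this example, not part of the model or its tokenizer.

```python
import re

# Digit-to-letter conventions listed in this README (only these three).
ARABIZI_DIGITS = {"3": "ع", "7": "ح", "9": "ق"}

# Heuristic (assumption): a digit counts as a letter when it touches a Latin letter.
_DIGIT_AS_LETTER = re.compile(r"(?<=[a-z])[379]|[379](?=[a-z])", re.IGNORECASE)


def arabizi_digits(text: str) -> list[tuple[str, str]]:
    """Return (digit, Arabic letter) pairs for digits used as letters in `text`."""
    return [
        (match.group(), ARABIZI_DIGITS[match.group()])
        for match in _DIGIT_AS_LETTER.finditer(text)
    ]


print(arabizi_digits("in3al sa7a 3likom"))
```

A standalone digit, as in "9 pm", is not reported because it does not touch a letter.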
29
+
30
+ It also supports code-switched text mixing Darija, Arabic, French, English, and Tamazight.
31
+
32
+ ---
33
+
34
+ ## 📌 Model Overview
35
+
36
+ | Property | Value |
37
+ |----------|-------|
38
+ | **Model ID** | `0khacha/darija-toxicity-classifier` |
39
+ | **Architecture** | Fine-tuned from `SI2M-Lab/DarijaBERT-arabizi` |
40
+ | **Task** | Binary Sequence Classification (Safe / Toxic) |
41
+ | **Framework** | Hugging Face Transformers |
42
+ | **Training Data** | 16,000+ labeled Moroccan Darija/Arabizi samples |
43
+
44
+ ---
45
+
46
+ ## 🚀 Quick Inference (Transformers)
47
+
48
+ ```python
49
+ from transformers import pipeline
50
+
51
+ classifier = pipeline(
52
+     "text-classification",
53
+     model="0khacha/darija-toxicity-classifier"
54
+ )
55
+
56
+ result = classifier("salam khouya")
57
+ print(result)
58
+ # Output: [{'label': 'SAFE', 'score': 0.9845}]
59
+ ```
60
+
61
+ ---
62
+
63
+ ## 🧠 What Makes This Model Special?
64
+
65
+ ### 🌍 Dialect-Aware
66
+ Built specifically for Moroccan linguistic patterns, not generic Arabic.
67
+
68
+ ### 🔢 Arabizi Handling
69
+ Understands numeric character substitutions like:
70
+ * `in3al`
71
+ * `sa7a`
72
+ * `3likom`
73
+
74
+ ### 🧹 Custom Preprocessing
75
+ The model was trained with specialized normalization:
76
+ * Lowercasing
77
+ * Removing dash/underscore splitting (`w-a-l-o` → `walo`)
78
+ * Fixing spaced characters (`n 3 a l` → `n3al`)
79
+ * Reducing elongation (`heeeey` → `hey`)
80
+ * Whitespace normalization
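
The normalization steps above can be approximated as follows. This is a minimal sketch and not the author's actual training code: the function name `normalize`, the regular expressions, the "three or more" thresholds, and the order of the steps are all assumptions chosen so that the README's own examples come out as shown.

```python
import re

ALNUM = r"[^\W_]"  # one letter or digit (underscore excluded)


def normalize(text: str) -> str:
    """Apply the listed normalization steps in the order the README gives them."""
    text = text.lower()
    # w-a-l-o -> walo: only runs of single characters, so "well-known" is untouched.
    text = re.sub(
        rf"(?<!{ALNUM}){ALNUM}(?:[-_]{ALNUM})+(?!{ALNUM})",
        lambda m: re.sub(r"[-_]", "", m.group()),
        text,
    )
    # n 3 a l -> n3al: three or more single characters separated by whitespace.
    text = re.sub(
        rf"(?<!\S){ALNUM}(?:\s+{ALNUM}){{2,}}(?!\S)",
        lambda m: re.sub(r"\s+", "", m.group()),
        text,
    )
    # heeeey -> hey: collapse a letter repeated three or more times.
    text = re.sub(r"([^\W\d_])\1{2,}", r"\1", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


for sample in ["W-A-L-O", "n 3 a l", "heeeey", "  salam   khouya "]:
    print(repr(sample), "->", repr(normalize(sample)))
```

If the model was trained on normalized text, applying a comparable step before inference is likely to help, but the exact rules used in training are not published here.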
81
+
82
+ ---
83
+
84
+ ## 📊 Performance
85
+
86
+ | Metric | Score |
87
+ |--------|-------|
88
+ | **Accuracy** | ~94% |
89
+ | **F1-Score** | ~93% |
90
+ | **Inference Latency (GPU)** | ~50 ms |
91
+
92
+ > **Note:** Performance may vary depending on hardware and deployment setup.
93
+
94
+ ---
95
+
96
+ ## 📖 Example Predictions
97
+
98
+ ### Example 1: Safe Content
99
+
100
+ **Input:**
101
+ ```python
102
+ "bghit nakol"
103
+ ```
104
+
105
+ **Output:**
106
+ ```text
107
+ Safe (98.45%)
108
+ ```
109
+
110
+ ### Example 2: Toxic Content
111
+
112
+ **Input:**
113
+ ```python
114
+ "rak stupid"
115
+ ```
116
+
117
+ **Output:**
118
+ ```text
119
+ Toxic
120
+ ```
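
The outputs above use a `Label (percent)` display, while the pipeline returns dictionaries such as `{'label': 'SAFE', 'score': 0.9845}`. The helper below is a small sketch of how one might convert between the two. The name `format_prediction` is hypothetical, and the `TOXIC` label is assumed by analogy with `SAFE`; check the model's `id2label` config for the actual label names.

```python
def format_prediction(prediction: dict) -> str:
    """Turn one pipeline result dict into a display string like 'Safe (98.45%)'."""
    label = prediction["label"].capitalize()  # 'SAFE' -> 'Safe'
    return f"{label} ({prediction['score']:.2%})"


# Example dict copied from the Quick Inference output in this README.
print(format_prediction({"label": "SAFE", "score": 0.9845}))
```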
121
+
122
+ ---
123
+
124
+ ## ⚠️ Limitations
125
+
126
+ * May struggle with extremely rare slang
127
+ * Context-dependent toxicity (e.g., sarcasm) may reduce accuracy
128
+ * Not intended for legal decisions or fully automated moderation without human review
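
One way to keep a human in the loop, in line with the last limitation, is to let only confident `SAFE` predictions through and queue everything else for review. This is a sketch under stated assumptions: the function name `route`, the `0.9` threshold, and the `"allow"`/`"review"` outcomes are hypothetical and would need tuning against real moderation data.

```python
def route(prediction: dict, threshold: float = 0.9) -> str:
    """Return 'allow' for confident SAFE predictions, otherwise 'review'.

    Nothing is removed automatically: toxic or uncertain items go to a human.
    """
    if prediction["label"] == "SAFE" and prediction["score"] >= threshold:
        return "allow"
    return "review"


print(route({"label": "SAFE", "score": 0.9845}))
print(route({"label": "TOXIC", "score": 0.99}))
```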
129
+
130
+ ---
131
+
132
+ ## 🔒 Dataset & Privacy
133
+
134
+ The training dataset is not publicly available for privacy and ethical reasons.
135
+
136
+ For research collaboration: 📩 [mohamedkhacha99@gmail.com](mailto:mohamedkhacha99@gmail.com)
137
+
138
+ ---
139
+
140
+ ## 📜 License
141
+
142
+ MIT License
143
+
144
+ ---
145
+
146
+ ## 🙏 Acknowledgments
147
+
148
+ * **DarijaBERT team** at SI2M-Lab
149
+ * **Hugging Face** Transformers ecosystem
150
+ * **PyTorch**
151
+ * The **Moroccan NLP community**
152
+
153
+ ---
154
+
155
+ ## 📚 Citation
156
+
157
+ If you use this model in your research, please cite:
158
+
159
+ ```bibtex
160
+ @misc{darija-toxicity-classifier,
161
+ author = {Khacha, Mohamed},
162
+ title = {Darija Toxicity Classifier},
163
+ year = {2024},
164
+ publisher = {Hugging Face},
165
+ url = {https://huggingface.co/0khacha/darija-toxicity-classifier}
166
+ }
167
+ ```
168
+
169
+ ---
170
+
171
+ ## 🤝 Contributing
172
+
173
+ Contributions, issues, and feature requests are welcome!
174
+
175
+ Feel free to check the [discussions page](https://huggingface.co/0khacha/darija-toxicity-classifier/discussions).
176
+
177
+ ---
178
+
179
+ **Made with ❤️ for the Moroccan NLP community**