---
license: cc-by-nc-4.0
language:
- ar
pipeline_tag: token-classification
datasets:
- guymorlan/levanti
- community-datasets/tashkeela
---

# Levanti Diacritizer

This model adds diacritics to raw text in Palestinian colloquial Arabic.
The model is trained on a special subset of the Levanti dataset (to be released later).
The model is fine-tuned from the [TavBERT-ar](https://huggingface.co/tau/tavbert-ar) character-level encoder LM, with a multi-label token classification head.
TavBERT-ar is first pre-trained on the Tashkeela dataset of diacritized classical Arabic text (after removing final diacritics from the text) and then trained for an additional 8 epochs on the diacritized subset of the Levanti dataset.
Each token (letter) of the input is classified into 5 positive categories: Shadda, Fatha, Kasra, Damma and Sukun. A multi-label model is used since a Shadda can accompany other diacritical marks.
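
To make the multi-label idea concrete, here is a minimal sketch in plain Python. It assumes the label order used in the usage example on this card and uses made-up probabilities rather than real model output; the helper name `decode_multilabel` is hypothetical. Each label is thresholded independently, so a Shadda can fire together with a vowel mark on the same letter.

```python
# Label order follows the mapping in this card's usage example (assumed here).
LABELS = ["SHADDA", "FATHA", "KASRA", "DAMMA", "SUKUN"]


def decode_multilabel(probs, threshold=0.5):
    """For each character, return the names of every label above the threshold."""
    decoded = []
    for char_probs in probs:
        active = [name for name, p in zip(LABELS, char_probs) if p > threshold]
        decoded.append(active)
    return decoded


# Illustrative per-character probabilities (not produced by the model).
toy_probs = [
    [0.02, 0.10, 0.91, 0.03, 0.01],  # only KASRA passes the threshold
    [0.97, 0.05, 0.88, 0.02, 0.01],  # SHADDA and KASRA pass together
    [0.01, 0.04, 0.02, 0.03, 0.20],  # nothing passes: the letter stays bare
]

result = decode_multilabel(toy_probs)
print(result)
```

A single-label (softmax) head could not express the second row, which is why independent per-label decisions are used.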

# Transliterator

This model can be used in conjunction with [Levanti Transliterator](https://huggingface.co/guymorlan/levanti_diacritics2translit/), which transliterates diacritized text in Palestinian Arabic.

# Example Usage

```python
import torch
from transformers import RobertaForTokenClassification, AutoTokenizer

model = RobertaForTokenClassification.from_pretrained("guymorlan/levanti_arabic2diacritics")
tokenizer = AutoTokenizer.from_pretrained("guymorlan/levanti_arabic2diacritics")

# Label index -> combining diacritic, written as Unicode escapes
label2diacritic = {0: '\u0651',  # SHADDA
                   1: '\u064E',  # FATHA
                   2: '\u0650',  # KASRA
                   3: '\u064F',  # DAMMA
                   4: '\u0652'}  # SUKUN


def arabic2diacritics(text, model, tokenizer):
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**tokens).logits
    # Threshold each label independently; drop predictions for BOS and EOS
    preds = (logits.sigmoid() > 0.5)[0][1:-1]
    new_text = []
    # Relies on one token per input character, so preds line up with text
    for p, c in zip(preds, text):
        new_text.append(c)
        # Vowel marks and sukun first
        for i in range(1, 5):
            if p[i]:
                new_text.append(label2diacritic[i])
        # check shadda last
        if p[0]:
            new_text.append(label2diacritic[0])

    new_text = "".join(new_text)
    return new_text


text = "بديش اروح عالمدرسة بكرا"
arabic2diacritics(text, model, tokenizer)
```

```
Out[1]: 'بِدِّيْش اَرُوْح عَالْمَدْرَسِة بُكْرَا'
```
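
The card describes the input as raw text. If your text already carries some diacritics, one option is to strip them before calling the model. The sketch below is an assumption, not part of the model repository: the helper name `strip_diacritics` is hypothetical, and it removes the Unicode range U+064B to U+0652, which covers the tanween marks in addition to the five marks the model predicts.

```python
import re

# Assumption: U+064B..U+0652 covers fathatan, dammatan, kasratan,
# fatha, damma, kasra, shadda and sukun.
_DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Return the text with Arabic combining diacritics removed."""
    return _DIACRITICS.sub("", text)


# BEH + KASRA + DAL + SHADDA -> BEH + DAL
bare = strip_diacritics("\u0628\u0650\u062F\u0651")
print(len(bare))
```

Whether the model tolerates partially diacritized input is not stated on this card, so stripping first is the conservative choice.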

# Attribution

Created by Guy Mor-Lan.<br>
Contact: guy.mor AT mail.huji.ac.il