---
license: cc-by-nc-4.0
language:
- ar
pipeline_tag: token-classification
datasets:
- guymorlan/levanti
- community-datasets/tashkeela
---
# Levanti Diacritizer
This model adds diacritics to raw text in Palestinian colloquial Arabic.
The model is trained on a special subset of the Levanti dataset (to be released later).
The model is fine-tuned from the [TavBERT-ar](https://huggingface.co/tau/tavbert-ar) character-level encoder language model, with a multi-label token classification head.
TavBERT-ar is first pre-trained on the Tashkeela dataset of diacritized classical Arabic text (after removing final diacritics from the text) and then trained for an additional 8 epochs on the diacritized subset of the Levanti dataset.
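
The removal of final diacritics mentioned above could look roughly like the sketch below. It is an illustration only: the exact preprocessing is not described here, so the function name `strip_final_diacritics`, the reading of "final" as "on the last letter of each word", and the decision to drop every trailing mark (including a final Shadda) are all assumptions.

```python
import re

# Arabic diacritics: tanwin (U+064B-U+064D), then fatha, damma, kasra,
# shadda and sukun (U+064E-U+0652). One or more of them at the end of a word.
TRAILING_MARKS = re.compile("[\u064B-\u0652]+$")

def strip_final_diacritics(text: str) -> str:
    """Drop diacritics attached to the last letter of each whitespace-separated word.

    Hypothetical helper: assumes words are separated by whitespace and that
    no punctuation is glued to the end of a word.
    """
    return " ".join(TRAILING_MARKS.sub("", word) for word in text.split())

# The word "kitaabun": the final tanwin (U+064C) is removed,
# while the diacritics inside the word are kept.
example = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"
print(strip_final_diacritics(example))
```

Word-final marks in classical Arabic mostly encode case endings, which is one plausible reason to drop them before moving to colloquial text.
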
Each token (letter) of the input is classified into five positive categories: Shadda, Fatha, Kasra, Damma and Sukun. A multi-label model is used because a Shadda can accompany other diacritical marks.
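
To illustrate why a multi-label head fits this task, the sketch below decodes one character's scores with an independent sigmoid per label, so a Shadda and a vowel can both be selected for the same letter. The logit values are made up for illustration and are not model output; the helper `decode_labels` and the 0.5 threshold are assumptions that mirror the usage example in this card.

```python
import math

# Label order follows the label map used in this card's usage example.
LABELS = ["SHADDA", "FATHA", "KASRA", "DAMMA", "SUKUN"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decode_labels(logits, threshold=0.5):
    """Return every label whose independent sigmoid probability exceeds the threshold."""
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) > threshold]

# Hypothetical logits for two single characters (illustrative values only).
geminated_with_kasra = [2.1, -3.0, 1.4, -2.5, -4.0]
no_diacritic = [-2.0, -3.1, -1.7, -2.2, -0.9]

print(decode_labels(geminated_with_kasra))  # ['SHADDA', 'KASRA']
print(decode_labels(no_diacritic))          # []
```

A softmax over the same labels would force exactly one choice per letter, which could represent neither the Shadda-plus-vowel case nor a letter with no diacritic at all.
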
# Transliterator
This model can be used in conjunction with [Levanti Transliterator](https://huggingface.co/guymorlan/levanti_diacritics2translit/), which transliterates diacritized text in Palestinian Arabic.
# Example Usage
```python
import torch
from transformers import RobertaForTokenClassification, AutoTokenizer

model = RobertaForTokenClassification.from_pretrained("guymorlan/levanti_arabic2diacritics")
tokenizer = AutoTokenizer.from_pretrained("guymorlan/levanti_arabic2diacritics")

# Map each label index to its combining diacritic (written as Unicode escapes).
label2diacritic = {0: '\u0651',  # SHADDA
                   1: '\u064E',  # FATHA
                   2: '\u0650',  # KASRA
                   3: '\u064F',  # DAMMA
                   4: '\u0652'}  # SUKUN

def arabic2diacritics(text, model, tokenizer):
    # Character-level tokenization: predictions are aligned with the characters of `text`.
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**tokens).logits
    preds = (logits.sigmoid() > 0.5)[0][1:-1]  # remove preds for BOS and EOS
    new_text = []
    for p, c in zip(preds, text):
        new_text.append(c)
        for i in range(1, 5):
            if p[i]:
                new_text.append(label2diacritic[i])
        # check shadda last
        if p[0]:
            new_text.append(label2diacritic[0])
    new_text = "".join(new_text)
    return new_text

text = "بديش اروح عالمدرسة بكرا"
arabic2diacritics(text, model, tokenizer)
```
```
Out[1]: 'ุจูุฏููููุด ุงูุฑูููุญ ุนูุงููู
ูุฏูุฑูุณูุฉ ุจูููุฑูุง'
```
# Attribution
Created by Guy Mor-Lan.<br>
Contact: guy.mor AT mail.huji.ac.il