---
license: cc-by-nc-4.0
language:
- ar
pipeline_tag: token-classification
datasets:
- guymorlan/levanti
- community-datasets/tashkeela
---

# Levanti Diacritizer

This model adds diacritics to raw text in colloquial Palestinian Arabic.
It is trained on a dedicated subset of the Levanti dataset (to be released later).
The model is fine-tuned from the [TavBERT-ar](https://huggingface.co/tau/tavbert-ar) character-level encoder LM with a multi-label token classification head.
TavBERT-ar is first trained on the Tashkeela dataset of diacritized Classical Arabic text (after removing final diacritics) and then trained for an additional 8 epochs on the diacritized subset of the Levanti dataset.
Each token (letter) of the input is classified into five positive categories: Shadda, Fatha, Kasra, Damma, and Sukun. A multi-label model is used because a Shadda can co-occur with other diacritical marks.
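
To make the multi-label setup concrete, here is a minimal sketch with made-up logits; the label order (0 = Shadda, 1 = Fatha, 2 = Kasra, 3 = Damma, 4 = Sukun) follows the `label2diacritic` mapping in the usage example below.

```python
import torch

# One character's logits over the 5 diacritic classes (made-up values).
# Each class gets an independent sigmoid, so several diacritics can be
# active at once, e.g. Shadda together with Fatha.
logits = torch.tensor([2.1, 1.3, -3.0, -2.2, -4.0])
active = torch.sigmoid(logits) > 0.5
print(active)  # tensor([ True,  True, False, False, False]) -> Shadda + Fatha
```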

# Transliterator
This model can be used in conjunction with the [Levanti Transliterator](https://huggingface.co/guymorlan/levanti_diacritics2translit/), which transliterates diacritized Palestinian Arabic text (see the sketch after the usage example below).

# Example Usage

```python
import torch
from transformers import RobertaForTokenClassification, AutoTokenizer
model = RobertaForTokenClassification.from_pretrained("guymorlan/levanti_arabic2diacritics")
tokenizer = AutoTokenizer.from_pretrained("guymorlan/levanti_arabic2diacritics")

label2diacritic = {0: 'ู‘', # SHADDA
                   1: 'ูŽ', # FATHA
                   2: 'ู', # KASRA
                   3: 'ู', # DAMMA
                   4: 'ู’'} # SUKUN


def arabic2diacritics(text, model, tokenizer):
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only; no gradients needed
        preds = (model(**tokens).logits.sigmoid() > 0.5)[0][1:-1]  # drop predictions for BOS and EOS tokens
    new_text = []
    for p, c in zip(preds, text):
        new_text.append(c)
        for i in range(1, 5):
            if p[i]:
                new_text.append(label2diacritic[i])
        # check shadda last
        if p[0]:
            new_text.append(label2diacritic[0])
        
    new_text = "".join(new_text)
    return new_text

text = "ุจุฏูŠุด ุงุฑูˆุญ ุนุงู„ู…ุฏุฑุณุฉ ุจูƒุฑุง"
arabic2diacritics(text, model, tokenizer)
```
```
Out[1]: 'ุจูุฏูู‘ูŠู’ุด ุงู’ุฑููˆู’ุญ ุนูŽุงู„ู’ู…ูŽุฏู’ุฑูŽุณูุฉ ุจููƒู’ุฑูŽุง'
```
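
To chain the diacritizer with the transliterator mentioned above, the diacritized string is simply passed on as input to the second model. The snippet below only sketches the hand-off; the transliterator's own model card documents its exact usage.

```python
# Chain: raw Arabic -> diacritized Arabic -> Latin transliteration.
# `arabic2diacritics`, `model`, `tokenizer` and `text` are defined above.
diacritized = arabic2diacritics(text, model, tokenizer)
# Feed `diacritized` to guymorlan/levanti_diacritics2translit; see that
# model's card for how it expects to be called.
```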

# Attribution
Created by Guy Mor-Lan.<br>
Contact: guy.mor AT mail.huji.ac.il