guymorlan committed on
Commit
4eafff8
•
1 Parent(s): bb0243e

Update README.md

  1. README.md +54 -3
README.md CHANGED
@@ -1,3 +1,54 @@
- ---
- license: cc-by-nc-4.0
- ---
+ ---
+ license: cc-by-nc-4.0
+ language:
+ - ar
+ pipeline_tag: token-classification
+ datasets:
+ - guymorlan/levanti
+ - community-datasets/tashkeela
+ ---
+
+ # Levanti Diacritizer
+
+ This model adds diacritics to raw text in Palestinian colloquial Arabic.
+ It is trained on a special subset of the Levanti dataset (to be released later).
+ The model is fine-tuned from Google's [CANINE-s](https://huggingface.co/google/canine-s) character-level LM with a multi-label token classification head.
+ CANINE-s is first pre-trained on the Tashkeela dataset of diacritized classical Arabic text (after removing final diacritics from the text) and then trained for an additional 5 epochs on the diacritized subset of the Levanti dataset.
+ Each token (letter) of the input is classified into 5 positive categories: Shadda, Fatha, Kasra, Damma and Sukun (see `model.config.id2label`). A multi-label model is used because a Shadda can accompany other diacritical marks.
+
+ # Transliterator
+ This model can be used in conjunction with [Levanti Transliterator](https://huggingface.co/guymorlan/levanti_diacritics2translit/), which transliterates diacritized text in Palestinian Arabic.
+
+ # Example Usage
+
+ ```python
+ from transformers import CanineForTokenClassification, AutoTokenizer
+ model = CanineForTokenClassification.from_pretrained("guymorlan/levanti_arabic2diacritics")
+ tokenizer = AutoTokenizer.from_pretrained("guymorlan/levanti_arabic2diacritics")
+
+ # Label index -> diacritic, in the order listed above (see model.config.id2label)
+ label2diacritic = {
+     0: "\u0651",  # Shadda
+     1: "\u064e",  # Fatha
+     2: "\u0650",  # Kasra
+     3: "\u064f",  # Damma
+     4: "\u0652",  # Sukun
+ }
+
+ def arabic2diacritics(text, model, tokenizer):
+     tokens = tokenizer(text, return_tensors="pt")
+     # Multi-label decoding: each label is thresholded independently
+     preds = (model(**tokens).logits.sigmoid() > 0.5)[0]
+     new_text = []
+     # The tokenizer prepends a [CLS] token, so preds[i] describes text[i - 1];
+     # emitting the marks before text[i] places them right after that letter.
+     for p, c in zip(preds, text):
+         for i in range(1, 5):
+             if p[i]:
+                 new_text.append(label2diacritic[i])
+         # check shadda last
+         if p[0]:
+             new_text.append(label2diacritic[0])
+         new_text.append(c)
+     new_text = "".join(new_text)
+     return new_text
+
+ text = "بديش اروح عالمدرسة بكرا"
+ arabic2diacritics(text, model, tokenizer)
+ ```
+ ```
+ ```
+
+ # Attribution
+ Created by Guy Mor-Lan.<br>
+ Contact: guy.mor AT mail.huji.ac.il
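
The multi-label decoding step described in the README (each letter can carry a Shadda together with another mark) can be illustrated without downloading the model. The following is a minimal sketch, not the author's implementation: `decode_multilabel`, `LABEL2DIACRITIC`, and the toy logits are hypothetical names and made-up values. It assumes the label order given in the README's category list, and that `logits` holds one row of five scores per character, with any special-token rows already removed.

```python
import math

# Assumed label order (Shadda, Fatha, Kasra, Damma, Sukun), taken from the
# README's category list; the real mapping lives in model.config.id2label.
LABEL2DIACRITIC = {
    0: "\u0651",  # Shadda
    1: "\u064e",  # Fatha
    2: "\u0650",  # Kasra
    3: "\u064f",  # Damma
    4: "\u0652",  # Sukun
}


def sigmoid(x):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(text, logits, threshold=0.5):
    """Attach every diacritic whose probability exceeds the threshold.

    Hypothetical helper: `logits` is assumed to contain one row of five
    scores per character of `text`, aligned with the characters.
    """
    out = []
    for char, row in zip(text, logits):
        out.append(char)
        active = [i for i, score in enumerate(row) if sigmoid(score) > threshold]
        # Emit vowel marks and Sukun first, Shadda last, as in the README example.
        for i in active:
            if i != 0:
                out.append(LABEL2DIACRITIC[i])
        if 0 in active:
            out.append(LABEL2DIACRITIC[0])
    return "".join(out)


# Made-up logits for a two-letter string: the first letter gets a Fatha,
# the second gets both a Fatha and a Shadda.
toy_logits = [
    [-4.0, 3.0, -5.0, -5.0, -5.0],
    [2.5, 1.5, -5.0, -5.0, -5.0],
]
result = decode_multilabel("\u0628\u062f", toy_logits)
```

Because each label is thresholded independently, a single-label softmax head could not produce the second letter's output here, which is the reason the README gives for using a multi-label head.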