guymorlan/levanti_arabic2diacritics

@@ -12,9 +12,9 @@ datasets:
 This model adds diacritics to raw text in Palestinian colloquial Arabic.
 The model is trained on a special subset of the Levanti dataset (to be released later).
-The model is fine-tuned from Google's [CANINE-s](https://huggingface.co/google/canine-s) character level LM with a multi-label token classification head.
-CANINE-s is first pre-trained on the Tashkeela dataset of classical Arabic diacritized text (after removing final diacritics from the text) and then trained for an additional 5 epochs on the diacritized subset of the Levanti dataset.
-Each token (letter) of the input is classified into 6 positive categories: Shadda, Fatha, Kasra, Damma and Sukun (see `model.config.id2label`). A multi-label model is used since a Shadda can accompany other diacritical marks.
 # Transliterator
 This model can be used in conjunction with [Levanti Transliterator](https://huggingface.co/guymorlan/levanti_diacritics2translit/), which transliterates diacritized text in Palestinian Arabic.
@@ -22,15 +22,20 @@ This model can be used in conjunction with [Levanti Transliterator](https://hugg
 # Example Usage
 ```python
-from transformers import CanineForTokenClassification, AutoTokenizer
-model = CanineForTokenClassification.from_pretrained("guymorlan/levanti_arabic2diacritics")
 tokenizer = AutoTokenizer.from_pretrained("guymorlan/levanti_arabic2diacritics")
-label2diacritic = {0: 'ّ', 1: 'َ', 2: 'ِ', 3: 'ُ', 4: ''}
 def arabic2diacritics(text, model, tokenizer):
     tokens = tokenizer(text, return_tensors="pt")
-    preds = (model(**tokens).logits.sigmoid() > 0.5)[0][1:-1] # remove CLS and SEP
     new_text = []
     for p, c in zip(preds, text):
         new_text.append(c)
@@ -48,6 +53,7 @@ text = "بديش اروح عالمدرسة بكرا"
 arabic2diacritics(text, model, tokenizer)
 ```
 ```
 ```
 # Attribution

 This model adds diacritics to raw text in Palestinian colloquial Arabic.
 The model is trained on a special subset of the Levanti dataset (to be released later).
+The model is fine-tuned from the [TavBERT-ar](https://huggingface.co/tau/tavbert-ar) character-level encoder LM with a multi-label token classification head.
+TavBERT-ar is first pre-trained on the Tashkeela dataset of diacritized classical Arabic text (after removing final diacritics from the text) and then trained for an additional 8 epochs on the diacritized subset of the Levanti dataset.
+Each token (letter) of the input is classified into 5 positive categories: Shadda, Fatha, Kasra, Damma and Sukun. A multi-label model is used since a Shadda can accompany other diacritical marks.
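The multi-label decoding described above can be illustrated with a small, dependency-free sketch. It assumes the label order used in the usage example of this card (0 = Shadda, 1 = Fatha, 2 = Kasra, 3 = Damma, 4 = Sukun) and a 0.5 threshold on independent sigmoid scores. The function name `decode_multilabel` and the logits are hypothetical, chosen only for illustration, and the order in which marks are appended (by label id) is an assumption rather than a documented property of the model.

```python
import math

# Assumed label order, mirroring the usage example in this card.
LABEL2DIACRITIC = {
    0: "\u0651",  # SHADDA
    1: "\u064E",  # FATHA
    2: "\u0650",  # KASRA
    3: "\u064F",  # DAMMA
    4: "\u0652",  # SUKUN
}


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(text, logits, threshold=0.5):
    """Append to each letter every diacritic whose score exceeds the threshold.

    Each label is scored independently, so a letter can receive both a
    Shadda and a vowel mark, which a single-label softmax could not express.
    """
    out = []
    for char, char_logits in zip(text, logits):
        out.append(char)
        for label_id, logit in enumerate(char_logits):
            if sigmoid(logit) > threshold:
                out.append(LABEL2DIACRITIC[label_id])
    return "".join(out)


# Hypothetical logits for the two letters BEH and DAL:
# BEH gets Kasra only; DAL gets Shadda and Kasra together.
example_logits = [
    [-4.0, -4.0, 3.0, -4.0, -4.0],
    [3.0, -4.0, 3.0, -4.0, -4.0],
]
example = decode_multilabel("\u0628\u062F", example_logits)
```

With these made-up logits the second letter carries two marks at once, which is the case that motivates the multi-label head.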
 # Transliterator
 This model can be used in conjunction with [Levanti Transliterator](https://huggingface.co/guymorlan/levanti_diacritics2translit/), which transliterates diacritized text in Palestinian Arabic.
 # Example Usage
 ```python
+from transformers import RobertaForTokenClassification, AutoTokenizer
+model = RobertaForTokenClassification.from_pretrained("guymorlan/levanti_arabic2diacritics")
 tokenizer = AutoTokenizer.from_pretrained("guymorlan/levanti_arabic2diacritics")
+label2diacritic = {0: 'ّ', # SHADDA
+                   1: 'َ', # FATHA
+                   2: 'ِ', # KASRA
+                   3: 'ُ', # DAMMA
+                   4: 'ْ'} # SUKUN
 def arabic2diacritics(text, model, tokenizer):
     tokens = tokenizer(text, return_tensors="pt")
+    preds = (model(**tokens).logits.sigmoid() > 0.5)[0][1:-1] # remove preds for BOS and EOS
     new_text = []
     for p, c in zip(preds, text):
         new_text.append(c)
 arabic2diacritics(text, model, tokenizer)
 ```
 ```
+Out[1]: 'بِدِّيْش اْرُوْح عَالْمَدْرَسِة بُكْرَا'
 ```
 # Attribution