---
license: cc-by-nc-4.0
language:
- ar
pipeline_tag: token-classification
datasets:
- guymorlan/levanti
- community-datasets/tashkeela
---
# Levanti Diacritizer
This model adds diacritics to raw text in Palestinian colloquial Arabic. It is trained on a special subset of the Levanti dataset (to be released later) and is fine-tuned from Google's CANINE-s character-level language model with a multi-label token classification head. CANINE-s is first pre-trained on the Tashkeela dataset of diacritized Classical Arabic text (after removing final diacritics) and then trained for an additional 5 epochs on the diacritized subset of the Levanti dataset.
Each token (letter) of the input is classified into five positive categories: Shadda, Fatha, Kasra, Damma and Sukun (see `model.config.id2label`). A multi-label model is used since a Shadda can accompany other diacritical marks.
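The snippet below is a minimal sketch of why a multi-label head is needed; it is illustrative only (the label order, the toy targets, and the BCE loss are assumptions, not the authors' actual training code). It loads CANINE-s with a 5-label token classification head and builds a multi-hot target in which one letter carries both a Shadda and a Kasra:

```python
import torch
from transformers import CanineForTokenClassification, AutoTokenizer

# Assumed label order (check model.config.id2label):
# 0 = shadda, 1 = fatha, 2 = kasra, 3 = damma, 4 = sukun
model = CanineForTokenClassification.from_pretrained("google/canine-s", num_labels=5)
tokenizer = AutoTokenizer.from_pretrained("google/canine-s")

text = "بدي"                                   # toy input, 3 characters
tokens = tokenizer(text, return_tensors="pt")  # CANINE tokenizes per character
logits = model(**tokens).logits                # shape (1, 5, 5): [CLS] + 3 chars + [SEP], 5 labels

# Multi-hot targets: in "بِدِّي" the letter د carries both a Shadda and a Kasra,
# so two labels are positive for that character -- impossible with a softmax head.
targets = torch.zeros_like(logits)
targets[0, 2, 0] = 1.0  # shadda on د (sequence position 2, after [CLS])
targets[0, 2, 2] = 1.0  # kasra on the same letter

loss = torch.nn.BCEWithLogitsLoss()(logits, targets)  # one sigmoid per diacritic label
```

Actual inference is shown in the Example Usage section below.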
## Transliterator
This model can be used in conjunction with the Levanti Transliterator, which transliterates diacritized text in Palestinian Arabic (a chaining sketch follows the example below).
## Example Usage
```python
from transformers import CanineForTokenClassification, AutoTokenizer

model = CanineForTokenClassification.from_pretrained("guymorlan/levanti_arabic2diacritics")
tokenizer = AutoTokenizer.from_pretrained("guymorlan/levanti_arabic2diacritics")

# Diacritic characters keyed by label id (cf. model.config.id2label)
label2diacritic = {0: 'ّ',   # shadda
                   1: 'َ',   # fatha
                   2: 'ِ',   # kasra
                   3: 'ُ',   # damma
                   4: ''}    # sukun (not rendered in the output)

def arabic2diacritics(text, model, tokenizer):
    tokens = tokenizer(text, return_tensors="pt")
    # Independent sigmoid per label, thresholded at 0.5; drop [CLS] and [SEP] predictions
    preds = (model(**tokens).logits.sigmoid() > 0.5)[0][1:-1]
    new_text = []
    for p, c in zip(preds, text):
        new_text.append(c)
        # Append vowel marks first
        for i in range(1, 5):
            if p[i]:
                new_text.append(label2diacritic[i])
        # Check shadda last
        if p[0]:
            new_text.append(label2diacritic[0])
    new_text = "".join(new_text)
    return new_text

text = "بديش اروح عالمدرسة بكرا"  # "I don't want to go to school tomorrow"
arabic2diacritics(text, model, tokenizer)
```
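To chain the two models as described in the Transliterator section, the output of `arabic2diacritics` is passed to the transliterator. The `transliterate` function below is only a placeholder for the Levanti Transliterator's own inference code (see its model card), not something provided by this repository:

```python
# Sketch only: diacritize first, then transliterate.
def transliterate(diacritized_text):
    # Placeholder: replace with the Levanti Transliterator's inference code.
    raise NotImplementedError

diacritized = arabic2diacritics(text, model, tokenizer)  # diacritized Palestinian Arabic
latin = transliterate(diacritized)                       # Latin-script transliteration
```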
## Attribution
Created by Guy Mor-Lan.
Contact: guy.mor AT mail.huji.ac.il