---
language: 
- dza

tags:
- pytorch
- bert
- ar
- dz

license: apache-2.0

widget:
- text: " أنا من الجزائر من ولاية [MASK] "
- text: " ربي [MASK] خويا لعزيز"

inference: true
---



# Dzarabert


Dzarabert is a pruned version of [DziriBERT](https://huggingface.co/alger-ia/dziribert), the first Transformer-based language model pre-trained specifically for the Algerian dialect. This pruned version handles Algerian text written in Arabic script. It sets new state-of-the-art results on Algerian text classification datasets, even though it was pre-trained on much less data (~1 million tweets).

For more information, see the paper on the base model: https://arxiv.org/pdf/2109.12346.pdf.

## How to use

```python
from transformers import BertTokenizer, BertForMaskedLM

# Load the tokenizer and the masked-language-model weights from the Hugging Face Hub.
tokenizer = BertTokenizer.from_pretrained("Sifal/dzarabert")
model = BertForMaskedLM.from_pretrained("Sifal/dzarabert")
```
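
Once the model loads, one way to try masked-token prediction is the `fill-mask` pipeline from `transformers`. The following is a minimal sketch, not an official usage recipe: it assumes the model is available under the repository id `Sifal/dzarabert`, and the helper names `format_predictions` and `main` are illustrative only. The example sentence is taken from this card's widget metadata.

```python
def format_predictions(predictions):
    """Turn fill-mask pipeline output into (token, score) pairs, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked]


def main():
    # Imported here so the helper above works without transformers installed.
    from transformers import pipeline

    # Assumption: the model is published under the id "Sifal/dzarabert".
    fill_mask = pipeline("fill-mask", model="Sifal/dzarabert")

    # Example sentence taken from this card's widget metadata.
    predictions = fill_mask("أنا من الجزائر من ولاية [MASK]", top_k=5)
    for token, score in format_predictions(predictions):
        print(token, score)
```

Calling `main()` downloads the model on first use and prints the five highest-scoring candidates for the masked position.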

## Limitations

The pre-training data used for the base model comes from social media (Twitter). Therefore, the masked language modeling objective may predict offensive words in some situations. Modeling such words can be either an advantage (e.g. when training a hate speech model) or a disadvantage (e.g. when generating answers that are sent directly to the end user). Depending on your downstream task, you may need to filter out such words, especially when returning automatically generated text to the end user.
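
As a rough illustration of such filtering, the sketch below drops fill-mask predictions whose predicted token appears in a blocklist. It assumes predictions in the list-of-dicts shape returned by the `transformers` fill-mask pipeline (with `token_str` and `score` keys). The function name `filter_predictions` and the blocklist contents are hypothetical; a real blocklist would have to be supplied and maintained for your use case.

```python
def filter_predictions(predictions, blocklist):
    """Drop fill-mask predictions whose predicted token is in the blocklist."""
    blocked = {word.strip() for word in blocklist}
    return [p for p in predictions if p["token_str"].strip() not in blocked]


# Placeholder data in the shape produced by the fill-mask pipeline.
sample_predictions = [
    {"token_str": "word_a", "score": 0.61},
    {"token_str": "word_b", "score": 0.22},
    {"token_str": "word_c", "score": 0.09},
]

# Hypothetical blocklist; replace with terms relevant to your application.
safe = filter_predictions(sample_predictions, blocklist={"word_b"})
print([p["token_str"] for p in safe])
```

Exact token matching is a deliberately simple choice: it will miss offensive words that the tokenizer splits into several subword pieces, so it is better treated as a first line of defense than as a complete safeguard.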