Fix typos in README
README.md CHANGED
@@ -24,9 +24,9 @@ The model could only be trained for about `10%` of the whole dataset due to time

## Preprocessing and the tokenizer

-We tried to keep the preprocessing to
+We tried to keep the preprocessing to a bare minimum. We only replaced URLs, emails and social media user mentions with fixed tokens.

-Contrary to other pretrained Arabic LMs, we decided to not strip the Arabic diacritics and to keep them
+Contrary to other pretrained Arabic LMs, we decided to not strip the Arabic diacritics and to keep them part of the vocabulary.

The tokenizer was trained on `5%` of the training set, with a vocabulary size of `64'000`.
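
For illustration, here is a minimal sketch of the pipeline the updated README describes: replacing URLs, emails and user mentions with fixed tokens, keeping diacritics, and training a tokenizer with a `64'000` vocabulary on a `5%` sample. It assumes the Hugging Face `tokenizers` library and a BPE model; the placeholder strings (`<url>`, `<email>`, `<user>`), the file name `train.txt`, and the choice of BPE are all hypothetical, since the README does not specify them.

```python
import random
import re

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical placeholder strings; the README does not name the exact tokens.
URL_TOKEN, EMAIL_TOKEN, USER_TOKEN = "<url>", "<email>", "<user>"

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
USER_RE = re.compile(r"@\w+")


def preprocess(text: str) -> str:
    """Minimal cleanup: swap URLs, emails and @mentions for fixed tokens.

    Arabic diacritics (U+064B-U+0652) are deliberately left untouched so
    they end up as part of the tokenizer's vocabulary.
    """
    text = URL_RE.sub(URL_TOKEN, text)      # URLs first, before the email rule
    text = EMAIL_RE.sub(EMAIL_TOKEN, text)  # then email addresses
    return USER_RE.sub(USER_TOKEN, text)    # then social media user mentions


def sample_corpus(path: str, fraction: float = 0.05):
    """Yield a ~5% random sample of preprocessed lines from the training set."""
    rng = random.Random(0)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if rng.random() < fraction:
                yield preprocess(line)


# BPE is an assumption; the README only states the vocabulary size.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=64_000,
    special_tokens=["[UNK]", URL_TOKEN, EMAIL_TOKEN, USER_TOKEN],
)
tokenizer.train_from_iterator(sample_corpus("train.txt"), trainer=trainer)
tokenizer.save("tokenizer.json")
```

Registering the placeholder strings as special tokens means the trained tokenizer encodes each of them as a single atomic token rather than splitting them into subwords.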