readme updated
README.md CHANGED
@@ -15,8 +15,12 @@ The pretrained corpus is OSCAR's Turkish split, but it is further filtered and c
 
 Model architecture is similar to bert-medium (8 layers, 8 heads, and a hidden size of 512). Tokenization is character-level, which means that text is split into individual characters. The vocabulary size is 16.7k.
 
-## Note that this model does not include a tokenizer file, because it uses ByT5Tokenizer. The following code can be used for tokenization; the example max length (1024) can be changed:
+## Note that this model does not include a tokenizer file, because it uses ByT5Tokenizer. The following code can be used for model loading and tokenization; the example max length (1024) can be changed:
 ```
+model = AutoModel.from_pretrained([model_path])
+# for sequence classification:
+# model = AutoModelForSequenceClassification.from_pretrained([model_path], num_labels=[num_classes])
+
 tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
 tokenizer.mask_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][0]
 tokenizer.cls_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][1]
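The hunk above ends before the README's code block closes, so for reference here is a minimal, self-contained sketch of how the snippet fits together: loading the model, reusing google/byt5-small's ByT5Tokenizer with two of its extra special tokens remapped to mask/CLS, and encoding text with the example max length of 1024. The repository id in `model_path` and the example sentence are placeholders, not values taken from the card.

```python
from transformers import AutoModel, ByT5Tokenizer

# Placeholder: replace with the actual model repository id or a local path.
model_path = "your-username/your-character-level-turkish-model"

# Load the encoder weights; use AutoModelForSequenceClassification instead
# when fine-tuning for classification, as noted in the card.
model = AutoModel.from_pretrained(model_path)

# The card's tokenizer setup: the model ships no tokenizer file, so reuse
# google/byt5-small's ByT5Tokenizer and map two of its additional special
# tokens to serve as the mask and CLS tokens.
tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
tokenizer.mask_token = tokenizer.special_tokens_map_extended["additional_special_tokens"][0]
tokenizer.cls_token = tokenizer.special_tokens_map_extended["additional_special_tokens"][1]

# Encode a sample sentence; 1024 is the example max length mentioned in the card.
inputs = tokenizer(
    "Merhaba dünya!",
    max_length=1024,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```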