ctoraman committed
Commit cf983c2 · 1 Parent(s): 41073f5

readme updated

Files changed (1)
  1. README.md +5 -1
README.md CHANGED
@@ -15,8 +15,12 @@ The pretrained corpus is OSCAR's Turkish split, but it is further filtered and c
 
  Model architecture is similar to bert-medium (8 layers, 8 heads, and 512 hidden size). Tokenization algorithm is Character-level, which means that text is split by individual characters. Vocabulary size is 16.7k.
 
- ## Note that this model does not include a tokenizer file, because it uses ByT5Tokenizer. The following code can be used for tokenization, example max length(1024) can be changed:
+ ## Note that this model does not include a tokenizer file, because it uses ByT5Tokenizer. The following code can be used for model loading and tokenization, example max length(1024) can be changed:
  ```
+ model = AutoModel.from_pretrained([model_path])
+ #for sequence classification:
+ #model = AutoModelForSequenceClassification.from_pretrained([model_path], num_labels=[num_classes])
+
  tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
  tokenizer.mask_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][0]
  tokenizer.cls_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][1]