Update tokenizer_config.json
Without setting `model_max_length`, the model breaks with HF's `pipeline` function when the input exceeds 512 tokens:
```py
from transformers import pipeline

# Reproduction: without model_max_length in tokenizer_config.json, embedding a
# long (>512-token) input through the feature-extraction pipeline fails.
pipe = pipeline('feature-extraction', 'thenlper/gte-small')
input = "# 2024 Summer Olympics\n\n## The Games\\[edit\\]\n\n### Sports\\[edit\\]\n\nle\"><span>Image</span></span> Basketball<ul><li>Basketball <small>(2)</small></li><li>3×3 basketball <small>(2)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span></span> Boxing <small>(13)</small></li><li><span typeof=\"mw:File\"><span>Image</span></span> Breaking <small>(2)</small></li></ul></td><td><ul><li><span typeof=\"mw:File\"><span>Image</span></span> Canoeing<ul><li>Slalom <small>(6)</small></li><li>Sprint <small>(10)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span></span> Cycling<ul><li>BMX freestyle <small>(2)</small></li><li>BMX racing <small>(2)</small></li><li>Mountain biking <small>(2)</small></li><li>Road <small>(4)</small></li><li>Track <small>(12)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span></span> Equestrian<ul><li>Dressage <small>(2)</small></li><li>Eventing <small>(2)</small></li><li>Jumping <small>(2)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span><a"
output = pipe(input)
```
See https://github.com/xenova/transformers.js/issues/355 for more information. This modification was also made to the Transformers.js-compatible version of the model: https://huggingface.co/Xenova/gte-small/commit/7ca943b8ff118ce9eb87aa3a5669f26e3d633fd7
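After the change, the limit can be checked with a quick sketch like the one below (not part of the commit; variable names are illustrative and a recent `transformers` version is assumed). With `model_max_length` set to 512, the tokenizer reports the limit and `truncation=True` cuts long inputs to it instead of overflowing the model's 512 position embeddings:

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('thenlper/gte-small')
print(tokenizer.model_max_length)  # expected: 512 after this change

long_text = "olympics " * 2000                   # well over 512 tokens
encoded = tokenizer(long_text, truncation=True)  # truncates at model_max_length
print(len(encoded['input_ids']))                 # expected: 512
```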
- tokenizer_config.json +1 -1
```diff
@@ -4,7 +4,7 @@
   "do_basic_tokenize": true,
   "do_lower_case": true,
   "mask_token": "[MASK]",
-  "model_max_length":
+  "model_max_length": 512,
   "never_split": null,
   "pad_token": "[PAD]",
   "sep_token": "[SEP]",
```