Xenova (HF staff) committed
Commit 15f6762
1 Parent(s): c20abe8

Update tokenizer_config.json

Without `model_max_length` set to the model's 512-token limit, HF's `pipeline` function breaks on inputs longer than 512 tokens:
```py
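from transformers import pipeline

# Feature-extraction pipeline for the model whose tokenizer config this commit fixes
pipe = pipeline('feature-extraction', 'thenlper/gte-small')
# Sample input (scraped HTML) long enough to exceed 512 tokens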
input = "# 2024 Summer Olympics\n\n## The Games\\[edit\\]\n\n### Sports\\[edit\\]\n\nle\"><span>Image</span></span> Basketball<ul><li>Basketball <small>(2)</small></li><li>3×3 basketball <small>(2)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span></span> Boxing <small>(13)</small></li><li><span typeof=\"mw:File\"><span>Image</span></span> Breaking <small>(2)</small></li></ul></td><td><ul><li><span typeof=\"mw:File\"><span>Image</span></span> Canoeing<ul><li>Slalom <small>(6)</small></li><li>Sprint <small>(10)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span></span> Cycling<ul><li>BMX freestyle <small>(2)</small></li><li>BMX racing <small>(2)</small></li><li>Mountain biking <small>(2)</small></li><li>Road <small>(4)</small></li><li>Track <small>(12)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span></span> Equestrian<ul><li>Dressage <small>(2)</small></li><li>Eventing <small>(2)</small></li><li>Jumping <small>(2)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span><a"
output = pipe(input)
```

See https://github.com/xenova/transformers.js/issues/355 for more information. This modification was also made to the Transformers.js-compatible version of the model: https://huggingface.co/Xenova/gte-small/commit/7ca943b8ff118ce9eb87aa3a5669f26e3d633fd7
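
To illustrate why the value matters, here is a minimal pure-Python sketch of truncating a single sequence to `model_max_length`. It is a simplification for illustration, not the tokenizer's actual implementation: the function name `truncate_to_limit`, the placeholder token ids, and the `[CLS]`/`[SEP]` ids (101 and 102, assumed BERT-style) are all assumptions that do not come from this commit.

```python
from typing import List

CLS_ID = 101  # assumed BERT-style [CLS] id, for illustration only
SEP_ID = 102  # assumed BERT-style [SEP] id, for illustration only


def truncate_to_limit(token_ids: List[int], model_max_length: int = 512) -> List[int]:
    """Wrap token_ids in [CLS]/[SEP] and cut so the total fits model_max_length."""
    budget = model_max_length - 2  # reserve room for the two special tokens
    return [CLS_ID] + token_ids[:budget] + [SEP_ID]


# 700 placeholder token ids standing in for a long tokenized input
long_input = list(range(1000, 1700))

# With the new value, the encoded sequence fits the 512-token limit.
fixed = truncate_to_limit(long_input, 512)
print(len(fixed))  # 512

# With the old, effectively unbounded value, nothing is cut.
unbounded = truncate_to_limit(long_input, 1000000000000000019884624838656)
print(len(unbounded))  # 702
```

With the old value, the full 702-token sequence would be passed on unchanged, which matches the failure described above for inputs longer than 512 tokens.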

Files changed (1)
  1. tokenizer_config.json +1 -1
tokenizer_config.json
@@ -4,7 +4,7 @@
   "do_basic_tokenize": true,
   "do_lower_case": true,
   "mask_token": "[MASK]",
-  "model_max_length": 1000000000000000019884624838656,
+  "model_max_length": 512,
   "never_split": null,
   "pad_token": "[PAD]",
   "sep_token": "[SEP]",
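
The same change can be sketched as a small standard-library script that detects an implausibly large `model_max_length` and replaces it. The function name `fix_model_max_length` and the threshold `UNSET_THRESHOLD` are hypothetical helpers introduced for illustration. The 512 limit is taken from this commit. The idea that the old value acts as a "no limit configured" placeholder is an assumption.

```python
import json

# The model's real limit, as stated in this commit.
MODEL_LIMIT = 512
# Hypothetical threshold: anything above it is treated as "no limit configured".
UNSET_THRESHOLD = 10**12


def fix_model_max_length(config_text: str, limit: int = MODEL_LIMIT) -> str:
    """Return the config JSON with a missing or implausibly large model_max_length replaced."""
    config = json.loads(config_text)
    current = config.get("model_max_length")
    if current is None or current > UNSET_THRESHOLD:
        config["model_max_length"] = limit
    return json.dumps(config, indent=2)


# A fragment of the old tokenizer_config.json from the diff above
old_config = (
    '{"mask_token": "[MASK]", '
    '"model_max_length": 1000000000000000019884624838656, '
    '"never_split": null}'
)

new_config = json.loads(fix_model_max_length(old_config))
print(new_config["model_max_length"])  # 512
```

A value that is already plausible, such as 256, is left untouched, so the script only rewrites configs where the limit looks unset.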