Add `"add_prefix_space": true,`; this allows for much stronger token-level performance (e.g. NER, ColBERT)
Hello!
Pull Request overview
- Add `"add_prefix_space": true,` to the tokenizer config
Details
This allows for much stronger token-level performance (e.g. NER, ColBERT): without it, tokens are not prepended with a space, whereas our model was trained on data where each token is prepended with a space.
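To illustrate (my own sketch, not part of the PR): token-classification inputs are typically pre-split into words, and with `add_prefix_space=True` each word should be encoded with a leading space, matching what the model saw during training. The word list and `is_split_into_words` usage here are just for demonstration:

```python
from transformers import AutoTokenizer

# Pre-split words, as token-classification (NER) inputs usually are
tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base", revision="refs/pr/48")

words = ["I", "am", "a", "test", "sentence", "."]
encoded = tok(words, is_split_into_words=True)

# With add_prefix_space=True, each word is tokenized as if preceded by a space,
# i.e. the same form the model saw during training
print(tok.convert_ids_to_tokens(encoded["input_ids"]))
```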
We will need to explain somewhere in the model card that users can set `add_prefix_space` to `False`.
cc @bclavie @bwarner @NohTow, could one of you take care of that?
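For the model card, something like this minimal sketch might do (it just mirrors the load call from the test script below):

```python
from transformers import AutoTokenizer

# Opt out of the new default and restore the previous behaviour
# (no automatic leading space)
tokenizer = AutoTokenizer.from_pretrained(
    "answerdotai/ModernBERT-base", add_prefix_space=False
)
```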
P.S. Feel free to hold off on merging for now; this PR can also be used to run some tests first (with `revision="refs/pr/..."`).
Note that you need a `transformers` build that includes https://github.com/huggingface/transformers/pull/35593.
- Tom Aarsen
I used this quick script:
```python
from transformers import AutoTokenizer

# Load the tokenizer three ways: explicitly without a prefix space,
# explicitly with one, and with whatever the config specifies
tok_no_space = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base", add_prefix_space=False, revision="refs/pr/48")
tok_space = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base", add_prefix_space=True, revision="refs/pr/48")
tok_base = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base", revision="refs/pr/48")

tokenizers = [tok_no_space, tok_space, tok_base]
tokenizers_names = ["add_prefix_space=False", "add_prefix_space=True", "neither (i.e. uses the config)"]
text = "I am a test sentence."

for tok, name in zip(tokenizers, tokenizers_names):
    print(f"Tokenizing with {name}")
    tokenized = tok(text)
    print(tokenized)
    # Decode the full sequence back to text
    original = tok.decode(tokenized["input_ids"])
    print(original)
    # Decode each id separately to see the individual tokens
    tokens = tok.batch_decode(tokenized["input_ids"])
    print(tokens)
    print()
```
Which results in:
Before (i.e. without `revision="refs/pr/48"`):
```
Tokenizing with add_prefix_space=False
{'input_ids': [50281, 42, 717, 247, 1071, 6197, 15, 50282], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
[CLS]I am a test sentence.[SEP]
['[CLS]', 'I', ' am', ' a', ' test', ' sentence', '.', '[SEP]']

Tokenizing with add_prefix_space=True
{'input_ids': [50281, 309, 717, 247, 1071, 6197, 15, 50282], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] I am a test sentence.[SEP]
['[CLS]', ' I', ' am', ' a', ' test', ' sentence', '.', '[SEP]']

Tokenizing with neither (i.e. uses the config)
{'input_ids': [50281, 42, 717, 247, 1071, 6197, 15, 50282], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
[CLS]I am a test sentence.[SEP]
['[CLS]', 'I', ' am', ' a', ' test', ' sentence', '.', '[SEP]']
```
And with `revision="refs/pr/48"`:
```
Tokenizing with add_prefix_space=False
{'input_ids': [50281, 42, 717, 247, 1071, 6197, 15, 50282], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
[CLS]I am a test sentence.[SEP]
['[CLS]', 'I', ' am', ' a', ' test', ' sentence', '.', '[SEP]']

Tokenizing with add_prefix_space=True
{'input_ids': [50281, 309, 717, 247, 1071, 6197, 15, 50282], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] I am a test sentence.[SEP]
['[CLS]', ' I', ' am', ' a', ' test', ' sentence', '.', '[SEP]']

Tokenizing with neither (i.e. uses the config)
{'input_ids': [50281, 309, 717, 247, 1071, 6197, 15, 50282], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] I am a test sentence.[SEP]
['[CLS]', ' I', ' am', ' a', ' test', ' sentence', '.', '[SEP]']
```
Note that with revision `refs/pr/48`, a space is automatically added before the `I`, and that this can still be disabled with `add_prefix_space=False`.
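That space also shows up in the ids: the first word piece flips from 42 (`I`) to 309 (` I`). A quick way to inspect those two vocabulary entries (byte-level BPE tokenizers usually render the leading space as `Ġ`, so the exact output may vary):

```python
from transformers import AutoTokenizer

tok_base = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base", revision="refs/pr/48")

# Ids taken from the outputs above: 42 decodes to "I", 309 to " I"
print(tok_base.convert_ids_to_tokens([42, 309]))
# expected: something like ['I', 'ĠI'], with Ġ marking the leading space
```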
- Tom Aarsen