Fill-Mask · Transformers · PyTorch · modernbert

Update Idea: Train with Gemma3 Tokenizer?

#12
by mxngjxa - opened

I noticed that the model was trained with Gemma2's tokenizer, but since the release of google/gemma-3-270m (or any gemma-3 variant), would it not make sense to update this model with the Gemma3 tokenizer layers? That would make it possible to scale this model to 128k tokens.

mxngjxa changed discussion title from Update Idea: Train with Gemma3? to Update Idea: Train with Gemma3 Tokenizer?
Center for Language and Speech Processing @ JHU org

Hey @mxngjxa! I think there are two parts to your question; both are great ideas, but neither is trivial to do:

(1) Updating the tokenizer from Gemma2 to Gemma3. This would be awesome, but sadly it is quite hard to do, since the model is already trained and has learned the original vocabulary. If someone wanted to pre-train from scratch, though, I would recommend Gemma3's tokenizer; that is a great idea.
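As a rough illustration of why the swap is not a drop-in change, here is a minimal sketch comparing how the two tokenizers split the same string. It assumes you have accepted the Gemma license on the Hub and are logged in (both checkpoints are gated), and `google/gemma-2-2b` is just one convenient source for the Gemma2 tokenizer.

```python
from transformers import AutoTokenizer

# Gated repos: requires accepting the Gemma license and `huggingface-cli login`.
tok_g2 = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tok_g3 = AutoTokenizer.from_pretrained("google/gemma-3-270m")

text = "ModernBERT encoder trained with a Gemma tokenizer"
ids_g2 = tok_g2.encode(text, add_special_tokens=False)
ids_g3 = tok_g3.encode(text, add_special_tokens=False)

print(len(tok_g2), len(tok_g3))  # vocabulary sizes (including added tokens)
print(ids_g2)
print(ids_g3)
# Even where the two vocabularies overlap, the string-to-ID mapping generally
# differs, and the trained embedding matrix is tied row-by-row to the old
# mapping. Swapping the tokenizer after training therefore scrambles which
# embedding each token looks up, which is why it only really works if you
# pre-train from scratch with the new tokenizer.
```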

(2) Re: the 128k token context length, that was done by the Gemma post-training group with their proprietary long-context data. Sadly we don't have that data, but if someone were to collect 128k-length long-context training data, you could adapt our model to that context length the same way we did with 8k-length data; a sketch of the starting point is below.
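A minimal sketch of what that adaptation might look like, purely as an assumption-laden starting point: the Hub ID is a placeholder for this model's repo, the config field names (`max_position_embeddings`, `global_rope_theta`) may differ from the actual config, and the config change alone does nothing useful without continued masked-LM training on genuinely long documents.

```python
from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer

model_id = "jhu-clsp/<this-model>"  # hypothetical placeholder for this repo's Hub ID

config = AutoConfig.from_pretrained(model_id)
# Extend the maximum sequence length. For RoPE-based encoders like ModernBERT,
# the rotary base (theta) is typically raised as well so positions beyond the
# original window behave sensibly. Field names here are assumptions.
config.max_position_embeddings = 131072
if hasattr(config, "global_rope_theta"):
    config.global_rope_theta = 1_000_000.0

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, config=config)

# From here, continued pre-training on 128k-token documents (e.g. with
# DataCollatorForLanguageModeling and a standard Trainer loop) is what
# actually teaches the model to use the extended window, mirroring how the
# original 8k-length adaptation was done with long training sequences.
```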
