
Can we have some official training / finetuning recipes for this model?

#11
by StephennFernandes

Hi, on the latest version of transformers I tried to finetune mmBERT on the text classification tasks:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification

When I tried to use mmBERT as a drop-in replacement for the original uncased BERT, even after several epochs the accuracy is stuck at 0.3 and the F1 score is always 0.
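
For context, this is roughly how I'm swapping it in (a minimal sketch, not my exact run; the `jhu-clsp/mmBERT-base` checkpoint id and `num_labels=3` are my assumptions for illustration):

```python
# Minimal sketch of the drop-in swap; the checkpoint id "jhu-clsp/mmBERT-base"
# and num_labels=3 are assumptions, not values from the example script.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "jhu-clsp/mmBERT-base"  # previously "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)
```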

It seems like the mmBERT models are not directly compatible with all BERT finetuning techniques at the moment.

Would really appreciate it if we could get some training / finetuning guidelines and examples so we could use mmBERT in all the ways we used mBERT or BERT before.

Center for Language and Speech Processing @ JHU org

The evaluations were done with (a slightly older version of) this script, and others have already fine-tuned it with the example scripts, so it does work with the right environment. Perhaps it is an issue with the attention function, as I had flash attention installed? I know some ModernBERT models had issues with the backup attention function (SDPA attention) in the past, though I thought it was resolved. Try something like `pip install "flash_attn==2.6.3" --no-build-isolation` or similar and see if it changes anything.
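
For example, something along these lines (a rough sketch, with the checkpoint id and label count as placeholders) lets you force a specific attention implementation and compare runs:

```python
# Rough sketch: explicitly pick the attention implementation when loading, so you
# can compare flash attention against the SDPA fallback. The checkpoint id and
# num_labels are placeholders.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "jhu-clsp/mmBERT-base",
    num_labels=3,
    torch_dtype=torch.bfloat16,               # flash attention needs fp16/bf16
    attn_implementation="flash_attention_2",  # try "sdpa" or "eager" to compare
)
```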

@orionweller thanks for responding, it really means a lot.

I wanted to know how I could continue pretraining the mmBERT model on more custom pretraining data. Are there any resources for this? What do you recommend as the most stable and performant way to further continually pretrain mmBERT?

Center for Language and Speech Processing @ JHU org

It really depends on what you want to pre-train on. You might want to check out RexBERT or BioClinicalBERT for examples. You will need to gather the pre-training data you want and then continue pre-training from one of the checkpoints in mmbert-checkpoints, likely the decay one.

More details can be found in ModernBERT's repo for the training code or Ettin's repo for the data preparation side.
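
The continued pre-training step itself is just standard masked-language-modeling on your own corpus with the Trainer; a rough sketch (the checkpoint id, data file, sequence length, and hyperparameters below are placeholders, not the settings we used) looks like:

```python
# Rough sketch of continued MLM pre-training on custom data.
# The checkpoint id, data file, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder: point this at whichever decay-phase checkpoint you pull
# from the mmbert-checkpoints repo.
checkpoint = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Your own corpus, one text example per line (placeholder filename).
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking for MLM; 0.15 is a placeholder masking probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mmbert-continued",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    bf16=True,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```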
