I'd just start with ModernBERT-large, though; it's easier and a strong base. Less faffing about. Also, big vocab <3
They do PCA (prior to the Zipf weighting) and explicitly state that they found it improved performance.
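For context, here's a minimal sketch of what that pipeline looks like: PCA over the static token embeddings, followed by Zipf-style frequency weighting. The shapes, component count, and exact weighting function are illustrative assumptions, not model2vec's published settings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical input: one static embedding per vocabulary token,
# rows assumed sorted by corpus frequency (row 0 = most frequent).
embeddings = np.random.randn(32_000, 768).astype(np.float32)

# 1) PCA first: decorrelate the dimensions (optionally reducing them).
pca = PCA(n_components=256)  # 256 is an illustrative choice
reduced = pca.fit_transform(embeddings)

# 2) Zipf weighting afterwards: under Zipf's law, frequency ~ 1 / rank,
#    so a smooth function of rank down-weights very frequent tokens.
ranks = np.arange(1, len(reduced) + 1)
weighted = reduced * np.log(ranks + 1)[:, None]
```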
Did you try potion/model2vec as a starting point? (Never mind ModernBERT, with its much larger vocab.)
This is really cool! I'm surprised you do better than model2vec. Is the difference really just the use of a (better) contrastive pretraining loss?
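For reference, the kind of contrastive pretraining loss being asked about is typically an in-batch InfoNCE objective. A minimal sketch follows; the embedding dimension, batch construction, and temperature are placeholder assumptions, not the authors' actual recipe.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: row i of `positives` is the positive for row i
    of `anchors`; every other row in the batch acts as a negative."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature                   # (batch, batch) cosine sims
    labels = torch.arange(len(a), device=a.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)

# Toy usage with random "sentence" embeddings:
loss = info_nce_loss(torch.randn(8, 256), torch.randn(8, 256))
```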