Code-based version next please. Also, why aren't this one and the similarly sized multilingual model available through Ollama pulls?

Opened by TimeLordRaps

The Ollama website only lists the 30m English and 278m multilingual models.

IBM Granite org

Hi @TimeLordRaps ! At the moment, we have only released the smallest and largest models on Ollama to simplify the user story. If there's interest in getting the two middle-sized versions released there as well, we're certainly open to it! For the time being, you can also convert these checkpoints to GGUF and run them with Ollama locally:

  • Convert using convert_hf_to_gguf.py from llama.cpp

    python convert_hf_to_gguf.py /path/to/granite-embedding-125m-english --outfile /path/to/granite-embedding-125m-english/granite-embedding-125m-english.gguf
    
  • Create a Modelfile pointing at the converted GGUF file

    FROM /path/to/granite-embedding-125m-english.gguf
    
  • Import it directly into Ollama

    ollama create granite-embedding-local:125m -f Modelfile
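
For a quick sanity check that the imported model works, here is a minimal sketch of requesting an embedding from Ollama's REST API in Python (assuming a recent Ollama version running on its default port, and the granite-embedding-local:125m tag created above):

    import requests

    # Ask the locally imported model for an embedding via Ollama's REST API.
    # Assumes Ollama is serving on the default host/port (localhost:11434).
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={
            "model": "granite-embedding-local:125m",
            "prompt": "What does convert_hf_to_gguf.py do?",
        },
    )
    response.raise_for_status()

    embedding = response.json()["embedding"]
    print(len(embedding))  # dimensionality of the returned embedding vector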
    
IBM Granite org

Hi @TimeLordRaps , this model does well on code too. It is better than most other similarly sized models on the CoIR benchmark.
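
For example, here is a minimal code-retrieval sketch with sentence-transformers (assuming the ibm-granite/granite-embedding-125m-english checkpoint ID on the Hugging Face Hub, and that it loads through SentenceTransformer like the other Granite embedding releases):

    from sentence_transformers import SentenceTransformer, util

    # Load the 125m English Granite embedding model from the Hugging Face Hub.
    model = SentenceTransformer("ibm-granite/granite-embedding-125m-english")

    query = "function that reverses a string"
    snippets = [
        "def reverse_string(s): return s[::-1]",
        "def add(a, b): return a + b",
        "for i in range(10): print(i)",
    ]

    # Embed the query and the candidate snippets, then rank snippets by cosine similarity.
    query_emb = model.encode(query, convert_to_tensor=True)
    snippet_embs = model.encode(snippets, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, snippet_embs)[0]

    for snippet, score in sorted(zip(snippets, scores.tolist()), key=lambda x: -x[1]):
        print(f"{score:.3f}  {snippet}")

The snippet whose embedding is closest to the query should rank first, which is the basic pattern behind using the model for code search.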

Did you investigate different pretraining token quantities to gauge the effect of training data scale on embedding model performance? Given the training data, even slight variations across parallel pretraining scales could be saved for the last pretraining step on the most general, most difficult dataset, effectively creating a narrow range of pretrained models in pretraining token scale relative to the overall scale. From my perspective, that narrow range would be the first simple, easily executable step toward understanding embedding scaling, which seems to have been lacking or to have gone unnoticed more broadly.

The above depends on dataset steps rather than dataset mixing...

My whole original point about making a code-based version rested on the expectation that this model is already quite good at code. I don't think there is currently a widely recognizable code embedding model, and I was offering that perception subtly, and, so to speak, giving even more insight into how I would go about this post-training...

Based on their comparative performance across benchmarks, it seems these models were in fact trained on lots of code and are already what I would personally consider code-targeted embedding models. Since that appears to be the case, and since you value the user story, I think the story here is that these are code embedding models... My apologies for not having done deep enough research initially. Can't wait to give these a try in mindcraft.

TimeLordRaps changed discussion status to closed
