Code-based version next, please! Also, why aren't this one and the similarly sized multilingual model available through Ollama pulls?
The Ollama website only lists the 30m English and 278m multilingual models.
Hi @TimeLordRaps ! At the moment, we have only released the smallest and largest models on Ollama to simplify the user story. If there's interest in getting the two middle-sized versions released there as well, we're certainly open to it! For the time being, you can also convert these checkpoints to GGUF and run them with Ollama locally:
1. Convert using `convert_hf_to_gguf.py` from `llama.cpp`:
   - For the english models (`roberta` architecture), you will need to use the version from this PR as we work to get it merged.
   - Make sure you have installed the right Python requirements from `requirements/requirements-convert_hf_to_gguf.txt`.

   `convert_hf_to_gguf.py /path/to/granite-embedding-125m-english --outfile /path/to/granite-embedding-125m-english/granite-embedding-125m-english.gguf`
2. Create a `Modelfile` pointing at the converted GGUF file:

   `FROM /path/to/granite-embedding-125m-english.gguf`

3. Import it directly into Ollama:

   `ollama create granite-embedding-local:125m -f Modelfile`
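Once the model is created, you can sanity-check it by requesting an embedding from the local Ollama server. This is only a rough sketch, assuming the default server on port 11434, Ollama's `/api/embeddings` endpoint, and the `granite-embedding-local:125m` tag created above; adjust names as needed:

```python
# Rough sketch: request one embedding from the locally imported model.
# Assumes a default Ollama server at http://localhost:11434 and the
# model tag "granite-embedding-local:125m" created in the step above.
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={
        "model": "granite-embedding-local:125m",
        "prompt": "What does the Modelfile FROM line point at?",
    },
)
resp.raise_for_status()
embedding = resp.json()["embedding"]  # flat list of floats
print(f"embedding dimension: {len(embedding)}")
```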
Hi @TimeLordRaps , this model does well on code too. It is better than most other similarly sized models on the CoIR benchmark.
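If you want to try it on a code-retrieval task quickly, here is a minimal sketch using sentence-transformers. The model id below (`ibm-granite/granite-embedding-125m-english`) and the cosine-similarity ranking are assumptions for illustration, not an official recipe:

```python
# Minimal code-retrieval sketch with sentence-transformers.
# The model id is assumed from the checkpoint name in this thread;
# swap in whichever checkpoint or locally converted model you use.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-125m-english")

query = "read a JSON configuration file from disk"
snippets = [
    "def load_config(path):\n    import json\n    with open(path) as f:\n        return json.load(f)",
    "def save_checkpoint(model, path):\n    torch.save(model.state_dict(), path)",
    "def fetch_url(url):\n    import urllib.request\n    return urllib.request.urlopen(url).read()",
]

# Embed the query and candidates, then rank by cosine similarity.
query_emb = model.encode(query, convert_to_tensor=True)
snippet_embs = model.encode(snippets, convert_to_tensor=True)
scores = util.cos_sim(query_emb, snippet_embs)[0]

best = int(scores.argmax())
print(f"best match (score {scores[best].item():.3f}):\n{snippets[best]}")
```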
Did you investigate different pretraining token quantities to gauge the effect of training-data scale on embedding performance? Based on the training data, checkpoints at slightly different pretraining scales could be saved just before the last pretraining step on the most general, most difficult dataset, effectively creating a narrow range of pretrained models in token scale relative to the overall scale. From my perspective, that narrow range would be a simple, easily executable first step toward understanding embedding scaling, which seems to be lacking, or to have gone unnoticed more broadly.
The above depends on training in dataset steps rather than dataset mixing...
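Concretely, here is a toy sketch of what I mean, just to make the idea executable; the model, dataloader, and token budgets are all hypothetical scaffolding:

```python
# Toy sketch of the idea above: during the final (most general, most
# difficult) pretraining stage, snapshot the model at a narrow range of
# token budgets so data-scale effects can be compared afterwards.
# model / optimizer / dataloader / save_fn are hypothetical placeholders.
def final_stage(model, optimizer, dataloader, save_points, save_fn):
    """Train on the last-stage dataset, saving checkpoints at given token counts."""
    tokens_seen = 0
    remaining = sorted(save_points)
    for batch in dataloader:                      # batch: dict with "input_ids", ...
        loss = model(**batch).loss                # HF-style forward returning a loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        tokens_seen += batch["input_ids"].numel()
        while remaining and tokens_seen >= remaining[0]:
            save_fn(model, tokens_seen)           # e.g. write a checkpoint to disk
            remaining.pop(0)

# Illustrative call: snapshot at 95%, 97.5%, and 100% of a 1T-token budget.
# final_stage(model, opt, loader, [950e9, 975e9, 1_000e9], save_ckpt)
```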
My whole original point about making a code-based version rested on the expectation that this model is already quite good at code. I don't think there is currently a widely recognized code embedding model, and I was offering that perception subtly, and, so to speak, giving a bit more insight into how I would go about the post-training...
Based on their comparative performance across benchmarks, it seems these models were in fact trained on lots of code and are already what I would personally consider code-targeted embedding models. Since that seems to be the case, and you value the user story, I think the story here is that these are code embedding models... My apologies for not having done deep enough research initially. Can't wait to give these a try in mindcraft.