Getting token embeddings instead of sentence embeddings
Hi all,
I'm trying to load and use this model via llama-cpp-python. I've downloaded the quantized checkpoint and tried running it on my system:
from llama_cpp import Llama

gguf_embed = Llama(
    model_path="./models/embedding_model_qwen3_gguf/Qwen3-Embedding-0.6B-Q8_0.gguf",
    embedding=True,
)
gembed = gguf_embed.embed("this is just checking")
Here, len(gembed) comes back as 4 (one embedding per input token) and len(gembed[0]) is 1024 (the embedding dimension).
Am I missing something? We should get a single sentence embedding as output, right?
This is not the case when I use the base model with sentence-transformers.
Thanks in advance
a) run this model with last_pooling
b) don't use these GGUFs, the tokenizer is broken
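For (a): recent llama-cpp-python builds expose a `pooling_type` argument on `Llama` (e.g. `pooling_type=llama_cpp.LLAMA_POOLING_TYPE_LAST` alongside `embedding=True`), which makes `embed()` return one pooled vector instead of per-token vectors. If that's not available in your build, you can pool the per-token output yourself. A minimal sketch of last-token pooling; the token vectors below are made-up stand-ins for `gembed`:

```python
import math

def last_token_pool(token_embeddings):
    """Collapse per-token embeddings into one sentence vector.

    Qwen3-Embedding uses last-token pooling: the sentence embedding
    is the (L2-normalized) embedding of the final token.
    """
    last = token_embeddings[-1]
    norm = math.sqrt(sum(x * x for x in last)) or 1.0
    return [x / norm for x in last]

# hypothetical per-token output shaped like gembed: 4 tokens x 3 dims
tokens = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [3.0, 4.0, 0.0]]
sentence = last_token_pool(tokens)  # unit-length vector from the last token
```

With the real output you'd call `last_token_pool(gembed)` to get a single 1024-dim sentence embedding.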
Which GGUFs should one use? Convert their own?
that was forever ago, disregard. with that being said, i would benchmark the GGUF speed. i recently tested embeddinggemma GGUF and on my 3090 llama.cpp was 10x slower than running the ONNX model on my CPU. you're probably better off using bitsandbytes or torch or something if you want it quantized. also, on my own personal retrieval benchmark, embeddinggemma beats qwen3 .6B by a healthy margin and also qwen3 4B (barely), YMMV
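If you do want to benchmark GGUF against another backend, a rough throughput harness is enough to catch a 10x gap; `embed_fn` here is a stand-in for whatever backend you're testing (e.g. `gguf_embed.embed` or an ONNX session call), and the dummy function below exists only so the sketch runs on its own:

```python
import time

def embeds_per_second(embed_fn, texts, warmup=2, runs=5):
    """Rough throughput of an embedding callable, in texts per second."""
    for text in texts[:warmup]:      # warm up lazy init / caches
        embed_fn(text)
    start = time.perf_counter()
    for _ in range(runs):
        for text in texts:
            embed_fn(text)
    elapsed = time.perf_counter() - start
    return (len(texts) * runs) / elapsed

# stand-in backend; swap in your real embed call to compare
dummy_embed = lambda text: [0.0] * 1024
rate = embeds_per_second(dummy_embed, ["some", "sample", "sentences"])
```

Run the same texts through each backend and compare the rates; single-text calls like this also surface per-call overhead that batch benchmarks can hide.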
Oh, thank you for your input. I'm selecting an embedding model to play with locally on some data I have. I wanted to use Qwen3 0.6B, but I guess I'm gonna switch to embeddinggemma. I really have no idea what to use for inference; I just have a GTX 1050. Can you recommend anything for inference if I just want it to work and be fast? I might not even need it to be quantized.
that was just for retrieval, but MTEB benchmarks a lot of other stuff (clustering, classification, etc.), so depending on your task Qwen may still be better. but embeddinggemma is amazingly good for retrieval
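For a quick retrieval sanity check of your own, cosine similarity over the sentence embeddings is all you need. A self-contained sketch with toy 2-dim vectors standing in for real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# toy vectors standing in for real sentence embeddings
query = [1.0, 0.0]
docs = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]}

# rank documents by similarity to the query, best first
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
# doc_a ranks first: it points nearly the same direction as the query
```

Swapping in embeddings from each candidate model and eyeballing the rankings on a handful of your own query/document pairs is a cheap way to compare them before committing.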