Getting token embeddings instead of sentence embeddings

#8
by cicada330117

Hi all,

I'm trying to load and use this model via llama-cpp-python. I've downloaded the quantized checkpoint and tried running it on my system:

from llama_cpp import Llama

# load the quantized checkpoint in embedding mode
gguf_embed = Llama(
    model_path="./models/embedding_model_qwen3_gguf/Qwen3-Embedding-0.6B-Q8_0.gguf",
    embedding=True,
)

gembed = gguf_embed.embed("this is just checking")

Here, len(gembed) comes out as 4 (the number of tokens in the input), and len(gembed[0]) is 1024 (the embedding dimension).

Am I missing something? We should be getting a single sentence embedding as output, right?
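A rough manual workaround over the per-token output above, assuming last-token pooling (which the replies below point to as the right scheme for this model):

import numpy as np

token_embs = np.asarray(gembed)  # shape: (n_tokens, 1024), one row per token
sentence_emb = token_embs[-1]  # last-token pooling
sentence_emb = sentence_emb / np.linalg.norm(sentence_emb)  # L2-normalize for cosine similarity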

This is not the case when I use the base model with sentence-transformers.
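For reference, the sentence-transformers side is roughly this, and it returns one pooled vector per sentence (assuming the base repo is Qwen/Qwen3-Embedding-0.6B):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
emb = model.encode("this is just checking")
print(emb.shape)  # (1024,) - one pooled sentence embedding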

Thanks in advance.

a) run this model with last_pooling

b) don't use these GGUFs, the tokenizer is broken
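For option (a), a minimal sketch with llama-cpp-python, assuming a version recent enough to expose the pooling_type argument (llama-server has the matching --pooling last flag):

import llama_cpp

llm = llama_cpp.Llama(
    model_path="./models/embedding_model_qwen3_gguf/Qwen3-Embedding-0.6B-Q8_0.gguf",
    embedding=True,
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_LAST,  # pool all tokens down to one vector
)

emb = llm.embed("this is just checking")
print(len(emb))  # 1024: a single sentence embedding rather than one vector per token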

Which GGUFs should one use? Convert their own?

That was forever ago; disregard. That said, I would benchmark the GGUF speed. I recently tested the embeddinggemma GGUF, and on my 3090 llama.cpp was 10x slower than running the ONNX model on my CPU. You're probably better off using bitsandbytes or torch if you want it quantized. Also, on my own personal retrieval benchmark, embeddinggemma beats Qwen3 0.6B by a healthy margin and also Qwen3 4B (barely). YMMV.
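If anyone wants to try the ONNX route, a minimal sketch via sentence-transformers (assuming a version with the ONNX backend and the onnx extra installed; model id assumed to be google/embeddinggemma-300m):

from sentence_transformers import SentenceTransformer

# ONNX backend runs through onnxruntime; install with: pip install "sentence-transformers[onnx]"
model = SentenceTransformer("google/embeddinggemma-300m", backend="onnx")
emb = model.encode("this is just checking")
print(emb.shape)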

Oh, thank you for your input. I am selecting an embedding model to play with locally on some data I have. I wanted to use Qwen3 0.6B, but I guess I'm going to switch to embeddinggemma. I really have no idea what to use for inference; I just have a GTX 1050. Can you recommend anything for inference if I just want it to work and be fast? I might not even need it to be quantized.

That was just for retrieval, but MTEB benchmarks a lot of other stuff: clustering, classification, etc. So depending on your task, Qwen may still be better. But embeddinggemma is amazingly good for retrieval.
