Getting token embeddings instead of sentence embeddings

#8
by cicada330117

Hi all,

I'm trying to load and use this model via llama-cpp-python. I've downloaded the quantized checkpoint and tried running it on my system:

from llama_cpp import Llama

# load the quantized checkpoint in embedding mode
gguf_embed = Llama(
    model_path="./models/embedding_model_qwen3_gguf/Qwen3-Embedding-0.6B-Q8_0.gguf",
    embedding=True,
)

gembed = gguf_embed.embed("this is just checking")

Here, len(gembed) comes out as 4 (the number of tokens in the input), and len(gembed[0]) is 1024 (the embedding dimension).

Am I missing something? We should be getting a single sentence embedding as output, right?
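A rough manual workaround over the per-token output above, assuming last-token pooling (which the replies below point to as the right scheme for this model):

import numpy as np

token_embs = np.asarray(gembed)  # shape: (n_tokens, 1024), one row per token
sentence_emb = token_embs[-1]  # last-token pooling
sentence_emb = sentence_emb / np.linalg.norm(sentence_emb)  # L2-normalize for cosine similarity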

This is not the case when I use the base model with sentence-transformers.
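For reference, the sentence-transformers side is roughly this, and it returns one pooled vector per sentence (assuming the base repo is Qwen/Qwen3-Embedding-0.6B):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
emb = model.encode("this is just checking")
print(emb.shape)  # (1024,) - one pooled sentence embedding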

Thanks in advance.

a) run this model with last_pooling

b) don't use these GGUFs, the tokenizer is broken
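For option (a), a minimal sketch with llama-cpp-python, assuming a version recent enough to expose the pooling_type argument (llama-server has the matching --pooling last flag):

import llama_cpp

llm = llama_cpp.Llama(
    model_path="./models/embedding_model_qwen3_gguf/Qwen3-Embedding-0.6B-Q8_0.gguf",
    embedding=True,
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_LAST,  # pool all tokens down to one vector
)

emb = llm.embed("this is just checking")
print(len(emb))  # 1024: a single sentence embedding rather than one vector per token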

Which GGUFs should one use? Convert their own?

That was forever ago; disregard. That said, I would benchmark the GGUF speed. I recently tested the embeddinggemma GGUF, and on my 3090 llama.cpp was 10x slower than running the ONNX model on my CPU. You're probably better off using bitsandbytes or torch if you want it quantized. Also, on my own personal retrieval benchmark, embeddinggemma beats Qwen3 0.6B by a healthy margin and also Qwen3 4B (barely). YMMV.
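If anyone wants to try the ONNX route, a minimal sketch via sentence-transformers (assuming a version with the ONNX backend and the onnx extra installed; model id assumed to be google/embeddinggemma-300m):

from sentence_transformers import SentenceTransformer

# ONNX backend runs through onnxruntime; install with: pip install "sentence-transformers[onnx]"
model = SentenceTransformer("google/embeddinggemma-300m", backend="onnx")
emb = model.encode("this is just checking")
print(emb.shape)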

Oh, thank you for your input. I am selecting an embedding model to play with locally on some data I have. I wanted to use Qwen3 0.6B, but I guess I'm going to switch to embeddinggemma. I really have no idea what to use for inference; I just have a GTX 1050. Can you recommend anything for inference if I just want it to work and be fast? I might not even need it to be quantized.

That was just for retrieval, but MTEB benchmarks a lot of other stuff: clustering, classification, etc. So depending on your task, Qwen may still be better. But embeddinggemma is amazingly good for retrieval.
