max_seq_length

by yjoonjang - opened Dec 5, 2024

Dec 5, 2024

What is the max_seq_length of this model?
https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0#using-huggingface-transformers
the large model code says max_length=512,
https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0#using-huggingface-transformers
but the medium model code says max_length=8192.

Is it right that their max_seq_lengths are different?

tomaarsen

Dec 5, 2024

I believe this is correct, it's based on the maximum sequence lengths of the respective base models.

yjoonjang

Dec 5, 2024

I believe this is correct, it's based on the maximum sequence lengths of the respective base models.

You mean 512?

tomaarsen

Dec 5, 2024

edited Dec 5, 2024

Yes, large should have a maximum sequence length of 512 tokens, and medium a maximum sequence length of 8192. Folks from Snowflake should be able to confirm.

yjoonjang

Dec 5, 2024

But when I run the following code:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")
print(model.max_seq_length)

I get 8192.

tomaarsen

Dec 5, 2024

Oh, you're right. That's due to https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0/blob/main/tokenizer_config.json#L50
cc @spacemanidol @pxyu I'm pretty sure it's not possible for an XLM-RoBERTa finetune to exceed 512 tokens unless you've extended the positional embedding matrix.
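To make the positional embedding limit concrete, here is a minimal sketch of the arithmetic, assuming the RoBERTa-style convention where real tokens get position ids starting at padding_idx + 1. The helper name max_usable_tokens is hypothetical, and the value 8194 for an 8k-extended checkpoint is an assumption to check against the model's config.json rather than something confirmed in this thread.

```python
def max_usable_tokens(max_position_embeddings: int, padding_idx: int = 1) -> int:
    """Longest input (in tokens) that a RoBERTa-style position table can index.

    Hypothetical helper for illustration; not part of any library.
    """
    # RoBERTa-style models assign real tokens the position ids
    # padding_idx + 1, padding_idx + 2, ..., so the first padding_idx + 1
    # rows of the position embedding table are never used by real tokens.
    return max_position_embeddings - padding_idx - 1


# Stock XLM-RoBERTa ships with max_position_embeddings = 514.
print(max_usable_tokens(514))   # 512

# Assumed value for a checkpoint whose position table was extended to 8k.
print(max_usable_tokens(8194))  # 8192
```

If the table really has been extended, the 8192 reported by the tokenizer config would be consistent with the model rather than an error.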

tomaarsen

Dec 5, 2024

Never mind, it looks like they can actually process ~6k tokens. This is the shape of the token embeddings for 2 queries: torch.Size([2, 6005, 1024]). Perhaps the max. sequence length is actually 8192. Apologies for the confusion, I'll let the Snowflake team answer.

lukemerrick

Dec 5, 2024

edited Dec 5, 2024

Both models handle 8192 tokens. We use the adjusted version of XLM-R provided by the BGE team (BAAI/bge-m3-retromae), which has been extended for 8k context support, so the normal XLM-R rules don't apply, haha. Let me get a fix in for the erroneous large model example code!

spacemanidol

Dec 6, 2024

Updated in the README, so closing.

spacemanidol changed discussion status to closed Dec 6, 2024
