question on max_seq_length

#1
by botkop - opened

Does this model have the same max_seq_length as LaBSE (256) or can you go beyond this?
Thank you.

The Minish Lab org

Hi, this model does not have a max_seq_length limit. It has static embeddings, so you can process documents of arbitrary length with it. If you want to do this, set `max_length` to `None`, e.g. `embeddings = model.encode(["Example sentence"], max_length=None)`, and it will process your input at whatever length it is.
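
For reference, here is a minimal sketch of what that looks like end to end, assuming the `model2vec` library; the model id below is a placeholder, not necessarily this repo:

```python
# Minimal sketch, assuming the model2vec library; the model id is a placeholder.
from model2vec import StaticModel

# Load a static-embedding model (swap in the actual repo id you are using).
model = StaticModel.from_pretrained("minishlab/M2V_multilingual_output")

# max_length=None disables truncation, so arbitrarily long inputs are encoded.
embeddings = model.encode(["Example sentence"], max_length=None)
print(embeddings.shape)  # (1, embedding_dim)
```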

Thank you for the quick response.
Will this affect the quality of the embedding?

The Minish Lab org

That's hard to say; we have not yet run extensive experiments on long documents. Most of our benchmarks (MTEB) used documents shorter than 512 tokens. We plan to experiment with this in the future.

It does affect the quality. Very long input texts with millions of tokens lead to almost useless embeddings (as with normal models, the longer the input, the poorer the quality). I wrote a bit about this in the comments section here: https://www.linkedin.com/posts/dominik-weckm%C3%BCller_from-days-to-seconds-creating-embeddings-activity-7255095750496321537-WwI2?utm_source=share&utm_medium=member_desktop. I will write up my findings soon.
