I'm getting a max 512 tokens error for BAAI/bge-large-en-v1.5. What needs to be changed here?

#22 opened by michael-newsrx-com

I'm trying to compute embeddings with a dedicated inference endpoint, using the following code. The code is based on automatic_embedding_tei_inference_endpoints#inference-endpoints.

    from huggingface_hub import create_inference_endpoint, InferenceEndpointType

    ep = create_inference_endpoint(
        ep_name,  # endpoint name, defined elsewhere
        repository="BAAI/bge-large-en-v1.5",
        framework="pytorch",
        accelerator="gpu",
        instance_size="x1",
        instance_type="nvidia-l4",
        region="us-east-1",
        vendor="aws",
        min_replica=0,
        max_replica=1,
        task="sentence-embeddings",
        type=InferenceEndpointType.PROTECTED,
        namespace="newsrx",
        custom_image={
            "health_route": "/health",
            "url": "ghcr.io/huggingface/text-embeddings-inference:1.5.0",
            "env": {
                "MAX_BATCH_TOKENS": "16384",
                "MAX_CONCURRENT_REQUESTS": "512",
                "MODEL_ID": "/repository",
                "QUANTIZE": "eetq",
            },
        },
    )

The error is:

message: "Input validation error: `inputs` must have less than 512 tokens. Given: 785"
target: "text_embeddings_core::infer"
filename: "core/src/infer.rs"
line_number: 332
span: {"normalize":true,"prompt_name":"None","truncate":false,"truncation_direction":"Right","name":"embed_pooled"}
spans: [{"name":"embed"},{"normalize":true,"prompt_name":"None","truncate":false,"truncation_direction":"Right","name":"embed_pooled"}]

See also: https://github.com/huggingface/text-embeddings-inference/issues/356#issuecomment-2449867518

If you pass `truncate=True` in the payload, it will automatically truncate the input and you won't have this issue.
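
For example, a minimal sketch against TEI's `/embed` route (the endpoint URL and token below are placeholders for your own values):

    import requests

    API_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
    HEADERS = {"Authorization": "Bearer <HF_TOKEN>"}  # placeholder

    text = "a document that is longer than 512 tokens ..."

    # TEI's /embed route accepts a `truncate` flag; when true, over-long
    # inputs are cut to the model's max length instead of being rejected.
    resp = requests.post(
        f"{API_URL}/embed",
        headers=HEADERS,
        json={"inputs": [text], "truncate": True},
    )
    resp.raise_for_status()
    embedding = resp.json()[0]  # one vector per input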

A better chunking strategy might also be useful rather than simple truncation; a rough sketch is below.
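
For instance, something like this: split the document into overlapping token windows, embed each window, and mean-pool the results. This is only a sketch, assuming the `transformers` tokenizer for the same model and reusing the placeholder `API_URL`/`HEADERS` from above; the window size, overlap, and `long_document` variable are arbitrary illustrations:

    import numpy as np
    import requests
    from transformers import AutoTokenizer

    API_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
    HEADERS = {"Authorization": "Bearer <HF_TOKEN>"}  # placeholder

    tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")

    def chunk_text(text: str, max_tokens: int = 480, stride: int = 50) -> list[str]:
        # Tokenize once, then slice into overlapping windows that stay
        # under the 512-token limit (leaving room for special tokens).
        ids = tokenizer.encode(text, add_special_tokens=False)
        step = max_tokens - stride
        return [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), step)]

    chunks = chunk_text(long_document)  # long_document: your input text
    resp = requests.post(
        f"{API_URL}/embed",
        headers=HEADERS,
        json={"inputs": chunks, "truncate": True},
    )
    resp.raise_for_status()
    # Mean-pool the per-chunk embeddings into a single document vector.
    doc_embedding = np.asarray(resp.json()).mean(axis=0)

Whether mean-pooling is appropriate depends on the retrieval task; keeping the per-chunk vectors and searching over them individually is a common alternative.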
