I'm getting a max 512 tokens error for BAAI bge-large-en-v1.5. What needs to be changed here?
#22
by
michael-newsrx-com
- opened
I'm trying to implement embeddings with a dedicated inference endpoint, using the following code. The code is based on automatic_embedding_tei_inference_endpoints#inference-endpoints.
ep = create_inference_endpoint(
    ep_name,
    repository="BAAI/bge-large-en-v1.5",
    framework="pytorch",
    accelerator="gpu",
    instance_size="x1",
    instance_type="nvidia-l4",
    region="us-east-1",
    vendor="aws",
    min_replica=0,
    max_replica=1,
    task="sentence-embeddings",
    type=InferenceEndpointType.PROTECTED,
    namespace="newsrx",
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-embeddings-inference:1.5.0",
        "env": {
            "MAX_BATCH_TOKENS": "16384",
            "MAX_CONCURRENT_REQUESTS": "512",
            "MODEL_ID": "/repository",
            "QUANTIZE": "eetq",
        },
    },
)
The error is:
message: "Input validation error: `inputs` must have less than 512 tokens. Given: 785"
target: "text_embeddings_core::infer"
filename: "core/src/infer.rs"
line_number: 332
span: {"normalize":true,"prompt_name":"None","truncate":false,"truncation_direction":"Right","name":"embed_pooled"}
spans: [{"name":"embed"},{"normalize":true,"prompt_name":"None","truncate":false,"truncation_direction":"Right","name":"embed_pooled"}]
See also: https://github.com/huggingface/text-embeddings-inference/issues/356#issuecomment-2449867518
If you pass truncate=True in the payload, it will automatically truncate the input and you won't have this issue.
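A minimal sketch of what that could look like when calling the endpoint over plain HTTP. The endpoint URL and token are placeholders, and the helper names `build_embed_payload` and `embed` are made up for illustration; the only part taken from the suggestion above is the `truncate` field in the JSON body, alongside the usual `inputs` field.

```python
import json
import urllib.request


def build_embed_payload(texts, truncate=True):
    """Build the JSON body for an embedding request.

    `truncate=True` asks the server to cut over-long inputs down to the
    model's maximum length instead of rejecting them.
    """
    return {"inputs": texts, "truncate": truncate}


def embed(endpoint_url, token, texts, truncate=True):
    """POST the payload to the endpoint and return the parsed JSON response.

    `endpoint_url` and `token` are placeholders: use the URL of your own
    endpoint and a token that is allowed to call it.
    """
    body = json.dumps(build_embed_payload(texts, truncate)).encode("utf-8")
    request = urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Note that truncation is silent: anything past the limit is simply not embedded, so for a 785-token input roughly the last third of the text would be ignored.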
A better chunking strategy might also be useful rather than simple truncation, since truncation discards everything past the 512-token limit.
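As a rough illustration of the chunking idea, here is a sketch that splits a list of token ids into overlapping windows that each fit under the limit. The function name `chunk_token_ids`, the 510-token window (assuming two slots are needed for special tokens) and the 50-token overlap are all assumptions chosen for the example, not values from the model card or from TEI.

```python
def chunk_token_ids(token_ids, max_tokens=510, overlap=50):
    """Split token ids into overlapping windows of at most `max_tokens`.

    Consecutive windows share `overlap` tokens so that text cut at a
    window boundary still appears with some context in the next window.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        # Stop once a window has reached the end of the input.
        if start + max_tokens >= len(token_ids):
            break
    return chunks


# Example with the 785-token input from the error message above
# (plain integers stand in for real token ids).
chunks = chunk_token_ids(list(range(785)))
print([len(chunk) for chunk in chunks])  # [510, 325]
```

In practice the token ids would come from the model's own tokenizer, each chunk would be decoded back to text and embedded separately, and the resulting vectors would either be stored per chunk or combined (for example by averaging). Which of those works best depends on the retrieval setup.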