Deployment on HF Text Embedding Router Fails
I have tried to use the model with Text Embeddings Inference (TEI) by Hugging Face, but after the model loads it fails on a T4 GPU with the following error. Any ideas?
2024-08-14T07:56:19.842147Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:20: Starting download
2024-08-14T07:56:19.842217Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:37: Model artifacts downloaded in 71.018µs
2024-08-14T07:56:19.850843Z INFO text_embeddings_router: router/src/lib.rs:169: Maximum number of tokens per request: 8192
2024-08-14T07:56:19.851030Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:23: Starting 4 tokenization workers
2024-08-14T07:56:19.858941Z INFO text_embeddings_router: router/src/lib.rs:194: Starting model backend
2024-08-14T07:56:20.476252Z INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:167: Starting JinaBertModel model on Cuda(CudaDevice(DeviceId(1)))
Error: Could not create backend
Caused by:
Could not start backend: cannot find tensor encoder.layer.0.mlp.gated_layers.weight
On a hosted TEI deployment (online), it gives the following error:
{"timestamp":"2024-08-14T08:04:17.082662Z","level":"INFO","message":"Args { model_id: "/rep****ory", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "r-stefanraab-german-semantic-v3-znq-ffyjb6zd-101c8-dneb1", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/repository/cache"), payload_limit: 2000000, api_key: None, json_output: true, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }","target":"text_embeddings_router","filename":"router/src/main.rs","line_number":175}
{"timestamp":"2024-08-14T08:04:17.095519Z","level":"INFO","message":"Maximum number of tokens per request: 8192","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":199}
{"timestamp":"2024-08-14T08:04:17.095687Z","level":"INFO","message":"Starting 2 tokenization workers","target":"text_embeddings_core::tokenization","filename":"core/src/tokenization.rs","line_number":26}
{"timestamp":"2024-08-14T08:04:17.109235Z","level":"INFO","message":"Starting model backend","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":250}
{"timestamp":"2024-08-14T08:04:17.296077Z","level":"INFO","message":"Starting Bert model on Cuda(CudaDevice(DeviceId(1)))","target":"text_embeddings_backend_candle","filename":"backends/candle/src/lib.rs","line_number":268}
Error: Could not create backend
Caused by:
Could not start backend: Bert only supports absolute position embeddings
Hi! Ah, I see it says "Bert only supports absolute position embeddings". ALiBi is basically relative. Have you passed trust_remote_code=True anywhere? If so, maybe change it to false, or vice versa.
If your inputs are longer than 512 tokens, quality might decrease slightly, I guess.
Otherwise, open an issue on TEI to request ALiBi support; in the meantime, a local fallback with plain transformers is sketched below.
This might help you; they use the same code as I do for ALiBi / positional embeddings:
https://github.com/huggingface/text-embeddings-inference/pull/292
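For reference, here is a minimal sketch of serving the embeddings outside TEI with plain transformers, which is where trust_remote_code=True matters. The model ID is a placeholder, and mean pooling plus a standard last_hidden_state output are assumptions; check the model card for the pooling the model was actually trained with.

```python
# Hedged sketch: compute embeddings with transformers while TEI lacks ALiBi support.
# "your-org/your-german-semantic-model" is a placeholder, not the real repo ID.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "your-org/your-german-semantic-model"  # placeholder

# trust_remote_code=True lets the repo's custom JinaBert/ALiBi modeling code run locally.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # Assumed mean pooling over non-padding tokens; verify against the model card.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    summed = (out.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1).clamp(min=1e-9)

print(embed(["Ein Beispielsatz.", "Noch ein Satz."]).shape)
```

This is only a stopgap for batch or low-traffic use; for production throughput you would still want TEI (or another server) once ALiBi models are supported.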