Hosting NuExtract on the vLLM inference engine
#16
by abhinavjain - opened
I am trying to host NuExtract on the vLLM inference engine, but when initializing the engine I get the following error:

ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (12112). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
gpu_memory_utilization is set to its default of 0.90, and I am running on a g5.2xlarge EC2 instance with 24 GB of VRAM.
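For context, the error suggests the fix itself: either raise gpu_memory_utilization or cap max_model_len below the model's full 131072-token context so the KV cache fits in the 24 GB of VRAM. A minimal sketch of what that looks like with vLLM's Python API (the model ID and the 8192 limit are assumptions; pick whatever context length your extraction inputs actually need):

```python
from vllm import LLM

# Cap the context window so the KV cache fits in 24 GB of VRAM.
# 8192 is an assumed value; any max_model_len <= the KV-cache capacity
# reported in the error (12112 tokens at the default settings) will work.
llm = LLM(
    model="numind/NuExtract",      # assumed model ID
    max_model_len=8192,            # instead of the default 131072
    gpu_memory_utilization=0.95,   # optionally raise from the 0.90 default
)
```

The same two knobs exist as `--max-model-len` and `--gpu-memory-utilization` flags when launching the OpenAI-compatible server. Lowering max_model_len is usually the safer lever on a single 24 GB GPU, since pushing gpu_memory_utilization close to 1.0 leaves little headroom for activations.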
Here are the model loading logs: