Hosting NuExtract on the vLLM inference engine

#16
by abhinavjain - opened

I am trying to host NuExtract on the vLLM inference engine, but I am getting the following error:

ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (12112). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

gpu_memory_utilization is at its default of 0.90, and I am using a g5.2xlarge EC2 instance with 24 GB of VRAM.
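
Following the error message's suggestion, here is a minimal sketch of how these parameters can be overridden when initializing the engine via vLLM's Python LLM API (the numind/NuExtract repo id and the max_model_len value of 8192 are assumptions on my part, not tuned recommendations):

```python
from vllm import LLM, SamplingParams

# Cap the context length so the KV cache fits in the available VRAM.
# 8192 is an example value; it must be <= the number of tokens the
# KV cache can actually hold on this GPU.
llm = LLM(
    model="numind/NuExtract",       # assumed Hugging Face repo id
    max_model_len=8192,             # lower than the model's 131072 default
    gpu_memory_utilization=0.95,    # optionally raise from the 0.90 default
)

outputs = llm.generate(["example prompt"], SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```

If you are serving through the OpenAI-compatible entrypoint instead, the same knobs are exposed as the --max-model-len and --gpu-memory-utilization CLI flags.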

Here are the model loading logs:

[screenshot of model loading logs]
