Hosting NuExtract on the vLLM inference engine
#16
by abhinavjain - opened
I am trying to host NuExtract on the vLLM inference engine, but when initializing the engine I get the following error:

ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (12112). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
gpu_memory_utilization is set to its default of 0.90, and I am running on a g5.2xlarge EC2 instance with 24 GB of VRAM.
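For context, the error suggests the fix itself: either raise gpu_memory_utilization or cap max_model_len below the model's full 131072-token context so the KV cache fits in the 24 GB of VRAM. A minimal sketch of what that looks like with vLLM's Python API (the model ID and the 8192 limit are assumptions; pick whatever context length your extraction inputs actually need):

```python
from vllm import LLM

# Cap the context window so the KV cache fits in 24 GB of VRAM.
# 8192 is an assumed value; any max_model_len <= the KV-cache capacity
# reported in the error (12112 tokens at the default settings) will work.
llm = LLM(
    model="numind/NuExtract",      # assumed model ID
    max_model_len=8192,            # instead of the default 131072
    gpu_memory_utilization=0.95,   # optionally raise from the 0.90 default
)
```

The same two knobs exist as `--max-model-len` and `--gpu-memory-utilization` flags when launching the OpenAI-compatible server. Lowering max_model_len is usually the safer lever on a single 24 GB GPU, since pushing gpu_memory_utilization close to 1.0 leaves little headroom for activations.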
Here are the model loading logs: