Suggested vllm options

#1
by vlzft - opened

Hi! With a single H100, vllm runs out of GPU memory when serving this model with a plain `vllm serve cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic`.

What are the suggested options to use? The description says this model should be optimized for running on a single H100.

Cortecs org

Hi! Apologies for the confusion, and thank you for bringing this to our attention. The model is indeed optimized for a single H100, but two parameters were missing from the instructions. Please pass `--max-model-len 9000` and `--gpu-memory-utilization 0.95` when serving the model with vllm. We've updated the model card to include these details. Let us know if you have any further questions!
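For reference, a sketch of the full serve command with both parameters applied (flag spellings follow the vLLM CLI; adjust the context length if your workload needs more or less):

```shell
# Serve the FP8-dynamic 70B model on a single H100 (80 GB).
# --max-model-len caps the context window so the KV cache fits in memory;
# --gpu-memory-utilization lets vLLM use up to 95% of the GPU's memory.
vllm serve cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic \
  --max-model-len 9000 \
  --gpu-memory-utilization 0.95
```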
