Suggested vllm options

#1
by vlzft - opened

Hi! With a single H100, vllm runs out of GPU memory when serving this model with a plain `vllm serve cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic`.

What are the suggested options to use? The description says this model should be optimized for running on a single H100.

Cortecs org

Hi! Apologies for the confusion, and thank you for bringing this to our attention. The model is indeed optimized for a single H100, but two parameters were missing from the instructions. Please pass `--max-model-len 9000` and `--gpu-memory-utilization 0.95` when serving the model with vllm. We've updated the model card to include these details. Let us know if you have any further questions!
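For reference, a sketch of the full serve command with both parameters applied (flag spellings follow the vLLM CLI; adjust the context length if your workload needs more or less):

```shell
# Serve the FP8-dynamic 70B model on a single H100 (80 GB).
# --max-model-len caps the context window so the KV cache fits in memory;
# --gpu-memory-utilization lets vLLM use up to 95% of the GPU's memory.
vllm serve cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic \
  --max-model-len 9000 \
  --gpu-memory-utilization 0.95
```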
