Is passing --max-model-len expected to work?

#7
by bakbeest - opened

Is --max-model-len still expected to work with vLLM when using --tokenizer_mode mistral --config_format mistral --load_format mistral? It seems not: I'm OOMing at context lengths that I should be able to run.

I'm just GPU-poor, trying to run this AWQ quant on 4x 4090s, and I can't load the full context length. I can run the model with a decent context length if I drop the mistral flags, but then tool calling doesn't work.
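
For reference, this is roughly the kind of invocation I mean (the model path, context length, and memory fraction below are placeholders, not my exact values):

```bash
# Serve the AWQ quant across 4 GPUs with the mistral flags,
# while capping the context length via --max-model-len.
vllm serve <awq-model-path> \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90
```

With the same --max-model-len but without the three mistral flags, the model loads fine; with them, it OOMs.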
