Is passing --max-model-len expected to work?

#7
by bakbeest - opened

Is --max-model-len still expected to work with vLLM when using --tokenizer_mode mistral --config_format mistral --load_format mistral? It seems not: I'm OOMing at context lengths that I should be able to run.

I'm just GPU-poor, trying to run this AWQ quant on 4x 4090s, and I can't load the full context length. I can run the model with a decent context length if I drop the mistral flags, but then tool calling doesn't work.
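
For reference, this is roughly the kind of invocation I mean (the model path, context length, and memory fraction below are placeholders, not my exact values):

```bash
# Serve the AWQ quant across 4 GPUs with the mistral flags,
# while capping the context length via --max-model-len.
vllm serve <awq-model-path> \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90
```

With the same --max-model-len but without the three mistral flags, the model loads fine; with them, it OOMs.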
