What engine should be used to infer this model?

#1
by RobertLiu0905 - opened

Thank you for your contribution. My question is: what engine should be used to run inference on this model?

I'm wondering whether this model was quantized with https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/deepseek_moe_w4a16.py. Could you share any quantization details?
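For reference, my understanding is that the linked script follows roughly this shape (a sketch only; the base model ID, calibration dataset, and ignore patterns below are my guesses, not confirmed details):

```python
# Sketch of a W4A16 MoE quantization flow with llm-compressor (GPTQ).
# MODEL_ID, the calibration dataset, and the ignore list are assumptions;
# the linked deepseek_moe_w4a16.py is the authoritative version.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "deepseek-ai/DeepSeek-V2.5"  # assumed base model
SAVE_DIR = "DeepSeek-V2.5-W4A16"        # assumed output path

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# 4-bit weights, 16-bit activations; keep lm_head and the MoE router
# gates unquantized (assumed ignore patterns).
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

# One-shot calibration pass; dataset name and sample count are assumptions.
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```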

How do I run this model with vLLM? Could you give some tips or examples?
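For anyone else searching, the basic offline-inference pattern I would expect looks something like this (the local path, tensor-parallel size, and context length are placeholders to adjust for your checkpoint and hardware):

```python
# Minimal vLLM offline-inference sketch for the quantized checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/DeepSeek-V2.5-W4A16",  # assumed local path
    trust_remote_code=True,
    tensor_parallel_size=8,  # adjust for your GPU count
    max_model_len=4096,
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain mixture-of-experts in one paragraph."], params)
print(outputs[0].outputs[0].text)
```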

NM Testing org
edited Oct 31

Why does the quantized model run significantly slower?

NM Testing org

How are you running it?

> How are you running it?

After deepseek_moe_w4a16.py finishes, you get an INT4 model of roughly 112 GB. Then run it with vLLM 0.6; I failed with version 0.5.4, so skip that one. See https://github.com/vllm-project/llm-compressor/issues/857
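If you would rather hit it through vLLM's OpenAI-compatible server instead of offline inference, a query sketch looks roughly like this (the endpoint, port, and model path are assumptions):

```python
# Query a vLLM OpenAI-compatible server from Python. Assumes the server
# was started with something like:
#   vllm serve /models/DeepSeek-V2.5-W4A16 --trust-remote-code --tensor-parallel-size 8
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/models/DeepSeek-V2.5-W4A16",  # must match the served model name
    messages=[{"role": "user", "content": "What engine should I use?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```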
