In my testing, inference with this AWQ model is about five times slower than with ExLlama loading TheBloke_Mistral-7B-Instruct-v0.1-GPTQ_gptq-4bit-32g-actorder_True.
Is that to be expected?
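For reference, here is a minimal sketch of how I measured tokens/second on the AWQ side, assuming the AutoAWQ backend and that this page's weights load as `TheBloke/Mistral-7B-Instruct-v0.1-AWQ` (adjust the path and prompt to your setup; the GPTQ number came from running the equivalent generation under ExLlama in text-generation-webui):

```python
import time
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Assumed model path for this page's AWQ weights.
AWQ_PATH = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"

tokenizer = AutoTokenizer.from_pretrained(AWQ_PATH)
model = AutoAWQForCausalLM.from_quantized(AWQ_PATH, fuse_layers=True)

prompt = "[INST] Explain quantization in one paragraph. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Warm-up pass so one-time kernel setup is not included in the timing.
model.generate(**inputs, max_new_tokens=16)

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Tokens generated beyond the prompt, divided by wall-clock time.
new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```

Running the same prompt and `max_new_tokens` through the GPTQ model under ExLlama gives the comparison figure.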