Inference takes more than 10 min

#38 by shravanveldurthi

I am following the same code provided here for TheBloke/Llama-2-70B-chat-GPTQ (roughly the snippet sketched further below). If I run it on Google Colab (CUDA version 12.0, torch 2.0.1+cu118), I get a summarization in less than 40 seconds. However, on a Linux VM (CUDA version 12.2, torch 2.0.1+cu117), I get the following warnings:

  1. CUDA extension not installed.
  2. skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.

With these warnings, it takes more than 500 seconds to produce a summarization.
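For reference, this is roughly the model-card style loading and inference code I am running on both machines (the generation parameters here are placeholders, not my exact values):

```python
# Roughly the model-card style code I am running on both machines.
# Generation parameters are placeholders, not my exact values.
from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-70B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# use_triton=False means auto-gptq relies on its compiled CUDA extension for
# fast kernels; without it, inference falls back to a much slower path.
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,
    quantize_config=None,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

prompt = "Summarize the following text:\n..."
print(pipe(prompt)[0]["generated_text"])
```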

How can I fix this?
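In case it helps, here is a quick environment check I can run on the Linux VM. The `autogptq_cuda_*` module names are my assumption about how the compiled kernels are exposed and may differ between auto-gptq versions:

```python
# Quick check of the Linux VM environment. The autogptq_cuda_* names below are
# my assumption about the compiled kernel modules; they may vary by version.
import importlib.util
import torch

print("torch:", torch.__version__)                 # 2.0.1+cu117 on the VM
print("torch built against CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

for mod in ("autogptq_cuda_64", "autogptq_cuda_256"):
    found = importlib.util.find_spec(mod) is not None
    print(mod, "found" if found else "NOT found")
```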
