Any tips to speed up inference?

#35
by LinoHong

Hi all,
and thanks Microsoft for this amazing model!

I'm using the Hugging Face pipeline to run inference with Phi-4.
However, it feels really slow for a 14B model.

I'm using 4 Tesla V100 GPUs (32 GB each) to distribute the model at inference time.
Is there a quick and easy way to speed up inference?
It would be awesome if it were just a parameter in the pipeline function.

Thanks!
Lino Hong.

Using llama.cpp is an excellent way to speed up inference for large language models like Phi-4, especially if you want to run the model efficiently on CPUs or even GPUs with minimal overhead. llama.cpp is optimized for inference and supports quantization, which can significantly reduce the model size and improve speed without a large drop in accuracy.

Here’s how you can use llama.cpp with your Phi-4 model:


1. Convert the Model to GGUF Format

llama.cpp uses the GGUF format for models. You need to convert your Hugging Face model to this format.

Steps:

  1. Clone the llama.cpp repository:

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    
  2. Build llama.cpp (the stock make gives a CPU-only build; see the CUDA build sketch after this list if you want GPU offload later):

    make
    
  3. Convert the Hugging Face model to GGUF format:

    • First, install the converter's Python dependencies (run from the llama.cpp directory):
      pip install -r requirements.txt
      
    • Use the convert-hf-to-gguf.py script provided in llama.cpp (named convert_hf_to_gguf.py in newer checkouts); the model directory is passed as a positional argument:
      python3 convert-hf-to-gguf.py /path/to/phi-4 --outfile /path/to/phi-4-gguf
      
      Replace /path/to/phi-4 with the path to your local Hugging Face model directory and /path/to/phi-4-gguf with the desired output file.
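
A note on the build step above: since you have V100s, you will likely want a CUDA-enabled build so that the --n-gpu-layers option used later actually offloads work to the GPUs. The exact flag depends on your llama.cpp version; a sketch of both variants:

# Older Makefile-based builds:
make LLAMA_CUBLAS=1

# Newer CMake-based builds:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release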

2. Quantize the Model (Optional but Recommended)

Quantization reduces the model size and speeds up inference. llama.cpp supports several quantization levels (e.g., q4_0, q4_1, q5_0).

Steps:

  1. Run the quantize tool (./quantize, called ./llama-quantize in newer llama.cpp builds):
    ./quantize /path/to/phi-4-gguf /path/to/phi-4-gguf-q4_0 q4_0
    
    This creates a quantized version of the model at /path/to/phi-4-gguf-q4_0.
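
If you are unsure which level to pick, you can quantize to a couple of levels and compare the resulting file sizes; higher-bit levels such as q5_0 are larger but usually closer to the original model's quality. A sketch reusing the paths from this guide:

# Produce two quantization levels and compare on-disk sizes
./quantize /path/to/phi-4-gguf /path/to/phi-4-gguf-q4_0 q4_0
./quantize /path/to/phi-4-gguf /path/to/phi-4-gguf-q5_0 q5_0
ls -lh /path/to/phi-4-gguf*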

3. Run Inference with llama.cpp

Once the model is converted and optionally quantized, you can run inference using llama.cpp.

Steps:

  1. Run the main executable (./main, called ./llama-cli in newer llama.cpp builds):

    ./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time"
    
    • -m: Path to the GGUF model.
    • -p: Prompt for inference.
  2. For GPU acceleration (if supported), use the --n-gpu-layers flag:

    ./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time" --n-gpu-layers 20
    

    Replace 20 with the number of layers you want to offload to the GPU.
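
Generation length and context size also have a direct impact on latency. A sketch using the main binary's -n (maximum tokens to generate) and -c (context size) options, with the same model path as above:

# Cap generation at 128 tokens and use a 2048-token context window
./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time" -n 128 -c 2048 --n-gpu-layers 20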


4. Advanced Options

  • Batch Size: Use the -b flag to set the batch size.
  • Threads: Use the -t flag to specify the number of CPU threads.
  • Temperature and Top-p Sampling: Use --temp and --top-p for better control over text generation.

Example:

./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time" --temp 0.7 --top-p 0.9 -t 8 -b 512 --n-gpu-layers 20

5. Benchmarking

To measure performance, use the llama-bench tool that is built alongside the main binary:

./llama-bench -m /path/to/phi-4-gguf-q4_0
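
To compare CPU-only inference against GPU offload, llama-bench accepts -p (prompt tokens), -n (generated tokens), and -ngl (offloaded layers); a sketch:

# Throughput with no offload vs. 20 layers offloaded to the GPUs
./llama-bench -m /path/to/phi-4-gguf-q4_0 -p 512 -n 128 -ngl 0
./llama-bench -m /path/to/phi-4-gguf-q4_0 -p 512 -n 128 -ngl 20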

6. Using llama.cpp in Python

If you prefer to use llama.cpp in a Python script, you can use the llama-cpp-python package.

Steps:

  1. Install the package:

    pip install llama-cpp-python
    
  2. Load and run the model:

    from llama_cpp import Llama
    
    # Load the model
    llm = Llama(model_path="/path/to/phi-4-gguf-q4_0", n_gpu_layers=20)
    
    # Run inference
    output = llm("Once upon a time", max_tokens=50)
    print(output["choices"][0]["text"])
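
If you want chat-style prompts or streaming from Python, llama-cpp-python also provides create_chat_completion and a stream=True option. Note that n_gpu_layers only has an effect if the package was installed with GPU support (the project's README documents a CUDA-enabled install via CMAKE_ARGS). A minimal sketch, reusing the quantized model path from above:

from llama_cpp import Llama

# Same quantized GGUF as in the example above
llm = Llama(model_path="/path/to/phi-4-gguf-q4_0", n_gpu_layers=20, n_ctx=2048)

# Chat-style request
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three tips for faster LLM inference."}],
    max_tokens=128,
)
print(resp["choices"][0]["message"]["content"])

# Streaming: chunks are yielded as tokens are generated
for chunk in llm("Once upon a time", max_tokens=50, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)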
    

Benefits of Using llama.cpp

  • Efficiency: Optimized for CPU and GPU inference.
  • Quantization: Reduces model size and speeds up inference.
  • Portability: Runs on a wide range of hardware, including CPUs and GPUs.
  • Minimal Dependencies: Lightweight and easy to set up.

By using llama.cpp, you can achieve faster inference times and lower resource usage compared to running the model directly through Hugging Face Transformers.

@JLouisBiz
Oh my gosh
this is perfect!!!!!!
Thanks a million!!! :D
