Any tips to speed up inference?
Hi all,
and thanks Microsoft for this amazing model!
I'm using the Hugging Face pipeline to run inference with Phi-4.
However, it feels really slow with the 14B model.
I'm using 4 Tesla V100 GPUs (32 GB each) to distribute the model at inference time.
Is there a quick and easy way to make inference faster?
It would be awesome if it's just some kind of parameter in the pipeline function.
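For reference, here's roughly how I'm loading the model at the moment (the dtype and `device_map` settings are just what I happen to use):

```python
import torch
from transformers import pipeline

# Shard the 14B model across the 4 V100s and generate with fp16
pipe = pipeline(
    "text-generation",
    model="microsoft/phi-4",
    torch_dtype=torch.float16,  # V100s don't support bfloat16
    device_map="auto",          # spread the weights over the available GPUs
)

print(pipe("Once upon a time", max_new_tokens=50)[0]["generated_text"])
```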
Thanks !
Lino Hong.
Using `llama.cpp` is an excellent way to speed up inference for large language models like Phi-4, especially if you want to run the model efficiently on CPUs or even GPUs with minimal overhead. `llama.cpp` is optimized for inference and supports quantization, which can significantly reduce the model size and improve speed without a large drop in accuracy.

Here's how you can use `llama.cpp` with your Phi-4 model:
1. Convert the Model to GGUF Format
`llama.cpp` uses the GGUF format for models. You need to convert your Hugging Face model to this format.
Steps:
- Clone the `llama.cpp` repository:

  ```bash
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  ```

- Build the project:

  ```bash
  make
  ```

- Convert the Hugging Face model to GGUF format:
  - First, install the required Python dependencies (the repository also ships a `requirements.txt` you can install from):

    ```bash
    pip install torch transformers
    ```

  - Use the `convert-hf-to-gguf.py` script provided in `llama.cpp`:

    ```bash
    python3 convert-hf-to-gguf.py /path/to/phi-4 --outfile /path/to/phi-4-gguf
    ```

    Replace `/path/to/phi-4` with the path to your Hugging Face model and `/path/to/phi-4-gguf` with the desired output path.
2. Quantize the Model (Optional but Recommended)
Quantization reduces the model size and speeds up inference. `llama.cpp` supports several quantization levels (e.g., `q4_0`, `q4_1`, `q5_0`, etc.).
Steps:
- Run the quantization tool (`quantize` in older builds, `llama-quantize` in recent ones):

  ```bash
  ./quantize /path/to/phi-4-gguf /path/to/phi-4-gguf-q4_0 q4_0
  ```

  This will create a quantized version of the model at `/path/to/phi-4-gguf-q4_0`.
3. Run Inference with llama.cpp
Once the model is converted and optionally quantized, you can run inference using `llama.cpp`.
Steps:
- Run the `main` executable (renamed to `llama-cli` in recent builds):

  ```bash
  ./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time"
  ```

  - `-m`: path to the GGUF model.
  - `-p`: prompt for inference.
- For GPU acceleration (requires a build of `llama.cpp` with CUDA support enabled), use the `--n-gpu-layers` flag:

  ```bash
  ./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time" --n-gpu-layers 20
  ```

  Replace `20` with the number of layers you want to offload to the GPU.
4. Advanced Options
- Batch size: use the `-b` flag to set the batch size.
- Threads: use the `-t` flag to specify the number of CPU threads.
- Temperature and top-p sampling: use `--temp` and `--top-p` for better control over text generation.
Example:
```bash
./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time" --temp 0.7 --top-p 0.9 -t 8 -b 512 --n-gpu-layers 20
```
5. Benchmarking
To measure performance, use the `llama-bench` tool that ships with `llama.cpp`:

```bash
./llama-bench -m /path/to/phi-4-gguf-q4_0
```
6. Using llama.cpp in Python
If you prefer to use `llama.cpp` in a Python script, you can use the `llama-cpp-python` package.
Steps:
- Install the package:

  ```bash
  pip install llama-cpp-python
  ```

- Load and run the model:

  ```python
  from llama_cpp import Llama

  # Load the model, offloading some layers to the GPU
  llm = Llama(model_path="/path/to/phi-4-gguf-q4_0", n_gpu_layers=20)

  # Run inference
  output = llm("Once upon a time", max_tokens=50)
  print(output["choices"][0]["text"])
  ```
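Since Phi-4 is an instruction-tuned model, the chat-style API usually works better than raw completion. Here's a rough sketch (the context size and sampling values below are just example assumptions, not recommendations):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/phi-4-gguf-q4_0",
    n_gpu_layers=20,  # layers to offload to the GPU
    n_ctx=4096,       # context window; example value
)

# Chat-style request, so the chat template stored in the GGUF (if any) is applied
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three tips to speed up LLM inference."}],
    max_tokens=200,
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])
```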
Benefits of Using llama.cpp
- Efficiency: Optimized for CPU and GPU inference.
- Quantization: Reduces model size and speeds up inference.
- Portability: Runs on a wide range of hardware, including CPUs and GPUs.
- Minimal Dependencies: Lightweight and easy to set up.
By using `llama.cpp`, you can achieve faster inference times and lower resource usage compared to running the model directly through Hugging Face Transformers.
@JLouisBiz
Oh my gosh
this is perfect!!!!!!
Thanks a million!!! :D