Any tips to speed up inference?
Hi all,
and thanks Microsoft for this amazing model!
I'm using the Hugging Face pipeline to run inference with Phi-4.
However, it feels really slow with the 14B model.
I'm using 4 Tesla V100 GPUs (32 GB each) to distribute the model at inference time.
Is there a quick and easy way to make inference faster?
It would be awesome if it's just some kind of parameter in the pipeline function.
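For reference, here's roughly how I'm loading the model at the moment (the dtype and `device_map` settings are just what I happen to use):

```python
import torch
from transformers import pipeline

# Shard the 14B model across the 4 V100s and generate with fp16
pipe = pipeline(
    "text-generation",
    model="microsoft/phi-4",
    torch_dtype=torch.float16,  # V100s don't support bfloat16
    device_map="auto",          # spread the weights over the available GPUs
)

print(pipe("Once upon a time", max_new_tokens=50)[0]["generated_text"])
```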
Thanks !
Lino Hong.
Using `llama.cpp` is an excellent way to speed up inference for large language models like Phi-4, especially if you want to run the model efficiently on CPUs or even GPUs with minimal overhead. `llama.cpp` is optimized for inference and supports quantization, which can significantly reduce the model size and improve speed without a large drop in accuracy.

Here's how you can use `llama.cpp` with your Phi-4 model:
1. Convert the Model to GGUF Format
`llama.cpp` uses the GGUF format for models. You need to convert your Hugging Face model to this format.
Steps:
- Clone the `llama.cpp` repository:

  ```bash
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  ```

- Build the project:

  ```bash
  make
  ```

- Convert the Hugging Face model to GGUF format:
  - First, install the required Python dependencies (the repository also ships a `requirements.txt` you can install from):

    ```bash
    pip install torch transformers
    ```

  - Use the `convert-hf-to-gguf.py` script provided in `llama.cpp`:

    ```bash
    python3 convert-hf-to-gguf.py /path/to/phi-4 --outfile /path/to/phi-4-gguf
    ```

    Replace `/path/to/phi-4` with the path to your Hugging Face model and `/path/to/phi-4-gguf` with the desired output path.
2. Quantize the Model (Optional but Recommended)
Quantization reduces the model size and speeds up inference. `llama.cpp` supports several quantization levels (e.g., `q4_0`, `q4_1`, `q5_0`, etc.).
Steps:
- Run the quantization tool (`quantize` in older builds, `llama-quantize` in recent ones):

  ```bash
  ./quantize /path/to/phi-4-gguf /path/to/phi-4-gguf-q4_0 q4_0
  ```

  This will create a quantized version of the model at `/path/to/phi-4-gguf-q4_0`.
3. Run Inference with llama.cpp
Once the model is converted and optionally quantized, you can run inference using `llama.cpp`.
Steps:
- Run the `main` executable (renamed to `llama-cli` in recent builds):

  ```bash
  ./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time"
  ```

  - `-m`: path to the GGUF model.
  - `-p`: prompt for inference.
- For GPU acceleration (requires a build of `llama.cpp` with CUDA support enabled), use the `--n-gpu-layers` flag:

  ```bash
  ./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time" --n-gpu-layers 20
  ```

  Replace `20` with the number of layers you want to offload to the GPU.
4. Advanced Options
- Batch size: use the `-b` flag to set the batch size.
- Threads: use the `-t` flag to specify the number of CPU threads.
- Temperature and top-p sampling: use `--temp` and `--top-p` for better control over text generation.
Example:
```bash
./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time" --temp 0.7 --top-p 0.9 -t 8 -b 512 --n-gpu-layers 20
```
5. Benchmarking
To measure performance, use the `llama-bench` tool that ships with `llama.cpp`:

```bash
./llama-bench -m /path/to/phi-4-gguf-q4_0
```
6. Using llama.cpp in Python
If you prefer to use `llama.cpp` in a Python script, you can use the `llama-cpp-python` package.
Steps:
- Install the package:

  ```bash
  pip install llama-cpp-python
  ```

- Load and run the model:

  ```python
  from llama_cpp import Llama

  # Load the model, offloading some layers to the GPU
  llm = Llama(model_path="/path/to/phi-4-gguf-q4_0", n_gpu_layers=20)

  # Run inference
  output = llm("Once upon a time", max_tokens=50)
  print(output["choices"][0]["text"])
  ```
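Since Phi-4 is an instruction-tuned model, the chat-style API usually works better than raw completion. Here's a rough sketch (the context size and sampling values below are just example assumptions, not recommendations):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/phi-4-gguf-q4_0",
    n_gpu_layers=20,  # layers to offload to the GPU
    n_ctx=4096,       # context window; example value
)

# Chat-style request, so the chat template stored in the GGUF (if any) is applied
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three tips to speed up LLM inference."}],
    max_tokens=200,
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])
```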
Benefits of Using llama.cpp
- Efficiency: Optimized for CPU and GPU inference.
- Quantization: Reduces model size and speeds up inference.
- Portability: Runs on a wide range of hardware, including CPUs and GPUs.
- Minimal Dependencies: Lightweight and easy to set up.
By using `llama.cpp`, you can achieve faster inference times and lower resource usage compared to running the model directly through Hugging Face Transformers.
@JLouisBiz
Oh my gosh
this is perfect!!!!!!
Thanks a million!!! :D