GGUF usage with llama.cpp
You can now deploy any llama.cpp-compatible GGUF on Hugging Face Endpoints; read more about it here.
llama.cpp lets you download and run inference on a GGUF by providing the Hugging Face repo path and the file name. It downloads the model checkpoint and caches it automatically. The cache location is set by the LLAMA_CACHE environment variable; read more about it here.
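As a minimal sketch of redirecting the cache, the snippet below exports LLAMA_CACHE before any llama.cpp command runs. The directory name llama-cache is only an illustrative assumption; any writable path should work.

```shell
# Hypothetical cache directory; any writable path works.
export LLAMA_CACHE="${HOME:-/tmp}/llama-cache"
mkdir -p "$LLAMA_CACHE"

# Commands started from this shell after the export see the new location.
echo "LLAMA_CACHE=$LLAMA_CACHE"
```

To make the setting persistent, the export line can go in your shell profile.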
You can install llama.cpp through brew (works on Mac and Linux), or you can build it from source. There are also pre-built binaries and Docker images that you can check in the official documentation.
Option 1: Install with brew
brew install llama.cpp
Option 2: Build from source
Step 1: Clone llama.cpp from GitHub.
git clone https://github.com/ggerganov/llama.cpp
Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag, along with any hardware-specific flags (for example, LLAMA_CUDA=1 for NVIDIA GPUs on Linux).
cd llama.cpp && LLAMA_CURL=1 make
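After either install option, a quick sanity check is to see whether the two binaries are reachable from your shell. The sketch below uses a small helper, have_tool, which is a hypothetical name introduced here for illustration. Note that a source build may leave the binaries inside the llama.cpp folder rather than on PATH, in which case "not found" is expected until you add that folder to PATH or call the binaries by path.

```shell
# have_tool is a hypothetical helper: succeeds if the command is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report where each llama.cpp binary lives, or that it is missing.
for tool in llama-cli llama-server; do
  if have_tool "$tool"; then
    echo "$tool: $(command -v "$tool")"
  else
    echo "$tool: not found on PATH"
  fi
done
```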
Once installed, you can use llama-cli or llama-server as follows:
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
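Note: Recent llama.cpp builds start llama-cli in conversation (chat) mode automatically when the model ships a chat template. Pass -no-cnv to run raw completion instead, or -cnv to force conversation mode.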
Additionally, you can serve an OpenAI-compatible chat completions endpoint directly with the llama.cpp server:
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
Once the server is running, you can query the endpoint as follows:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant. Your top priority is achieving user fulfilment via helping them with their requests."
      },
      {
        "role": "user",
        "content": "Write a limerick about Python exceptions"
      }
    ]
  }'
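The server replies with a JSON body in the OpenAI chat completions shape, where the generated text sits under choices[0].message.content. The sketch below pulls that field out, assuming python3 is available; extract_reply is a hypothetical helper name, and the sample response is a trimmed illustration rather than real server output. If you have jq installed, jq -r '.choices[0].message.content' does the same job.

```shell
# extract_reply is a hypothetical helper: it reads a chat completions
# response on stdin and prints the first choice's message content.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Illustrative, trimmed response body (not real server output).
sample_response='{"choices":[{"index":0,"message":{"role":"assistant","content":"Hello from llama.cpp"}}]}'

printf '%s' "$sample_response" | extract_reply
```

With a running server, you can add -s to the curl command above and pipe its output into extract_reply to print only the assistant's reply.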
Replace the value passed to -hf with any valid Hugging Face Hub repo name, and off you go! 🦙
Note: Remember to build llama.cpp with LLAMA_CURL=1 :)