base_model: NB-Llama-3.1-8B
tags:
- llama-cpp
- gguf
- quantization
NB-Llama-3.1-8B-Q4_K_M-GGUF
This model is a quantized version of the original NB-Llama-3.1-8B, converted into the GGUF format using llama.cpp. Quantization significantly reduces the model's memory footprint, enabling efficient inference on a wide range of hardware, including personal devices, without compromising too much quality. These quantized models are mainly provided so that people can test out the models with moderate hardware. If you want to benchmark the models or further finetune the models, we strongly recommend the non-quantized versions.
What is llama.cpp
?
llama.cpp
is a versatile tool for running large language models optimized for efficiency. It supports multiple quantization formats (e.g., GGML and GGUF) and provides inference capabilities on diverse hardware, including CPUs, GPUs, and mobile devices. The GGUF format is the latest evolution, designed to enhance compatibility and performance.
Benefits of This Model
- High Performance: Achieves similar quality to the original model while using significantly less memory.
- Hardware Compatibility: Optimized for running on a variety of hardware, including low-resource systems.
- Ease of Use: Seamlessly integrates with
llama.cpp
for fast and efficient inference.
Installation
Install llama.cpp
using Homebrew (works on Mac and Linux):
brew install llama.cpp
Usage Instructions
Using with llama.cpp
To use this quantized model with llama.cpp
, follow the steps below:
CLI:
llama-cli --hf-repo north/nb-llama-3.1-8B-Q4_K_M-GGUF --hf-file nb-llama-3.1-8b-q4_k_m.gguf -p "Your prompt here"
Server:
llama-server --hf-repo north/nb-llama-3.1-8B-Q4_K_M-GGUF --hf-file nb-llama-3.1-8b-q4_k_m.gguf -c 2048
For more information, refer to the llama.cpp repository.