---
base_model: "NB-Llama-3.2-1B-Instruct"
language:
- no  # Generic Norwegian
- nb  # Norwegian Bokmål
- nn  # Norwegian Nynorsk
- en  # English
- sv  # Swedish
- da  # Danish
tags:
- llama-cpp
- gguf
- quantization
- norwegian
- bokmål
- nynorsk
- swedish
- danish
- multilingual
- text-generation
pipeline_tag: text-generation
license: llama3.2

---

# NB-Llama-3.2-1B-Instruct-Q4_K_M-GGUF
This model is a **quantized** version of the original [NB-Llama-3.2-1B-Instruct](https://huggingface.co/NbAiLab/nb-llama-3.2-1B-Instruct), converted into the **GGUF format** using [llama.cpp](https://github.com/ggerganov/llama.cpp). Quantization significantly reduces the model's memory footprint, enabling efficient inference on a wide range of hardware, including personal devices, without compromising quality too much. These quantized models are provided mainly so that the models can be tried out on moderate hardware. For benchmarking or further fine-tuning, we strongly recommend using the non-quantized versions.
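
For reference, below is a minimal sketch of how a Q4_K_M GGUF file like this one is typically produced with llama.cpp's conversion and quantization tools. Script names and local paths are illustrative and have changed between llama.cpp versions, so treat this as an outline rather than the exact commands used for this repository.

```bash
# Illustrative llama.cpp quantization workflow (paths are placeholders).
# 1. Convert a local copy of the original Hugging Face checkpoint to a full-precision GGUF file.
python llama.cpp/convert_hf_to_gguf.py ./nb-llama-3.2-1B-Instruct \
    --outfile nb-llama-3.2-1b-instruct-f16.gguf --outtype f16

# 2. Quantize the GGUF file to Q4_K_M (the 4-bit "medium" K-quant mix used here).
llama-quantize nb-llama-3.2-1b-instruct-f16.gguf \
    nb-llama-3.2-1b-instruct-q4_k_m.gguf Q4_K_M
```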

## What is `llama.cpp`?
[`llama.cpp`](https://github.com/ggerganov/llama.cpp) is a versatile tool for running large language models, optimized for efficiency. It supports a range of quantization types in its model formats (the older GGML and the current GGUF) and provides inference on diverse hardware, including CPUs, GPUs, and mobile devices. GGUF is the latest evolution of the format, designed to improve compatibility and performance.

## Benefits of This Model
- **High Performance**: Achieves similar quality to the original model while using significantly less memory.
- **Hardware Compatibility**: Optimized for running on a variety of hardware, including low-resource systems.
- **Ease of Use**: Seamlessly integrates with `llama.cpp` for fast and efficient inference.

## Installation
Install `llama.cpp` using Homebrew (works on macOS and Linux):

```bash
brew install llama.cpp
```
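
If Homebrew is not available, `llama.cpp` can also be built from source with CMake. The following is a standard CPU-only build sketch; add backend-specific flags (CUDA, Metal, etc.) as needed for your hardware:

```bash
# Clone and build llama.cpp from source.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Binaries such as llama-cli and llama-server are placed under build/bin/
```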

## Usage Instructions

### Using with `llama.cpp`
To use this quantized model with `llama.cpp`, follow the steps below:

#### CLI:
```bash
llama-cli --hf-repo NbAiLab/nb-llama-3.2-1B-Instruct-Q4_K_M-GGUF --hf-file nb-llama-3.2-1b-instruct-q4_k_m.gguf -p "Your prompt here"
```
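
Alternatively, the GGUF file can be downloaded once and run from a local path. The example below assumes the `huggingface-cli` tool (from the `huggingface_hub` package) is installed, and uses common generation flags; the prompt and sampling values are only illustrative.

```bash
# Download the quantized file to the current directory (requires: pip install huggingface_hub).
huggingface-cli download NbAiLab/nb-llama-3.2-1B-Instruct-Q4_K_M-GGUF \
    nb-llama-3.2-1b-instruct-q4_k_m.gguf --local-dir .

# Run inference from the local file with a Norwegian prompt
# ("Skriv et kort dikt om fjordene." = "Write a short poem about the fjords.").
llama-cli -m nb-llama-3.2-1b-instruct-q4_k_m.gguf \
    -p "Skriv et kort dikt om fjordene." \
    -n 256 -c 2048 --temp 0.7
```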

#### Server:
```bash
llama-server --hf-repo NbAiLab/nb-llama-3.2-1B-Instruct-Q4_K_M-GGUF --hf-file nb-llama-3.2-1b-instruct-q4_k_m.gguf -c 2048
```
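
Once `llama-server` is running, it exposes an OpenAI-compatible HTTP API (on port 8080 by default), which can be queried with, for example, `curl`. A minimal example, assuming the default port:

```bash
# Query the running llama-server via its OpenAI-compatible chat endpoint.
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            {"role": "user", "content": "Hva er hovedstaden i Norge?"}
          ],
          "max_tokens": 128
        }'
```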

For more information, refer to the [llama.cpp repository](https://github.com/ggerganov/llama.cpp).

## Additional Resources
- [Original Model Card](https://huggingface.co/NbAiLab/nb-llama-3.2-1B-Instruct)
- [llama.cpp Repository](https://github.com/ggerganov/llama.cpp)
- [GGUF Format Documentation](https://huggingface.co/docs/transformers/main/en/model_doc/llama)

### Citing & Authors
The model was trained and the documentation written by Per Egil Kummervold.

### Funding and Acknowledgement
Training this model was supported by Google’s TPU Research Cloud (TRC), which generously supplied the Cloud TPUs essential for our computational needs.