---
language:
- en
author: Joseph717171 & froggeric (https://huggingface.co/datasets/froggeric/imatrix/edit/main/README.md)
---
# All credit for this Repo Card, which explains how computed imatrices compare and how the training datasets used to compute them differ in content and intended purpose, goes to [froggeric](https://huggingface.co/datasets/froggeric/imatrix).

# Note: All imatrices uploaded to this repo are pre-computed and ready to use in llama.cpp's quantization process.

# Note: Imatrices uploaded to this repo use the following naming convention: model-name_training-dataset.imatrix (the hyphens are used in this example purely to enhance readability...)

# Instructions: Download the imatrix for your chosen LLM (Large Language Model), and quantize to your preferred QuantType. (Note: the following example assumes you have already converted your model to GGUF.)

```
llama.cpp % ./quantize --imatrix path_to_imatrix path_to_model/ggml-model-f16.gguf model_name-QuantType.gguf QuantType
```
# Note: If you need detailed steps to convert your Large Language Model to GGUF, please scroll to the bottom of this page and check out the section: How to convert Supported LLMs (Large Language Models) to GGUF format

# Supplementary Learning: Training Datasets, Their Similarities and Differences, and How to Determine Which One Will Be Right for Computing Your Imatrix

# Input files for generating the Importance Matrix

## Which file to use for generating the importance matrix

Not all importance matrices are equal. The best results are obtained when using a source file similar to the
training data. Size also matters: the bigger the model (e.g. 70b vs 13b) and the higher the quant (e.g. q6_k vs iq3_xs),
the bigger the source file needs to be to make an impact. Multiple input files can be combined if needed;
for example:
```
cat technical.txt multilingual.txt wiki.txt >custom.matrix
```
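One caveat with plain `cat`: if an input file does not end with a newline, its last line is joined to the first line of the next file. The sketch below is one possible way to avoid that; `combine_inputs` is a hypothetical helper name, and the sample files are generated only for the demonstration. The result is still a plain text file to be used as input when generating the matrix.

```shell
# Concatenate text files, making sure each one ends with a newline.
# combine_inputs is a hypothetical helper, not part of llama.cpp.
combine_inputs() {
    for f in "$@"; do
        cat "$f"
        # tail -c 1 yields the last byte; command substitution strips a
        # trailing newline, so a non-empty result means the newline is missing.
        if [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n'
        fi
    done
}

# Demo with two small generated files (placeholder content).
tmp=$(mktemp -d)
printf 'alpha' > "$tmp/a.txt"      # no trailing newline
printf 'beta\n' > "$tmp/b.txt"     # trailing newline present
combine_inputs "$tmp/a.txt" "$tmp/b.txt" > "$tmp/combined.txt"
cat "$tmp/combined.txt"
```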
Note on **context size** when generating the matrix: in general, a small context size such as 512 is recommended, and community
tests have shown it usually performs better than a larger one such as 4096. However, I would argue this is highly dependent on the
source data you are using: with random tokens or short text a small context makes sense, but when using larger texts, a larger
context matching the size of the texts might be a better choice. Remember that the size is in tokens, which roughly translates
to the number of words, not characters.
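To get a feel for how much data a source file provides at a given context size, a rough estimate can be made from its word count. This is a minimal sketch under a stated assumption: one token is treated as one whitespace-separated word, so the real token count depends on the model's tokenizer and will usually differ. `estimate_chunks` is a hypothetical helper name, and the sample file is generated only for the demonstration.

```shell
# Estimate how many full context-sized chunks a text file provides.
# Assumption: 1 token ~ 1 word; the real figure depends on the tokenizer.
estimate_chunks() {
    file=$1
    ctx=${2:-512}                       # context size in tokens, default 512
    words=$(( $(wc -w < "$file") ))     # arithmetic expansion strips padding
    echo $(( words / ctx ))
}

# Demo: generate a 1200-word sample file (placeholder content).
sample=$(mktemp)
i=0
while [ "$i" -lt 1200 ]; do
    printf 'word '
    i=$(( i + 1 ))
done > "$sample"

estimate_chunks "$sample" 512
```

If the estimate is lower than the number of chunks you intend to process, a larger or combined source file is probably needed.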

You will find below descriptions for the various input files provided, to help you choose the correct one.

## Community provided files

**groups_merged**\
_"Here is a decent general purpose imatrix calibration dataset. It should be more diverse than wikitext at ~30k tokens, as it is excerpts of a larger dataset which includes coding examples (which seems quite important!)
This means it's generally higher entropy data compared to wikitext, and it's real data rather than pseudo-randomly generated data.
I get lower KL div than wikitext for the same length and the outputs seem qualitatively better."_ (kalomaze)\
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384

**group_10_merged**\
(superseded by groups_merged)\
_"This is about ~50k pseudo-random tokens.
I am getting the best balance between the maximum divergence and the other divergence statistics using this file when quantizing 7b"_ (kalomaze)\
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8349233

**20k_random_data**\
(superseded by group_10_merged)\
https://github.com/ggerganov/llama.cpp/discussions/5006#discussioncomment-8163190

**8k_random_data**\
(superseded by 20k_random_data)\
https://github.com/ggerganov/llama.cpp/discussions/5006#discussion-6087829

**badwords**\
402 English words that can be considered dirty, naughty, obscene, or otherwise bad words.
This could be useful to remove guard rails.
Compiled from [Shutterstock GitHub repo](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/tree/master)

**badwords_multilingual**\
2580 words that can be considered dirty, naughty, obscene, or otherwise bad words. Includes 26 languages.
This could be useful to remove guard rails.
Compiled from [Shutterstock GitHub repo](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/tree/master)

**ptb.train**\
Penn Treebank (PTB) is a widely used, preprocessed, large dataset designed for language training. Casing,
punctuation and numbers have been removed from the training data. Recently it has largely been superseded
by WikiText, which does not have these removals and features a larger vocabulary and full articles (better
suited for models that can take advantage of long term dependencies). However, for importance matrix training,
PTB is still a valid dataset, which has the advantage of being manually curated, and similar to WikiText,
without being WikiText; this can help against bias.

**WikiText**\
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of
verified Good and Featured articles on Wikipedia. Compared to PTB, WikiText-2 is over 2 times larger and
WikiText-103 is over 110 times larger. As it is composed of full articles, the dataset is well suited for models
that can take advantage of long term dependencies.\
https://huggingface.co/datasets/wikitext  

**WikiText_FR**\
70 million tokens extracted from the set of French Wikipedia articles that are classified as "quality articles"
or "good articles".\
https://huggingface.co/datasets/asi/wikitext_fr

**c4**\
The C4 dataset is a collection of text sourced from the public Common Crawl web scrape.
It includes heuristics to extract only natural language (as opposed to boilerplate and other gibberish)
in addition to extensive deduplication. The C4 dataset was explicitly designed to be English only:
any page that was not given a probability of at least 99% of being English by langdetect was discarded.

**code** (exllamav2)\
Programming

**multilingual** (exllamav2)\
English, Arabic, Chinese, French, German, Japanese, Polish, Russian, Spanish, Swedish, Turkish, Hebrew,
Macedonian, Norwegian, Lithuanian, Greek, Italian, Afrikaans, Dutch, Danish. 

**technical** (exllamav2)\
Technical writing.

**tiny**\
Very short stories. Be mindful of the prevalence of _"Once upon a time"_ and _"<|endoftext|>"_.
Extract from [TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories)

**wiki** (exllamav2)\
Small Wikipedia dump. Unclean, contains many unwanted tags.

exllamav2 calibration data taken from:\
https://github.com/turboderp/exllamav2/tree/master/conversion/standard_cal_data

# How to Convert Supported LLMs (Large Language Models) to GGUF Format:
```
llama.cpp % python convert.py path_to_model --outtype f16
```

## How to quantize using an imatrix, with llama.cpp

1. Get one of the input files collected here, or elsewhere.
2. Convert or download the model you want to quantise, in fp16 GGUF format.
3. Generate an imatrix file specific to the model you want to quantise
```
cd <llama.cpp directory>
./imatrix -m <model_path>/ggml-model-f16.gguf -f <plain_text_matrix_file> -o <output.matrix> -t 12 -ngl 144 --chunks 100 -b 512 -c 512

# -ngl 144 : layers offloaded to GPU (recommended: the number of layers the model contains)
# -t 12   : number of threads (should probably match the number of CPU cores)
# -c 512  : context size, testing seems to show 512 is recommended (default=512, 0=loaded from model)
# -b 512  : batch size (default=512)
# --chunks 100 : number of chunks to process (recommended)
# --mlock : keep model in RAM (only use if you have sufficient RAM for the whole fp16 model)
```
4. Use the generated matrix file to quantise the model
```
./quantize --imatrix <output.matrix> <model_path>/ggml-model-f16.gguf <quantisation_level, eg:IQ4_XS>
```
Note: normal quantisation also benefits from using a matrix file. It also seems that a bigger input matrix is
better for higher quantisation.
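
Steps 3 and 4 above can be chained in a small wrapper script. The following is only a dry-run sketch: it prints the commands and executes them only when `DRY_RUN=0` is set. All paths and file names (`models/my-model`, `groups_merged.txt`, `my-model_groups-merged.imatrix`) are hypothetical placeholders, the `run` helper is not part of llama.cpp, and the machine-specific `-t` and `-ngl` options are left out; the remaining flags are the ones used in the commands above.

```shell
# Dry-run sketch of steps 3 and 4: commands are printed, and only
# executed when DRY_RUN=0. All paths below are hypothetical placeholders.
DRY_RUN=${DRY_RUN:-1}
MODEL=${MODEL:-models/my-model/ggml-model-f16.gguf}
CALIB=${CALIB:-groups_merged.txt}
IMATRIX=${IMATRIX:-my-model_groups-merged.imatrix}
QUANT=${QUANT:-IQ4_XS}
OUTPUT=${OUTPUT:-models/my-model/ggml-model-$QUANT.gguf}

# Print a command, then run it unless this is a dry run.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Step 3: generate the imatrix for this specific model.
run ./imatrix -m "$MODEL" -f "$CALIB" -o "$IMATRIX" --chunks 100 -b 512 -c 512

# Step 4: quantise the model using the generated imatrix.
run ./quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" "$QUANT"
```

Reviewing the printed commands before setting `DRY_RUN=0` is a cheap way to catch a wrong path before starting a long imatrix computation.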