---
language:
- en
author: Joseph717171 & froggeric (https://huggingface.co/datasets/froggeric/imatrix/edit/main/README.md)
---
# All credit for this wonderful Repo Card, which explains the similarities and differences between computed imatrices, and between training datasets and their intended purposes for particular large language models, goes to [froggeric](https://huggingface.co/datasets/froggeric/imatrix).
# Note: All uploaded imatrices to this repo are pre-computed, and are, therefore, ready to be used in llama.cpp's quantization process.
# Note: Imatrices uploaded to this repo use the naming convention: model-name_training-dataset.imatrix (the hyphens are used in this example purely to enhance readability...)
# Instructions: Download the imatrix for your chosen LLM (Large Language Model), and quantize to your preferred QuantType. (Note the following example already assumes you converted your model to GGUF)
```
llama.cpp % ./quantize --imatrix path_to_imatrix path_to_model/ggml-model-f16.gguf model_name-QuantType.gguf QuantType
```
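To produce several quantizations from the same imatrix, the command above can be wrapped in a loop. This is a minimal sketch, assuming the `quantize` binary is in the current directory. The paths, the model name, and the list of quant types are placeholders, and `DRY_RUN` and `quantize_all` are hypothetical names introduced here (they are not part of llama.cpp). By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Placeholder paths and names: replace with your own.
IMATRIX="path_to_imatrix"
MODEL_F16="path_to_model/ggml-model-f16.gguf"
MODEL_NAME="model_name"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = only print the commands, 0 = run them

quantize_all() {
    # Each argument is a quant type, e.g. Q4_K_M or IQ4_XS.
    for qtype in "$@"; do
        out="${MODEL_NAME}-${qtype}.gguf"
        if [ "$DRY_RUN" = "1" ]; then
            echo ./quantize --imatrix "$IMATRIX" "$MODEL_F16" "$out" "$qtype"
        else
            ./quantize --imatrix "$IMATRIX" "$MODEL_F16" "$out" "$qtype" || return 1
        fi
    done
}

quantize_all Q4_K_M IQ4_XS
```

Run it with `DRY_RUN=0` once the printed commands look right.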
# Note: If you need detailed steps to convert your Large Language Model to GGUF, please scroll to the bottom of this page and check out the section: How to convert Supported LLMs (Large Language Models) to GGUF format
# Supplementary Learning: Training Datasets, Their Similarities and Differences, and How to Determine Which one will Be Right for Computing your Imatrix
# Input files for generating the Importance Matrix
## Which file to use for generating the importance matrix
Not all importance matrices are equal. The best results are obtained when using a source file similar to the
training data. Size also matters: the bigger the model (e.g. 70b vs 13b) and the higher the quant (e.g. q6_k vs iq3_xs),
the bigger the source file needs to be to make an impact. Multiple input files can be combined if needed;
for example:
```
cat technical.txt multilingual.txt wiki.txt >custom.matrix
```
Note on **context size** when generating the matrix: in general, a small context size such as 512 is recommended, and community
tests have shown it usually performs better than a larger one such as 4096. However, I would argue this is highly dependent on the
source data you are using: with random tokens or short text a small context makes sense; but when using larger texts, a larger
context matching the size of the texts might be a better choice. Remember that the size is in tokens, which roughly translates
to number of words, not characters.
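
Since the size is counted in tokens, a quick word count gives a rough idea of how many chunks a source file can fill at a given context size. This is a minimal sketch, assuming roughly one token per word as described above. Real tokenizers typically produce somewhat more tokens than words, so treat the result as a conservative estimate. `estimate_chunks` is a hypothetical helper, not part of llama.cpp.

```shell
#!/bin/sh
# estimate_chunks FILE [CONTEXT_SIZE]
# Prints the approximate number of full chunks FILE can fill,
# using the word count as a rough stand-in for the token count.
estimate_chunks() {
    file="$1"
    ctx="${2:-512}"              # context size in tokens (512 unless given)
    words=$(wc -w < "$file")     # word count ~ token count (rough assumption)
    echo $(( words / ctx ))
}

# Example: build a sample file of 1024 words and estimate at context size 512.
sample=$(mktemp)
i=0
while [ "$i" -lt 1024 ]; do
    printf 'word '
    i=$(( i + 1 ))
done > "$sample"
estimate_chunks "$sample" 512    # prints 2
rm -f "$sample"
```

If the estimate is lower than the number of chunks you intend to process, the source file is probably too small for that context size.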
You will find below descriptions for the various input files provided, to help you choose the correct one.
## Community provided files
**groups_merged**\
_"Here is a decent general purpose imatrix calibration dataset. It should be more diverse than wikitext at ~30k tokens, as it is excerpts of a larger dataset which includes coding examples (which seems quite important!)
This means it's generally higher entropy data compared to wikitext, and it's real data rather than pseudo-randomly generated data.
I get lower KL div than wikitext for the same length and the outputs seem qualitatively better."_ (kalomaze)\
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384
**group_10_merged**\
(superseded by groups_merged)\
_"This is about ~50k pseudo-random tokens.
I am getting the best balance between the maximum divergence and the other divergence statistics using this file when quantizing 7b"_ (kalomaze)\
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8349233
**20k_random_data**\
(superseded by group_10_merged)\
https://github.com/ggerganov/llama.cpp/discussions/5006#discussioncomment-8163190
**8k_random_data**\
(superseded by 20k_random_data)\
https://github.com/ggerganov/llama.cpp/discussions/5006#discussion-6087829
**badwords**\
402 English words that can be considered dirty, naughty, obscene, or otherwise bad words.
This could be useful to remove guard rails.
Compiled from [Shutterstock github repo](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/tree/master)
**badwords_multilingual**\
2580 words that can be considered dirty, naughty, obscene, or otherwise bad words. Includes 26 languages.
This could be useful to remove guard rails.
Compiled from [Shutterstock github repo](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/tree/master)
**ptb.train**\
Penn Treebank (PTB) is a widely used, preprocessed large dataset designed for language training. Casing,
punctuation and numbers have been removed from the training data. It has recently been largely superseded
by WikiText, which does not have these removals and features a larger vocabulary and full articles (better
suited for models that can take advantage of long-term dependencies). However, for importance matrix training,
PTB is still a valid dataset, which has the advantage of being manually curated, and of being similar to WikiText
without being WikiText; this can help against bias.
**WikiText**\
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of
verified Good and Featured articles on Wikipedia. Compared to PTB, WikiText-2 is over 2 times larger and
WikiText-103 is over 110 times larger. As it is composed of full articles, the dataset is well suited for models
that can take advantage of long term dependencies.\
https://huggingface.co/datasets/wikitext
**WikiText_FR**\
70 million tokens extracted from the set of French Wikipedia articles that are classified as "quality articles"
or "good articles".\
https://huggingface.co/datasets/asi/wikitext_fr
**c4**\
The C4 dataset is a collection of text sourced from the public Common Crawl web scrape.
It includes heuristics to extract only natural language (as opposed to boilerplate and other gibberish)
in addition to extensive deduplication. C4 dataset was explicitly designed to be English only:
any page that was not given a probability of at least 99% of being English by langdetect was discarded.
**code** (exllamav2)\
Programming
**multilingual** (exllamav2)\
English, Arabic, Chinese, French, German, Japanese, Polish, Russian, Spanish, Swedish, Turkish, Hebrew,
Macedonian, Norwegian, Lithuanian, Greek, Italian, Afrikaans, Dutch, Danish.
**technical** (exllamav2)\
Technical writing.
**tiny**\
Very short stories. Be mindful of the prevalence of _"Once upon a time"_ and _"<|endoftext|>"_.
Extract from [TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories)
**wiki** (exllamav2)\
Small Wikipedia dump. Unclean, contains many unwanted tags.
exllamav2 calibration data taken from:\
https://github.com/turboderp/exllamav2/tree/master/conversion/standard_cal_data
# How to Convert Supported LLMs (Large Language Models) to GGUF Format:
```
llama.cpp % python convert.py path_to_model --outtype f16
```
## How to quantize using an imatrix, with llama.cpp
1. Get one of the input files collected here, or elsewhere.
2. Convert or download the model you want to quantise, in fp16 GGUF format.
3. Generate an imatrix file specific to the model you want to quantise
```
cd <llama.cpp directory>
./imatrix -m <model_path>/ggml-model-f16.gguf -f <plain_text_matrix_file> -o <output.matrix> -t 12 -ngl 144 --chunks 100 -b 512 -c 512
# -ngl 144 : layers offloaded to GPU (recommended: the number of layers the model contains)
# -t 12 : number of threads (should probably match the number of CPU cores)
# -c 512 : context size, testing seems to show 512 is recommended (default=512, 0=loaded from model)
# -b 512 : batch size (default=512)
# --chunks 100 : number of chunks to process (100 recommended)
# --mlock : keep model in RAM (only use if you have sufficient RAM for the whole fp16 model)
```
4. Use the generated matrix file to quantise the model
```
./quantize --imatrix <output.matrix> <model_path>/ggml-model-f16.gguf <quantisation_level, eg:IQ4_XS>
```
Note: normal quantisation also benefits from using a matrix file. It also seems that a bigger input matrix is
better for higher quantisation.
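
Steps 3 and 4 can be chained in a small script. This is a minimal sketch that reuses only the options shown above. All paths, the input file name `groups_merged.txt`, and the quant type are placeholders, and `DRY_RUN`, `run`, and `imatrix_then_quantize` are hypothetical names introduced for illustration. By default it prints the two commands instead of executing them.

```shell
#!/bin/sh
# Placeholder values: adjust to your model, input file and hardware.
MODEL_F16="model_path/ggml-model-f16.gguf"
INPUT_TXT="groups_merged.txt"
IMATRIX_OUT="output.matrix"
QTYPE="IQ4_XS"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

imatrix_then_quantize() {
    # Step 3: generate the imatrix for this specific model.
    run ./imatrix -m "$MODEL_F16" -f "$INPUT_TXT" -o "$IMATRIX_OUT" \
        -t 12 -ngl 144 --chunks 100 -b 512 -c 512 || return 1
    # Step 4: quantise the fp16 model using the generated imatrix.
    run ./quantize --imatrix "$IMATRIX_OUT" "$MODEL_F16" "$QTYPE"
}

imatrix_then_quantize
```

Stopping at the first failure (`|| return 1`) avoids quantising with a missing or incomplete matrix file.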