maddes8cht committed on
Commit
0cd3026
1 Parent(s): a43bb4e

"Update README.md"

  1. README.md +349 -0
README.md ADDED
@@ -0,0 +1,349 @@
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - bigcode/starcoderdata
5
+ - tiiuae/falcon-refinedweb
6
+ - togethercomputer/RedPajama-Data-1T
7
+ metrics:
8
+ - accuracy
9
+ - code_eval
10
+ model-index:
11
+ - name: long_llama_3b_v1_1
12
+ results:
13
+ - task:
14
+ name: Code Generation
15
+ type: code-generation
16
+ dataset:
17
+ name: "HumanEval"
18
+ type: openai_humaneval
19
+ metrics:
20
+ - name: pass@1
21
+ type: pass@1
22
+ value: 0.12
23
+ verified: false
24
+ pipeline_tag: text-generation
25
+ tags:
26
+ - code
27
+ ---
28
+ [![banner](https://maddes8cht.github.io/assets/buttons/Huggingface-banner.jpg)]()
29
+
30
+ I'm constantly enhancing these model descriptions to provide you with the most relevant and comprehensive information.
31
+
32
+ # long_llama_3b_v1_1 - GGUF
33
+ - Model creator: [syzymon](https://huggingface.co/syzymon)
34
+ - Original model: [long_llama_3b_v1_1](https://huggingface.co/syzymon/long_llama_3b_v1_1)
35
+
36
+ OpenLLaMA is a free reimplementation of the original LLaMA model, released under the Apache 2.0 license.
37
+
38
+
39
+
40
+ # About GGUF format
41
+
42
+ `gguf` is the current file format used by the [`ggml`](https://github.com/ggerganov/ggml) library.
43
+ A growing list of software uses it and can therefore run this model.
44
+ The core project making use of the ggml library is the [llama.cpp](https://github.com/ggerganov/llama.cpp) project by Georgi Gerganov.
45
+
46
+ # Quantization variants
47
+
48
+ A range of quantized files is available to cater to your specific needs. Here's how to choose the best option for you:
49
+
50
+ # Legacy quants
51
+
52
+ Q4_0, Q4_1, Q5_0, Q5_1 and Q8_0 are `legacy` quantization types.
53
+ Nevertheless, they are fully supported, as there are several circumstances that cause certain models not to be compatible with the modern K-quants.
54
+ ## Note:
55
+ Now there's a new option to use K-quants even for previously 'incompatible' models, although this involves some fallback solution that makes them not *real* K-quants. More details can be found in affected model descriptions.
56
+ (This mainly refers to Falcon 7b and Starcoder models)
57
+
58
+ # K-quants
59
+
60
+ K-quants are designed with the idea that different levels of quantization in specific parts of the model can optimize performance, file size, and memory load.
61
+ So, if possible, use K-quants.
62
+ With a Q6_K, you'll likely find it challenging to discern a quality difference from the original model: ask your model the same question twice and you may encounter bigger quality differences.
63
+
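To make the file-size versus quality trade-off concrete, here is a minimal sketch of block-wise quantization in plain Python. It is a simplification for illustration only: the function names are hypothetical, and the real GGUF legacy and K-quant layouts are more elaborate than this single-scale-per-block scheme.

```python
def quantize_block(values, bits):
    """Symmetric quantization of one block: returns (scale, integer codes)."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quants = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Map the integer codes back to approximate float weights."""
    return [scale * q for q in quants]


def max_error(values, bits):
    """Largest absolute round-trip error for one block."""
    scale, quants = quantize_block(values, bits)
    restored = dequantize_block(scale, quants)
    return max(abs(v - r) for v, r in zip(values, restored))


# Hypothetical block of weights, not taken from any real model.
block = [0.12, -0.70, 0.33, 0.05, -0.41, 0.68, 0.0, -0.09]
for bits in (4, 6, 8):
    print(bits, "bits -> max error", round(max_error(block, bits), 4))
```

Fewer bits per weight mean a smaller file but a coarser grid, so the worst-case round-trip error (half the scale) grows as the bit width shrinks.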
64
+
65
+
66
+
67
+ ---
68
+
69
+ # Original Model Card:
70
+ # LongLLaMA: Focused Transformer Training for Context Scaling
71
+
72
+ <div align="center">
73
+
74
+ <a href="https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_colab.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg"></a>
75
+
76
+ </div>
77
+
78
+ <div align="center">
79
+
80
+ [TLDR](#TLDR) | [Overview](#Overview) | [Usage](#Usage) | [LongLLaMA performance](#LongLLaMA-performance) | [Authors](#Authors) | [Citation](#Citation) | [License](#License) | [Acknowledgments](#Acknowledgments)
81
+
82
+ </div>
83
+
84
+ ## TLDR
85
+ This repo contains [LongLLaMA-3Bv1.1](https://huggingface.co/syzymon/long_llama_3b_v1_1).
86
+
87
+ LongLLaMA is built upon the foundation of [OpenLLaMA](https://github.com/openlm-research/open_llama) and fine-tuned using the [Focused Transformer (FoT)](https://arxiv.org/abs/2307.03170) method. We release a smaller 3B base variant (not instruction tuned) of the LongLLaMA model under a permissive license (Apache 2.0) and inference code supporting longer contexts on [Hugging Face](https://huggingface.co/syzymon/long_llama_3b). Our model weights can serve as a drop-in replacement for LLaMA in existing implementations (for short contexts of up to 2048 tokens). Additionally, we provide evaluation results and comparisons against the original OpenLLaMA models. Stay tuned for further updates.
88
+
89
+
90
+
91
+
92
+ ## Overview
93
+
94
+ ### Base models
95
+ [Focused Transformer: Contrastive Training for Context Scaling](https://arxiv.org/abs/2307.03170) (FoT) presents a simple method for endowing language models with the ability to handle context consisting possibly of millions of tokens while training on significantly shorter input. FoT permits a subset of attention layers to access a memory cache of (key, value) pairs to extend the context length. The distinctive aspect of FoT is its training procedure, drawing from contrastive learning. Specifically, we deliberately expose the memory attention layers to both relevant and irrelevant keys (like negative samples from unrelated documents). This strategy incentivizes the model to differentiate keys connected with semantically diverse values, thereby enhancing their structure. This, in turn, makes it possible to extrapolate the effective context length much beyond what is seen in training.
96
+
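The idea of a memory layer attending over cached (key, value) pairs can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the repository's implementation: it handles a single query and a single head, ignores positional encodings and masking, and the function names are hypothetical.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query vector over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                        # subtract the max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def memory_attention(query, local_kv, memory_kv):
    """Attend jointly over the local context and the memory cache."""
    keys = [k for k, _ in memory_kv] + [k for k, _ in local_kv]
    values = [v for _, v in memory_kv] + [v for _, v in local_kv]
    return attend(query, keys, values)


# Toy data: the matching key lives in memory, not in the local context.
local_kv = [([0.0, 1.0], [0.0, 1.0])]
memory_kv = [([1.0, 0.0], [1.0, 0.0])]
output = memory_attention([10.0, 0.0], local_kv, memory_kv)
```

In this toy, almost all attention weight goes to the memory entry whose key matches the query. Telling relevant keys apart from irrelevant ones is the behavior the contrastive training described above is meant to encourage.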
97
+
98
+ **LongLLaMA** is an [OpenLLaMA](https://github.com/openlm-research/open_llama) model finetuned with the FoT method,
99
+ with three layers used for context extension. **Crucially, LongLLaMA is able to extrapolate much beyond the context length seen in training: $8k$. E.g., in the passkey retrieval task, it can handle inputs of length $256k$**.
100
+
101
+ <div align="center">
102
+
103
+ | | [LongLLaMA-3B](https://huggingface.co/syzymon/long_llama_3b) | [LongLLaMA-3Bv1.1](https://huggingface.co/syzymon/long_llama_3b_v1_1) | LongLLaMA-7B<br />*(coming soon)* | LongLLaMA-13B<br />*(coming soon)* |
+ |----------------|----------|----------|-----------|-----------|
+ | Source model | [OpenLLaMA-3B](https://huggingface.co/openlm-research/open_llama_3b_easylm) | [OpenLLaMA-3Bv2](https://huggingface.co/openlm-research/open_llama_3b_v2_easylm) | - | - |
+ | Source model tokens | 1T | 1T | - | - |
+ | Fine-tuning tokens | 10B | 5B | - | - |
+ | Memory layers | 6, 12, 18 | 6, 12, 18 | - | - |
109
+
110
+ </div>
111
+
112
+ ### Instruction/Chat tuning
113
+
114
+ In the [fine_tuning](fine_tuning) subfolder we provide the code that was used to create [LongLLaMA-Instruct-3Bv1.1](https://huggingface.co/syzymon/long_llama_3b_instruct), an instruction-tuned version of [LongLLaMA-3Bv1.1](https://huggingface.co/syzymon/long_llama_3b_v1_1). We used [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) (instructions) and [zetavg/ShareGPT-Processed](https://huggingface.co/datasets/zetavg/ShareGPT-Processed) (chat) datasets for tuning.
115
+
116
+
117
+ ## Usage
118
+
119
+ See also:
120
+ * [Colab with LongLLaMA-Instruct-3Bv1.1](https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_instruct_colab.ipynb).
121
+ * [Colab with an example usage of base LongLLaMA](https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_colab.ipynb).
122
+ ### Requirements
123
+ ```
+ pip install --upgrade pip
+ pip install transformers==4.30 sentencepiece accelerate
+ ```
127
+
128
+ ### Loading model
129
+ ```python
+ import torch
+ from transformers import LlamaTokenizer, AutoModelForCausalLM
+
+ tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b_v1_1")
+ # trust_remote_code=True lets transformers run the model code shipped with the checkpoint
+ model = AutoModelForCausalLM.from_pretrained(
+     "syzymon/long_llama_3b_v1_1",
+     torch_dtype=torch.float32,
+     trust_remote_code=True,
+ )
+ ```
138
+
139
+ ### Input handling and generation
140
+ LongLLaMA uses the Hugging Face interface. Long input given to the model will be
141
+ split into context windows and loaded into the memory cache.
142
+ ```python
+ prompt = "My name is Julien and I like to"
+ input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+ # Forward pass; inputs longer than one context window are split internally
+ outputs = model(input_ids=input_ids)
+ ```
147
+ During the model call, one can provide the parameter `last_context_length` (default $1024$), which specifies the number of tokens left in the last context window. Tuning this parameter can improve generation as the first layers do not have access to memory. See details in [How LongLLaMA handles long inputs](#How-LongLLaMA-handles-long-inputs).
148
+
149
+ ```python
+ generation_output = model.generate(
+     input_ids=input_ids,
+     max_new_tokens=256,
+     num_beams=1,
+     last_context_length=1792,  # tokens kept in the last context window
+     do_sample=True,
+     temperature=1.0,
+ )
+ print(tokenizer.decode(generation_output[0]))
+ ```
160
+
161
+ ### Additional configuration
162
+ LongLLaMA has several other parameters:
163
+ * `mem_layers` specifies layers endowed with memory (should be either an empty list or a list of all memory layers specified in the description of the checkpoint).
164
+ * `mem_dtype` allows changing the data type of the memory cache.
165
+ * `mem_attention_grouping` can trade off speed for reduced memory usage.
166
+ When equal to `(4, 2048)`, the memory layers will process at most $4*2048$ queries at once ($4$ heads and $2048$ queries for each head).
167
+
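The arithmetic behind `mem_attention_grouping` can be sketched as follows. This assumes the grouping tiles the head and query dimensions into equally sized chunks, which is an interpretation of the description above rather than a statement about the actual implementation. The helper name and the example head and query counts are hypothetical.

```python
import math


def memory_attention_chunks(num_heads, num_queries, grouping=(4, 2048)):
    """Return (max queries processed at once, number of chunks needed)."""
    heads_per_chunk, queries_per_chunk = grouping
    max_at_once = heads_per_chunk * queries_per_chunk
    num_chunks = (math.ceil(num_heads / heads_per_chunk)
                  * math.ceil(num_queries / queries_per_chunk))
    return max_at_once, num_chunks


# Hypothetical example: 32 heads, 8192 queries per head.
print(memory_attention_chunks(32, 8192))  # (8192, 32)
```

Smaller groups lower peak memory because fewer queries are in flight at once, at the cost of more sequential chunks.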
168
+ ```python
+ import torch
+ from transformers import LlamaTokenizer, AutoModelForCausalLM
+
+ tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b_v1_1")
+ model = AutoModelForCausalLM.from_pretrained(
+     "syzymon/long_llama_3b_v1_1",
+     torch_dtype=torch.float32,
+     mem_layers=[],
+     mem_dtype='bfloat16',
+     trust_remote_code=True,
+     mem_attention_grouping=(4, 2048),
+ )
+ ```
181
+
182
+
183
+ ### Drop-in use with LLaMA code
184
+ LongLLaMA checkpoints can also be used as a drop-in replacement for LLaMA checkpoints in [Hugging Face implementation of LLaMA](https://huggingface.co/docs/transformers/main/model_doc/llama), but in this case, they will be limited to the original context length of $2048$.
185
+
186
+ ```python
+ from transformers import LlamaTokenizer, LlamaForCausalLM
+ import torch
+
+ tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b_v1_1")
+ model = LlamaForCausalLM.from_pretrained("syzymon/long_llama_3b_v1_1", torch_dtype=torch.float32)
+ ```
193
+
194
+
195
+ ### How LongLLaMA handles long inputs
196
+ Inputs over $2048$ tokens are automatically split into windows $w_1, \ldots, w_m$. The first $m-2$ windows contain $2048$ tokens each, $w_{m-1}$ has no more than $2048$ tokens, and $w_m$ contains the number of tokens specified by `last_context_length`. The model processes the windows one by one, extending the memory cache after each. If `use_cache` is `True`, the last window will not be loaded into the memory cache but into the local (generation) cache.
197
+
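The splitting rule described above can be sketched as a small helper that returns the window lengths. This is a reading of the prose, not code from the repository: the function name is hypothetical, and it assumes `last_context_length` does not exceed the window size.

```python
def split_window_lengths(num_tokens, last_context_length=1024, window=2048):
    """Return the token count of each context window, in processing order."""
    if num_tokens <= window:
        return [num_tokens]                  # short inputs are not split
    last = min(last_context_length, num_tokens)
    remaining = num_tokens - last            # tokens before the last window
    full, rest = divmod(remaining, window)
    lengths = [window] * full                # full-size windows
    if rest:
        lengths.append(rest)                 # w_{m-1}, at most `window` tokens
    lengths.append(last)                     # w_m, set by last_context_length
    return lengths


print(split_window_lengths(5000))  # [2048, 1928, 1024]
```

For a 5000-token input with the default `last_context_length` of 1024, the last window takes 1024 tokens and the remaining 3976 are split into one full window and one partial window.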
198
+ The memory cache stores $(key, value)$ pairs for each head of the specified memory layers `mem_layers`. In addition to this, it stores attention masks.
199
+
200
+ If `use_cache=True` (which is the case in generation), LongLLaMA will use two caches: the memory cache for the specified layers and the local (generation) cache for all layers. When the local cache exceeds $2048$ elements, its content is moved to the memory cache for the memory layers.
201
+
202
+ For simplicity, context extension is realized with a memory cache and full attention in this repo. Replacing this simple mechanism with a KNN search over an external database is possible with systems like [Faiss](https://github.com/facebookresearch/faiss). This could potentially enable further context length scaling. We leave this as future work.
203
+
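As a rough illustration of the KNN alternative, the sketch below retrieves the indices of the top-k keys by inner product, using a brute-force search in plain Python rather than Faiss. The function name and toy data are hypothetical. A real system would delegate this lookup to an approximate-nearest-neighbor index and then attend only over the retrieved (key, value) pairs.

```python
import heapq


def top_k_keys(query, keys, k):
    """Return indices of the k keys with the largest inner product with query."""
    scored = (
        (sum(q * x for q, x in zip(query, key)), idx)
        for idx, key in enumerate(keys)
    )
    best = heapq.nlargest(k, scored)         # sorted by score, descending
    return [idx for _, idx in best]


# Toy memory of four 2-dimensional keys.
memory_keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k_keys([1.0, 0.0], memory_keys, k=2))  # [0, 2]
```

Attention cost then depends on k rather than on the full memory size, which is what would allow the memory to grow further.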
204
+
205
+ ## LongLLaMA performance
206
+ We present some illustrative examples of LongLLaMA results. Refer to our paper [Focused Transformer: Contrastive Training for Context Scaling](https://arxiv.org/abs/2307.03170) for more details.
207
+
208
+ We manage to achieve good performance on the passkey retrieval task from [Landmark Attention: Random-Access Infinite Context Length for Transformers](https://arxiv.org/abs/2305.16300). The code for generating the prompt and running the model is located in `examples/passkey.py`.
209
+
210
+ <p align="center" width="100%">
211
+ <img src="assets/plot_passkey.png" alt="LongLLaMA" style="width: 70%; min-width: 300px; display: block; margin: auto;">
212
+ </p>
213
+
214
+ Our LongLLaMA 3B model also shows improvements when using long context on two downstream tasks, TREC question classification and WebQS question answering.
215
+ <div align="center">
216
+
217
+
218
+ | Context/Dataset | TREC | WebQS |
+ | --- | --- | --- |
+ | $2K$ | 67.0 | 21.2 |
+ | $4K$ | 71.6 | 21.4 |
+ | $6K$ | 72.9 | 22.2 |
+ | $8K$ | **73.3** | **22.4** |
224
+
225
+ </div>
226
+
227
+ LongLLaMA retains performance on tasks that do not require long context. We provide a comparison with OpenLLaMA
228
+ on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) in the zero-shot setting.
229
+ <div align="center">
230
+
231
+ | Task/Metric | OpenLLaMA-3B | LongLLaMA-3B |
+ |----------------|----------|-----------|
+ | anli_r1/acc | 0.33 | 0.32 |
+ | anli_r2/acc | 0.32 | 0.33 |
+ | anli_r3/acc | 0.35 | 0.35 |
+ | arc_challenge/acc | 0.34 | 0.34 |
+ | arc_challenge/acc_norm | 0.37 | 0.37 |
+ | arc_easy/acc | 0.69 | 0.68 |
+ | arc_easy/acc_norm | 0.65 | 0.63 |
+ | boolq/acc | 0.68 | 0.68 |
+ | hellaswag/acc | 0.49 | 0.48 |
+ | hellaswag/acc_norm | 0.67 | 0.65 |
+ | openbookqa/acc | 0.27 | 0.28 |
+ | openbookqa/acc_norm | 0.40 | 0.38 |
+ | piqa/acc | 0.75 | 0.73 |
+ | piqa/acc_norm | 0.76 | 0.75 |
+ | record/em | 0.88 | 0.87 |
+ | record/f1 | 0.89 | 0.87 |
+ | rte/acc | 0.58 | 0.60 |
+ | truthfulqa_mc/mc1 | 0.22 | 0.24 |
+ | truthfulqa_mc/mc2 | 0.35 | 0.38 |
+ | wic/acc | 0.48 | 0.50 |
+ | winogrande/acc | 0.62 | 0.60 |
+ | Avg score | 0.53 | 0.53 |
255
+
256
+ </div>
257
+
258
+ Starting with v1.1 models, we have decided to use the [EleutherAI](https://github.com/EleutherAI) implementation of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with a slight modification that adds a `<bos>` token at the beginning of the input sequence. The results are provided in the table below.
259
+
260
+ <div align="center">
261
+
262
+ | Task/Metric | LongLLaMA-3B | OpenLLaMA-3Bv2 | LongLLaMA-3Bv1.1 | LongLLaMA-Instruct-3Bv1.1 |
+ |:-----------------------|:-------------|:---------------|:-----------------|:--------------------------|
+ | anli_r1/acc | 0.32 | 0.33 | 0.31 | 0.33 |
+ | anli_r2/acc | 0.33 | 0.35 | 0.33 | 0.35 |
+ | anli_r3/acc | 0.35 | 0.38 | 0.35 | 0.38 |
+ | arc_challenge/acc | 0.34 | 0.33 | 0.32 | 0.36 |
+ | arc_challenge/acc_norm | 0.37 | 0.36 | 0.36 | 0.37 |
+ | arc_easy/acc | 0.67 | 0.68 | 0.68 | 0.70 |
+ | arc_easy/acc_norm | 0.63 | 0.63 | 0.63 | 0.63 |
+ | boolq/acc | 0.68 | 0.67 | 0.66 | 0.77 |
+ | hellaswag/acc | 0.48 | 0.53 | 0.52 | 0.52 |
+ | hellaswag/acc_norm | 0.65 | 0.70 | 0.69 | 0.68 |
+ | openbookqa/acc | 0.28 | 0.28 | 0.28 | 0.28 |
+ | openbookqa/acc_norm | 0.38 | 0.39 | 0.37 | 0.41 |
+ | piqa/acc | 0.73 | 0.77 | 0.77 | 0.78 |
+ | piqa/acc_norm | 0.75 | 0.78 | 0.77 | 0.77 |
+ | record/em | 0.87 | 0.87 | 0.86 | 0.85 |
+ | record/f1 | 0.88 | 0.88 | 0.87 | 0.86 |
+ | rte/acc | 0.60 | 0.53 | 0.62 | 0.70 |
+ | truthfulqa_mc/mc1 | 0.24 | 0.22 | 0.21 | 0.25 |
+ | truthfulqa_mc/mc2 | 0.38 | 0.35 | 0.35 | 0.40 |
+ | wic/acc | 0.50 | 0.50 | 0.50 | 0.54 |
+ | winogrande/acc | 0.60 | 0.66 | 0.63 | 0.65 |
+ | Avg score | 0.53 | 0.53 | 0.53 | 0.55 |
286
+
287
+ </div>
288
+
289
+
290
+ We also provide results on HumanEval. We cut the generated text after any of the following:
291
+ * `"\ndef "`
292
+ * `"\nclass "`
293
+ * `"\nif __name__"`
294
+
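A minimal sketch of such a truncation step is shown below. It assumes that "cut after" means the completion is truncated at the first occurrence of any stop sequence, so the stop sequence itself is dropped, as is common in HumanEval harnesses. The helper name is hypothetical and this is not the authors' evaluation code.

```python
STOP_SEQUENCES = ["\ndef ", "\nclass ", "\nif __name__"]


def truncate_completion(text, stop_sequences=STOP_SEQUENCES):
    """Cut the text at the earliest stop sequence, if any is present."""
    cut = len(text)
    for stop in stop_sequences:
        pos = text.find(stop)
        if pos != -1:
            cut = min(cut, pos)              # keep the earliest match
    return text[:cut]


completion = "    return a + b\n\ndef other():\n    pass"
print(repr(truncate_completion(completion)))  # '    return a + b\n'
```

Truncating this way keeps only the body of the function being completed and discards any extra definitions the model goes on to generate.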
295
+ <div align="center">
296
+
297
+ | | OpenLLaMA-3Bv2 | LongLLaMA-3Bv1.1 | LongLLaMA-Instruct-3Bv1.1 |
+ | - | - | - | - |
+ | pass@1 | 0.09 | 0.12 | 0.12 |
300
+
301
+ </div>
302
+
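For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval, sketched below. The model card does not state how many samples per problem were drawn or whether this exact estimator was used, so treat the function and its example inputs as illustrative assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if n - c < k:
        return 1.0                           # every size-k subset has a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples, 3 correct.
print(round(pass_at_k(10, 3, 1), 2))  # 0.3
```

For k = 1 the estimator reduces to the fraction of correct samples, and the reported score is its average over all problems.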
303
+ ## Authors
304
+ - [Szymon Tworkowski](https://scholar.google.com/citations?user=1V8AeXYAAAAJ&hl=en)
305
+ - [Konrad Staniszewski](https://scholar.google.com/citations?user=CM6PCBYAAAAJ)
306
+ - [Mikołaj Pacek](https://scholar.google.com/citations?user=eh6iEbQAAAAJ&hl=en&oi=ao)
307
+ - [Henryk Michalewski](https://scholar.google.com/citations?user=YdHW1ycAAAAJ&hl=en)
308
+ - [Yuhuai Wu](https://scholar.google.com/citations?user=bOQGfFIAAAAJ&hl=en)
309
+ - [Piotr Miłoś](https://scholar.google.pl/citations?user=Se68XecAAAAJ&hl=pl&oi=ao)
310
+
311
+
312
+ ## Citation
313
+ To cite this work, please use:
314
+ ```bibtex
+ @misc{tworkowski2023focused,
+     title={Focused Transformer: Contrastive Training for Context Scaling},
+     author={Szymon Tworkowski and Konrad Staniszewski and Mikołaj Pacek and Yuhuai Wu and Henryk Michalewski and Piotr Miłoś},
+     year={2023},
+     eprint={2307.03170},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
324
+
325
+
326
+ ## License
327
+ The code and base model checkpoints are licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).
328
+ The instruction/chat tuned models are for research purposes only.
329
+ Some of the examples use external code (see headers of files for copyright notices and licenses).
330
+
331
+ ## Acknowledgments
332
+ We gratefully acknowledge the TPU Research Cloud program, which was instrumental to our research by providing significant computational resources. We are also grateful to Xinyang Geng and Hao Liu for releasing [OpenLLaMA](https://github.com/openlm-research/open_llama) checkpoints and the [EasyLM](https://github.com/young-geng/EasyLM) library.
333
+
334
+ ***End of original Model File***
335
+ ---
336
+
337
+
338
+ ## Please consider supporting my work
339
+ **Coming Soon:** I'm in the process of launching a sponsorship/crowdfunding campaign for my work. I'm evaluating Kickstarter, Patreon, and the new GitHub Sponsors platform, and I am hoping for some support and contribution to the continued availability of these kinds of models. Your support will enable me to provide even more valuable resources and maintain the models you rely on. Your patience and ongoing support are greatly appreciated as I work to make this page an even more valuable resource for the community.
340
+
341
+ <center>
342
+
343
+ [![GitHub](https://maddes8cht.github.io/assets/buttons/github-io-button.png)](https://maddes8cht.github.io)
344
+ [![Stack Exchange](https://stackexchange.com/users/flair/26485911.png)](https://stackexchange.com/users/26485911)
345
+ [![GitHub](https://maddes8cht.github.io/assets/buttons/github-button.png)](https://github.com/maddes8cht)
346
+ [![HuggingFace](https://maddes8cht.github.io/assets/buttons/huggingface-button.png)](https://huggingface.co/maddes8cht)
347
+ [![Twitter](https://maddes8cht.github.io/assets/buttons/twitter-button.png)](https://twitter.com/maddes1966)
348
+
349
+ </center>