bhenrym14 committed · Commit 717e68d · 1 Parent(s): 47c54ae

Update README.md

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -28,8 +28,10 @@ This model employs [Partial NTK Rope Scaling](https://github.com/jquesnelle/scal
  1. Transformers (use bnb for quantization). Use [fp16 weights](https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-fp16).
  2. Autogptq/GPTQ-for-Llama. Use these quantized weights.

+ **Note: Due to an erroneous `max_position_embeddings` figure in the base model config file, the RoPE scaling factor was computed with `original_max_position_embeddings=2048` (for Llama-2 it should be 4096). This resulted in a scaling factor of 8 instead of 4, despite passing a new `max_position_embeddings=16384`. This could have a negative to neutral performance impact. I intend to retrain this model with the proper scaling factor. If and when I do so, I will replace the weights in this repo and note the change at the top of this model card.**
+
  ## Motivation
- Methods of extending the useful context window of LLM's have gained significant traction. Several methods requiring little to no finetuning/retraining have emerged. Among these is linear position interpolation (https://kaiokendev.github.io/til#extending-context-to-8k) and [meta AI)](https://arxiv.org/abs/2306.15595)) and [NTK aware scaling](https://github.com/jquesnelle/scaled-rope). My prior experiments demonstrate significant performance improvements both from finetuning with these scaling adjustments implemented **and** with longer sequences.
+ Methods of extending the useful context window of LLMs have gained significant traction. Several methods requiring little to no finetuning/retraining have emerged. Among these are linear position interpolation ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) and [NTK aware scaling](https://github.com/jquesnelle/scaled-rope). My prior experiments demonstrate significant performance improvements both from finetuning with these scaling adjustments implemented **and** with longer sequences.

  Unfortunately it has also been shown that LLMs frequently struggle to attend to salient information in the middle of the context window. Attending to nearby tokens is essential to producing syntactically correct and semantically coherent sentences. Essential context is also most commonly found at the beginning of a context window. With this in mind, it is unsurprising LLMs often attend more strongly to these areas. Does this learned model behavior result in an "extrapolated deemphasis" when such embeddings are scaled? This hypothesis may be supported by the material improvements in perplexity achieved by training on long sequences (not just including the RoPE scaling during the fine-tune).

@@ -54,7 +56,7 @@ Here I explore whether training on long sequences that have clear conceptual dep

  ## Quantization:

- The merged model was quantized with AutoGPTQ (bits = 4, group_size = 64, desc_act = True).
+ The merged model was quantized with AutoGPTQ (bits = 4, group_size = 32, desc_act = True).

  ## Prompting:
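
Review note: the arithmetic behind the scaling-factor note added in this commit can be checked with a short sketch. It assumes the factor is simply the ratio of the target context length to the base model's original context length, which matches the 8 and 4 quoted in the note. The actual Partial NTK implementation may derive it differently, and `rope_scale_factor` is a hypothetical helper name, not part of any library.

```python
def rope_scale_factor(target_ctx: int, original_ctx: int) -> float:
    """Ratio of the extended context length to the base model's original length.

    Hypothetical helper for illustration only; it assumes the scaling factor
    is a plain ratio, as the figures in the README note suggest.
    """
    if original_ctx <= 0:
        raise ValueError("original_ctx must be positive")
    return target_ctx / original_ctx


TARGET_CTX = 16384  # max_position_embeddings passed for the fine-tune

# Value actually used: the base config erroneously reported 2048.
factor_used = rope_scale_factor(TARGET_CTX, 2048)

# Value intended: the note says Llama-2 should be 4096.
factor_intended = rope_scale_factor(TARGET_CTX, 4096)

print(f"used: {factor_used}, intended: {factor_intended}")
```

Under this assumption the model was trained with positions compressed twice as aggressively as intended, which is consistent with the note's caution about a possible negative to neutral performance impact.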