Update README.md

README.md (CHANGED)

datasets:
- jondurbin/airoboros-gpt4-1.4.1
---

# RoPE Scaled QLoRA Fine-tune of Llama-13b on airoboros-gpt4-1.4.1 (GPTQ)

LoRA Weights can be found here: https://huggingface.co/bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-LoRA
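
The linked LoRA is a standard `peft` adapter. For reference, a minimal sketch of attaching it to the base model is shown below; the base repo id and dtype are assumptions, not part of this card, and the RoPE scaling patch described under "How to Use" below is still needed for context beyond 2048 tokens.

```python
# Minimal sketch (not the card's official loading path): attach the published
# LoRA adapter to a base Llama-13b with peft. The base repo id and dtype below
# are assumptions; apply the RoPE scaling patch (see "How to Use") beforehand
# so positions are interpolated for the 8192-token context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "huggyllama/llama-13b"  # assumed repo id for base Llama-13b

base = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# LoRA repo id taken from the link above
model = PeftModel.from_pretrained(base, "bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-LoRA")
```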

- Context length extended to 8192 by RoPE Scaled Embeddings, but NOT via the superHOT LoRA. I started with base Llama-13b.
- Training sequences beyond 2048 have the target truncated to equal 2048.
- Used airoboros-gpt4-1.4.1 dataset instead of airoboros-gpt4-1.4
- **This is a QLoRA fine-tune**. The original 13b model is a full fine-tune.

Otherwise, I emulated the training process as closely as possible (rank 64 QLoRA). It was trained on 1x RTX 6000 Ada for ~18 hours.
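
To make "rank 64 QLoRA" concrete, here is a minimal sketch of that kind of setup with `transformers`, `peft`, and `bitsandbytes`. Only the rank (64) comes from this card; the base repo id, quantization details, and every other hyperparameter below are illustrative assumptions, not the exact training recipe.

```python
# Sketch of a rank-64 QLoRA setup (4-bit quantized base + LoRA adapters).
# Only the rank is taken from this card; everything else is an assumption.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "huggyllama/llama-13b"  # assumed base model repo id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # QLoRA: base weights quantized to 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The RoPE scaling patch (see "How to Use") would be applied before loading
# so the model trains with interpolated positions up to 8192 tokens.
model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,                                  # rank 64, as stated above
    lora_alpha=16,                         # assumed
    lora_dropout=0.05,                     # assumed
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection set
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```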

## How to Use

The easiest way is to use [oobabooga text-generation-webui](https://github.com/oobabooga/text-generation-webui) with ExLlama. You'll need to set max_seq_len to 8192 and compress_pos_emb to 4.
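
If you prefer to script ExLlama directly rather than going through the webui, the same two settings can be set on its config object. The sketch below assumes the module layout of the turboderp/exllama repo (run from a clone of it); class and method names are recalled from that repo and the paths are placeholders, so treat this as a rough guide rather than a verified snippet.

```python
# Rough equivalent of the webui settings above using ExLlama's Python modules
# directly (assumes a clone of turboderp/exllama on the path; paths are placeholders).
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

config = ExLlamaConfig("/path/to/model/config.json")
config.model_path = "/path/to/model/gptq-model.safetensors"
config.max_seq_len = 8192      # extended context window
config.compress_pos_emb = 4.0  # 8192 / 2048 = 4x position interpolation

model = ExLlama(config)
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer("/path/to/model/tokenizer.model")
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Once upon a time,", max_new_tokens=128))
```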

If you wish to use AutoGPTQ/GPTQ-for-Llama instead, you'll need to patch in the appropriate RoPE scaling module. See: [replace_llama_rope_with_scaled_rope](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/scaledllama/llama_rope_scaled_monkey_patch.py)
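
Conceptually, such a patch compresses positions by the interpolation factor (8192 / 2048 = 4) before the rotary sin/cos tables are built, and is assigned over `LlamaRotaryEmbedding` in `transformers.models.llama.modeling_llama` before the model is loaded. The sketch below only illustrates that idea; it is not the author's exact implementation, which lives at the link above.

```python
# Conceptual sketch of RoPE position interpolation: positions are divided by a
# fixed scale (here 4 = 8192 new / 2048 original) before the rotary sin/cos
# tables are built. Illustration only; see the linked monkey patch for the
# actual module used with this model.
import torch

class ScaledRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=8192, base=10000, scale=4.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        # Interpolate: dividing by `scale` maps 0..8191 into the 0..2048 range
        # the base model was pretrained on.
        t = torch.arange(max_position_embeddings).float() / scale
        freqs = torch.outer(t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos())
        self.register_buffer("sin_cached", emb.sin())

    def forward(self, x, seq_len):
        # Return the cached cos/sin tables truncated to the current sequence length.
        return self.cos_cached[:seq_len].to(x.dtype), self.sin_cached[:seq_len].to(x.dtype)
```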

## Motivation

Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. Finetuning has been shown to be necessary to properly leverage the longer context. The superHOT LoRA is an adapter that has been fine-tuned on longer context (8192 tokens); even when applied to models trained on dissimilar datasets, it successfully extends the context window to which the model can attend. While it's impressive that this adapter is so flexible, how much does performance suffer relative to a model that has been fine-tuned with the scaled embeddings from the start? This model is an experiment to explore that question.

## Relative Performance (perplexity)

| Model | Context (tokens) | Perplexity |