bhenrym14 committed
Commit d7530b9 · 1 Parent(s): 300be9f

Update README.md

Files changed (1):
  1. README.md +11 -8
README.md CHANGED
@@ -3,7 +3,7 @@ datasets:
 - jondurbin/airoboros-gpt4-1.4.1
 ---
 
-**7/6: This may be a little undertrained. I'll update the weights if I end up training it longer and/or with better hyperparameters. For now, I'm working on 7b.**
+**7/6: This could be a little undertrained. I'll update the weights if I end up training it longer and/or with better hyperparameters.**
 
 ---------------
 
@@ -34,18 +34,21 @@ Recent advancements in extending context by RoPE scaling ([kaiokendev](https://k
 ## Relative Performance (perplexity)
 | Model | Context (tokens) | Perplexity |
 | ---------------------------------------------------- | ----------- | ---------- |
-| TheBloke/airoboros-13B-gpt4-1-4-SuperHOT-8K-GPTQ | 512 | **7.42** |
-| TheBloke/airoboros-13B-gpt4-1-4-SuperHOT-8K-GPTQ | 2048 | **5.01** |
-| TheBloke/airoboros-13B-gpt4-1-4-SuperHOT-8K-GPTQ | 4096 | 9848.0 |
+| TheBloke/airoboros-13B-gpt4-1-4-GPTQ | 512 | **7.42** |
 | TheBloke/airoboros-13B-gpt4-1-4-SuperHOT-8K-GPTQ | 512 | 8.86 |
+| **bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-GPTQ** | 512 | 7.94 |
+| ---------------------------------------------------- | ----------- | ---------- |
+| TheBloke/airoboros-13B-gpt4-1-4-GPTQ | 2048 | **5.02** |
 | TheBloke/airoboros-13B-gpt4-1-4-SuperHOT-8K-GPTQ | 2048 | 5.98 |
+| **bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-GPTQ** | 2048 | 5.28 |
+| ---------------------------------------------------- | ----------- | ---------- |
+| TheBloke/airoboros-13B-gpt4-1-4-GPTQ | 4096 | 9848.0 |
 | TheBloke/airoboros-13B-gpt4-1-4-SuperHOT-8K-GPTQ | 4096 | 5.80 |
-| **bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-GPTQ** | **512** | 7.94 |
-| **bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-GPTQ** | **2048** | 5.28 |
-| **bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-GPTQ** | **4096** | **5.15** |
+| **bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-GPTQ** | 4096 | **5.15** |
 
 
-- How does this reduction in perplexity translate into actual performance lift on downstream tasks? I haven't used models with the SuperHOT LoRA enough to have any sense of performance differences, but feedback on the 33b variant suggests it is noticeable, particularly with coherence at longer context lengths.
+- For contexts shorter than the original 2048 tokens, the original model has lower perplexity. This is consistent with the literature. The gap shrinks as context length grows, and the original model becomes incoherent beyond 2048 tokens.
+- I haven't used models with the SuperHOT LoRA enough to have any sense of performance differences, but feedback on the 33b variant suggests it is noticeable, particularly with coherence at longer context lengths.
 - This comparison isn't perfect. I did use the 1.4.1 dataset, the quantization method is slightly different, and the finetuning method is different (QLoRA vs. full finetuning). In short, other potentially influential variables may be responsible for these performance differences.
 
 ## Quantization:
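The "PI" in the model name refers to Position Interpolation, the RoPE-scaling variant that compresses position indices rather than extrapolating beyond the pretrained range. As a minimal sketch of that idea (function and parameter names here are illustrative assumptions, not taken from the model's actual code):

```python
def rope_angles(position, dim=8, base=10000.0, scale=1.0):
    """Rotary (RoPE) angles for one token position.

    Illustrative sketch only. Position Interpolation (PI) multiplies
    positions by `scale` (e.g. 2048/8192 = 0.25), so positions up to
    8192 map back into the 0..2047 range seen during pretraining.
    """
    pos = position * scale  # PI: interpolate positions instead of extrapolating
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

# Without PI, position 8191 yields angles far outside the trained range:
plain = rope_angles(8191)                       # plain[0] == 8191.0
# With scale = 2048/8192, the same position lands inside the trained window:
interp = rope_angles(8191, scale=2048 / 8192)   # interp[0] == 2047.75
```

This is consistent with the perplexity table above: compressing positions trades a little short-context quality for coherence at contexts the base model was never trained on.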