Update README.md
README.md CHANGED

@@ -30,16 +30,19 @@ This model requires the following prompt template:

## CHOICE OF MODELS

-
+Three sets of models are provided:

-* Groupsize =
+* Groupsize = None
* Should work reliably in 24GB VRAM
+* Uses --act-order for the best possible inference quality given its lack of group_size.
+* Groupsize = 1024
+* Theoretically higher inference accuracy
+* May OOM on long context lengths in 24GB VRAM
* Groupsize = 128
* Optimal setting for highest inference quality
-*
-* In my testing it ran out of VRAM on a 24GB card around 1500 tokens returned.
+* Will definitely need more than 24GB VRAM on longer context lengths (1000-1500+ tokens returned)

-For each model, two versions are available:
+For the 128g and 1024g models, two versions are available:
* `compat.no-act-order.safetensor`
* Works with all versions of GPTQ-for-LLaMa, including the version in text-generation-webui one-click-installers
* `latest.act-order.safetensors`

@@ -50,10 +53,13 @@ For each model, two versions are available:

I have used branches to separate the models. This means you can clone the branch you want and not get model files you don't need.

-
-
-* Branch: **
-* Branch: **
+If you have 24GB VRAM you are strongly recommended to use the file in `main`, with group_size = None. This is fully compatible, and won't OOM.
+
+* Branch: **main** = groupsize None, `OpenAssistant-SFT-7-Llama-30B-GPTQ-4bit.safetensors` file
+* Branch: **1024-compat** = groupsize 1024, `compat.no-act-order.safetensors` file
+* Branch: **1024-latest** = groupsize 1024, `latest.act-order.safetensors` file
+* Branch: **128-compat** = groupsize 128, `compat.no-act-order.safetensors` file
+* Branch: **128-latest** = groupsize 128, `latest.act-order.safetensors` file

![branches](https://i.imgur.com/PdiHnLxm.png)

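
Because each branch holds exactly one quantisation variant, you can fetch just the branch you need instead of cloning the whole repo. Below is a minimal sketch using `huggingface_hub`; the repo id is an assumption based on the model name in this card, so substitute the actual repository path:

```python
from huggingface_hub import snapshot_download

# Assumed repo id, based on the model name used in this README.
repo_id = "TheBloke/OpenAssistant-SFT-7-Llama-30B-GPTQ"

# `revision` selects the branch: "main" (groupsize None), "128-compat",
# "128-latest", "1024-compat" or "1024-latest", as listed above.
local_path = snapshot_download(repo_id=repo_id, revision="main")
print("Downloaded to:", local_path)
```

Using `revision="main"` matches the recommendation above for 24GB cards; any other branch name from the list works the same way.
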
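If you only want the quantised weights file itself, for example to drop it into an existing model directory, a single-file download is enough. This sketch uses the same assumed repo id; the filename is the one listed for the `main` branch above, and the 128g/1024g branches carry their own `compat.no-act-order` / `latest.act-order` files instead:

```python
from huggingface_hub import hf_hub_download

# Assumed repo id, as above.
repo_id = "TheBloke/OpenAssistant-SFT-7-Llama-30B-GPTQ"

# Fetch only the 4-bit safetensors file from the main branch.
weights_path = hf_hub_download(
    repo_id=repo_id,
    filename="OpenAssistant-SFT-7-Llama-30B-GPTQ-4bit.safetensors",
    revision="main",
)
print(weights_path)
```

The tokenizer and config files still need to come from the same branch, so the full-branch download above is usually the simpler route.
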
@@ -68,7 +74,7 @@ Open the text-generation-webui UI as normal.

5. Click the **Refresh** icon next to **Model** in the top left.
6. In the **Model drop-down**: choose the model you just downloaded, `OpenAssistant-SFT-7-Llama-30B-GPTQ`.
7. If you see an error in the bottom right, ignore it - it's temporary.
-8. Fill out the `GPTQ parameters` on the right: `Bits = 4`, `Groupsize =
+8. Fill out the `GPTQ parameters` on the right: `Bits = 4`, `Groupsize = None`, `model_type = Llama`
9. Click **Save settings for this model** in the top right.
10. Click **Reload the Model** in the top right.
11. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!
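
The steps above cover loading through the text-generation-webui interface with `Bits = 4`, `Groupsize = None`, `model_type = Llama`. As a rough alternative sketch that this README does not describe, the same no-groupsize, act-order file can in principle be loaded programmatically with AutoGPTQ; the repo id, the `model_basename`, and the explicit quantize config are all assumptions:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Assumed repo id and file basename (the .safetensors name minus its extension).
repo_id = "TheBloke/OpenAssistant-SFT-7-Llama-30B-GPTQ"
model_basename = "OpenAssistant-SFT-7-Llama-30B-GPTQ-4bit"

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)

# Mirror the UI settings: 4 bits, no groupsize (group_size=-1), act-order (desc_act=True).
quantize_config = BaseQuantizeConfig(bits=4, group_size=-1, desc_act=True)

model = AutoGPTQForCausalLM.from_quantized(
    repo_id,
    model_basename=model_basename,
    use_safetensors=True,
    quantize_config=quantize_config,
    device="cuda:0",
)

# Build the prompt with the template shown earlier in this README.
prompt = "..."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The `BaseQuantizeConfig(bits=4, group_size=-1, desc_act=True)` line corresponds to step 8's `Bits = 4` and `Groupsize = None`, plus the --act-order quantisation described under CHOICE OF MODELS.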