brucethemoose committed
Commit 36fd7b4 • 1 Parent(s): 2f3eb91
Update README.md

README.md CHANGED
@@ -17,12 +17,8 @@ tags:
 17 |   https://github.com/yule-BUAA/MergeLM
 18 |   
 19 |   https://github.com/cg123/mergekit/tree/dare'
 20 | - ***
 21 |   
 22 |   
 23 | - 24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2. I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/), and recommend exl2 quantizations on data similar to the desired task, such as these targeted at story writing: [4.0bpw](https://huggingface.co/brucethemoose/CapyTessBorosYi-34B-200K-DARE-Ties-exl2-4bpw-fiction) / [3.1bpw](https://huggingface.co/brucethemoose/CapyTessBorosYi-34B-200K-DARE-Ties-exl2-3.1bpw-fiction)
 24 | - ***
 25 | - 
 26 |   Merged with the following config, and the tokenizer from chargoddard's Yi-Llama:
 27 |   ```
 28 |   models:
@@ -66,13 +62,7 @@ Being a Yi model, try disabling the BOS token and/or running a lower temperature
 66 |   Sometimes the model "spells out" the stop token as `</s>` like Capybara, so you may need to add `</s>` as an additional stopping condition. It also might respond to the llama-2 chat format.
 67 |   
 68 |   ***
 69 | - 
 70 | - I run Yi models in exui for maximum context size on 24GB GPUs. You can fit about 47K context on an empty GPU at 4bpw, and exui's speed really helps at high context:
 71 | - 
 72 | - https://github.com/turboderp/exui
 73 | - 
 74 | - https://huggingface.co/brucethemoose/CapyTessBorosYi-34B-200K-DARE-Ties-exl2-4bpw-fiction
 75 | - 
 65 | + 24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2. I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/), and recommend exl2 quantizations on data similar to the desired task, such as these targeted at story writing: [4.0bpw](https://huggingface.co/brucethemoose/CapyTessBorosYi-34B-200K-DARE-Ties-exl2-4bpw-fiction) / [3.1bpw](https://huggingface.co/brucethemoose/CapyTessBorosYi-34B-200K-DARE-Ties-exl2-3.1bpw-fiction)
 76 |   ***
 77 |   
 78 |   Credits:
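For readers who want to try the high-context setup described in the added line above, here is a minimal, hedged sketch of loading one of the linked exl2 quants with exllamav2's Python API. The class and method names reflect the exllamav2 library around the time of this commit; the local model path, the exact context length, and the 8-bit cache choice are illustrative assumptions rather than settings taken from this README.

```python
# Hedged sketch: load an exl2 quant of this model at long context with exllamav2.
# Assumptions: the linked 4.0bpw quant has been downloaded to ./CapyTessBoros-exl2-4bpw,
# and roughly 47K context fits a 24GB card when the KV cache is held in 8 bits.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "./CapyTessBoros-exl2-4bpw"   # local copy of the linked exl2 quant
config.prepare()                                 # reads the quant's config.json
config.max_seq_len = 47104                       # ~47K tokens; trim if you run out of VRAM

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)    # 8-bit cache stretches context on 24GB cards
model.load_autosplit(cache)                      # fills available GPU memory automatically

tokenizer = ExLlamaV2Tokenizer(config)
```

From here a generator can be attached; the sketch below shows one way to wire in the sampling and stop-token advice from the second hunk.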
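And a similarly hedged sketch of the sampling advice in that hunk: disable the BOS token, run a lower temperature, and treat a spelled-out `</s>` as an extra stop condition. It continues from the objects created above; the exact temperature, the prompt format, and the generator calls are assumptions about exllamav2's streaming API around this commit's date, not values from this model card.

```python
# Hedged sketch: apply the README's sampling/stop-token advice with exllamav2's
# streaming generator. Continues from model, cache and tokenizer created above.
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7                       # "a lower temperature" -- exact value is a guess
settings.top_p = 0.9

# Stop on the real EOS token *and* on the literal "</s>" string the model sometimes spells out.
generator.set_stop_conditions([tokenizer.eos_token_id, "</s>"])

prompt = "USER: Write the opening of a short story.\nASSISTANT:"   # placeholder prompt format
input_ids = tokenizer.encode(prompt, add_bos=False)                # BOS disabled, per the advice above

generator.begin_stream(input_ids, settings)
text = ""
while True:
    chunk, eos, _ = generator.stream()
    text += chunk
    if eos:
        break
print(text)
```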