brucethemoose committed
Commit 36fd7b4 • 1 Parent(s): 2f3eb91
Update README.md

README.md CHANGED
@@ -17,12 +17,8 @@ tags:
 17 |   https://github.com/yule-BUAA/MergeLM
 18 |   
 19 |   https://github.com/cg123/mergekit/tree/dare'
 20 | - ***
 21 |   
 22 |   
 23 | - 24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2. I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/), and recommend exl2 quantizations on data similar to the desired task, such as these targeted at story writing: [4.0bpw](https://huggingface.co/brucethemoose/CapyTessBorosYi-34B-200K-DARE-Ties-exl2-4bpw-fiction) / [3.1bpw](https://huggingface.co/brucethemoose/CapyTessBorosYi-34B-200K-DARE-Ties-exl2-3.1bpw-fiction)
 24 | - ***
 25 | - 
 26 |   Merged with the following config, and the tokenizer from chargoddard's Yi-Llama:
 27 |   ```
 28 |   models:
@@ -66,13 +62,7 @@ Being a Yi model, try disabling the BOS token and/or running a lower temperature
 66 |   Sometimes the model "spells out" the stop token as `</s>` like Capybara, so you may need to add `</s>` as an additional stopping condition. It also might respond to the llama-2 chat format.
 67 |   
 68 |   ***
 69 | - 
 70 | - I run Yi models in exui for maximum context size on 24GB GPUs. You can fit about 47K context on an empty GPU at 4bpw, and exui's speed really helps at high context:
 71 | - 
 72 | - https://github.com/turboderp/exui
 73 | - 
 74 | - https://huggingface.co/brucethemoose/CapyTessBorosYi-34B-200K-DARE-Ties-exl2-4bpw-fiction
 75 | - 
 65 | + 24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2. I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/), and recommend exl2 quantizations on data similar to the desired task, such as these targeted at story writing: [4.0bpw](https://huggingface.co/brucethemoose/CapyTessBorosYi-34B-200K-DARE-Ties-exl2-4bpw-fiction) / [3.1bpw](https://huggingface.co/brucethemoose/CapyTessBorosYi-34B-200K-DARE-Ties-exl2-3.1bpw-fiction)
 76 |   ***
 77 |   
 78 |   Credits:
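For readers who want to try the high-context setup described in the added line above, here is a minimal, hedged sketch of loading one of the linked exl2 quants with exllamav2's Python API. The class and method names reflect the exllamav2 library around the time of this commit; the local model path, the exact context length, and the 8-bit cache choice are illustrative assumptions rather than settings taken from this README.

```python
# Hedged sketch: load an exl2 quant of this model at long context with exllamav2.
# Assumptions: the linked 4.0bpw quant has been downloaded to ./CapyTessBoros-exl2-4bpw,
# and roughly 47K context fits a 24GB card when the KV cache is held in 8 bits.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "./CapyTessBoros-exl2-4bpw"   # local copy of the linked exl2 quant
config.prepare()                                 # reads the quant's config.json
config.max_seq_len = 47104                       # ~47K tokens; trim if you run out of VRAM

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)    # 8-bit cache stretches context on 24GB cards
model.load_autosplit(cache)                      # fills available GPU memory automatically

tokenizer = ExLlamaV2Tokenizer(config)
```

From here a generator can be attached; the sketch below shows one way to wire in the sampling and stop-token advice from the second hunk.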
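And a similarly hedged sketch of the sampling advice in that hunk: disable the BOS token, run a lower temperature, and treat a spelled-out `</s>` as an extra stop condition. It continues from the objects created above; the exact temperature, the prompt format, and the generator calls are assumptions about exllamav2's streaming API around this commit's date, not values from this model card.

```python
# Hedged sketch: apply the README's sampling/stop-token advice with exllamav2's
# streaming generator. Continues from model, cache and tokenizer created above.
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7                       # "a lower temperature" -- exact value is a guess
settings.top_p = 0.9

# Stop on the real EOS token *and* on the literal "</s>" string the model sometimes spells out.
generator.set_stop_conditions([tokenizer.eos_token_id, "</s>"])

prompt = "USER: Write the opening of a short story.\nASSISTANT:"   # placeholder prompt format
input_ids = tokenizer.encode(prompt, add_bos=False)                # BOS disabled, per the advice above

generator.begin_stream(input_ids, settings)
text = ""
while True:
    chunk, eos, _ = generator.stream()
    text += chunk
    if eos:
        break
print(text)
```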