6bpw/h6 exl2 quantization of brucethemoose/Yi-34B-200K-RPMerge using the default exllamav2 calibration dataset, sized to fully use my 31 GB of VRAM (minus about 1 GB because of Windows) with 16K-32K context.
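As a rough sanity check on that sizing, here is a back-of-the-envelope sketch in Python. It is an estimate, not a measurement: the 34e9 parameter count is the nominal "34B" figure, and the layer count, KV head count, and head dimension are assumptions recalled from the published Yi-34B config, so check them against config.json. It also ignores activation buffers and quantization overhead.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Approximate KV cache size: one key and one value tensor per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3

# Assumed values (not stated in this card): nominal 34e9 parameters,
# 60 layers, 8 KV heads, head dimension 128, FP16 cache (2 bytes per value).
weights = weight_bytes(34e9, 6.0)
for context in (16 * 1024, 32 * 1024):
    cache = kv_cache_bytes(context, layers=60, kv_heads=8,
                           head_dim=128, bytes_per_value=2)
    print(f"{context} tokens: weights {weights / GIB:.1f} GiB "
          f"+ cache {cache / GIB:.1f} GiB")
```

Passing `bytes_per_value=1` approximates an 8-bit cache, which halves the cache term.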


ORIGINAL CARD:

RPMerge

A merge of several Yi 34B models with a singular goal: 40K+ context, instruct-enhanced storytelling.

Disappointed with some quirks of my previous kitchen sink merges (like token/instruct formats from various models showing up when they shouldn't), I've gone 'back to the basics' and picked a few Vicuna-format-only models (listed under Models Merged below).

I consider this a more "focused" merge than previous ones. I will investigate other models (perhaps chatML models?) for a more "factual assistant" focused merge, as well as a coding-focused merge if I can't find one to suit my needs.

Prompt template: Orca-Vicuna

SYSTEM: {system_message}
USER: {prompt}
ASSISTANT:
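
A minimal sketch of filling in the template above from Python. The function name is hypothetical, and leaving out the SYSTEM line when no system message is given is an assumption on my part, not something the card specifies.

```python
def build_orca_vicuna_prompt(prompt: str, system_message: str = "") -> str:
    """Fill the Orca-Vicuna template: SYSTEM / USER / ASSISTANT lines."""
    lines = []
    if system_message:
        # Assumption: skip the SYSTEM line entirely when it would be empty.
        lines.append(f"SYSTEM: {system_message}")
    lines.append(f"USER: {prompt}")
    # The model continues generating after this final tag.
    lines.append("ASSISTANT:")
    return "\n".join(lines)


print(build_orca_vicuna_prompt("Continue the story.", "You are a novelist."))
```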

Raw prompting as described here is also effective: https://old.reddit.com/r/LocalLLaMA/comments/18zqy4s/the_secret_to_writing_quality_stories_with_llms/

As well as a very explicit system prompt like this: https://old.reddit.com/r/LocalLLaMA/comments/1aiz6zu/roleplaying_system_prompts/koygiwa/

Running

Chinese models with large tokenizer vocabularies like Yi need careful parameter tuning due to their huge logit sampling "tails." Yi in particular also runs relatively "hot" even at lower temperatures.

I am a huge fan of Kalomaze's quadratic sampling (shown as "smoothing factor" where available), as described here: https://github.com/oobabooga/text-generation-webui/pull/5403

Otherwise, I recommend a lower temperature with MinP of 0.1 or higher, a little repetition penalty, mirostat with a low tau, and no other samplers. See the explanation here: https://github.com/ggerganov/llama.cpp/pull/3841
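
To illustrate why MinP helps with long logit "tails", here is a minimal pure-Python sketch of the idea: drop every token whose probability is below `min_p` times the top token's probability, then renormalize. This is a toy illustration rather than any backend's actual code; the function names are hypothetical, and applying temperature before the filter is an assumption (backends differ on sampler order).

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with temperature applied first."""
    top = max(logits)
    exps = [math.exp((x - top) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p=0.1):
    """Zero out tokens below min_p * (top probability), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


# Toy logits: two plausible tokens and two "tail" tokens.
probs = softmax([5.0, 4.0, 0.0, -2.0], temperature=0.8)
print(min_p_filter(probs, min_p=0.1))
```

Because the cutoff scales with the top token's probability, the filter prunes aggressively when the model is confident and leaves more options when it is not.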

@MarinaraSpaghetti has extensively tested the model and recommended the following settings. They seem to work quite well:

{
    "temp": 1,
    "temperature_last": true,
    "top_p": 1,
    "top_k": 0,
    "top_a": 0,
    "tfs": 1,
    "epsilon_cutoff": 0,
    "eta_cutoff": 0,
    "typical_p": 0.9,
    "min_p": 0,
    "rep_pen": 1.1,
    "rep_pen_range": 19456,
    "no_repeat_ngram_size": 0,
    "penalty_alpha": 0,
    "num_beams": 1,
    "length_penalty": 0,
    "min_length": 0,
    "encoder_rep_pen": 1,
    "freq_pen": 0,
    "presence_pen": 0,
    "do_sample": true,
    "early_stopping": false,
    "dynatemp": false,
    "min_temp": 1,
    "max_temp": 2,
    "dynatemp_exponent": 1,
    "smoothing_factor": 0.33,
    "add_bos_token": false,
    "truncation_length": 2048,
    "ban_eos_token": false,
    "skip_special_tokens": true,
    "streaming": true,
    "mirostat_mode": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "guidance_scale": 1,
    "negative_prompt": "",
    "grammar_string": "",
    "banned_tokens": "",
    "ignore_eos_token_aphrodite": false,
    "spaces_between_special_tokens_aphrodite": true,
    "sampler_order": [
        6,
        0,
        1,
        3,
        4,
        2,
        5
    ],
    "logit_bias": [],
    "n": 1,
    "rep_pen_size": 0,
    "genamt": 400,
    "max_length": 38912
}

24GB GPUs can efficiently run Yi-34B-200K models at 40K-90K context with exllamav2 and performant UIs like exui. I go into more detail in this post. Empty 16GB GPUs can still run high context with aggressive quantization.

To load/train this in full-context backends like transformers, you must change max_position_embeddings in config.json to a lower value than 200,000, otherwise you will OOM! I do not recommend running high context without context-efficient backends that support flash attention + 8 bit kv cache, like exllamav2, litellm, vllm or unsloth.
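
A minimal sketch of that config.json change, using only the Python standard library. The helper name, the example path, and the 32768 value are hypothetical; pick a context length that fits your hardware.

```python
import json
from pathlib import Path


def cap_max_position_embeddings(config_path, new_max):
    """Lower max_position_embeddings in a config.json; never raise it.

    Returns the value stored in the file after the call.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    current = config.get("max_position_embeddings")
    if current is not None and current <= new_max:
        return current  # already at or below the requested cap
    config["max_position_embeddings"] = new_max
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return new_max


# Example usage (hypothetical local path):
# cap_max_position_embeddings("Yi-34B-200K-RPMerge/config.json", 32768)
```

Keep a backup of the original config.json, since the rewrite changes the file's formatting.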

Testing Notes

Thanks to ParasiticRogue for this idea of a Vicuna-only merge, see: https://huggingface.co/brucethemoose/jondurbin_bagel-dpo-34b-v0.2-exl2-4bpw-fiction/discussions

See: https://huggingface.co/brucethemoose/Yi-34B-200K-DARE-megamerge-v8#testing-notes

This is a possible base for a storytelling finetune/LASER in the future, once I can bite the bullet and rent some A100s or an MI300.

I have tested this merge with novel-style continuation (but not much chat-style roleplay), and some assistant-style responses and long context analysis. I haven't seen any refusals so far.

Merge Details

Merge Method

This model was merged using the DARE TIES merge method using /home/alpha/Models/Raw/chargoddard_Yi-34B-200K-Llama as a base.
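
For readers unfamiliar with the method, here is a heavily simplified toy sketch of the DARE TIES idea on flat lists of floats. It is not mergekit's implementation, and the function names are hypothetical. Each model's difference from the base is randomly pruned to roughly `density` of its entries and rescaled by `1 / density` (DARE), and then, per parameter, only the weighted differences that agree with the majority sign are summed (TIES-style sign election). Normalization and other options a real merge tool may apply are left out.

```python
import random


def dare_prune(delta, density, rng):
    """Keep each entry with probability `density`; rescale survivors by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


def dare_ties_merge(base, finetuned, weights, densities, seed=0):
    """Merge fine-tuned parameter lists into `base` (toy version)."""
    rng = random.Random(seed)
    # Weighted, pruned differences between each model and the base.
    deltas = []
    for model, weight, density in zip(finetuned, weights, densities):
        delta = [m - b for m, b in zip(model, base)]
        deltas.append([weight * d for d in dare_prune(delta, density, rng)])
    merged = []
    for i, b in enumerate(base):
        column = [d[i] for d in deltas]
        # Sign election: the sign of the summed differences wins.
        sign = 1.0 if sum(column) >= 0 else -1.0
        agreeing = sum(c for c in column if c * sign > 0)
        merged.append(b + agreeing)
    return merged


base = [0.0, 0.0, 0.0]
models = [[1.0, -1.0, 0.5], [1.0, 2.0, -0.5]]
print(dare_ties_merge(base, models, weights=[0.5, 0.5], densities=[0.6, 0.6]))
```

The rescaling step keeps the expected value of each pruned difference unchanged, which is why fairly low densities can still work.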

Models Merged

The following models were included in the merge:

  • /home/alpha/Models/Raw/migtissera_Tess-34B-v1.5b
  • /home/alpha/Models/Raw/migtissera_Tess-M-Creative-v1.0
  • /home/alpha/Models/Raw/cgato_Thespis-34b-DPO-v0.7
  • /home/alpha/Models/Raw/Nous-Capybara-34B
  • /home/alpha/Models/Raw/admo_limarp
  • /home/alpha/Models/Raw/DrNicefellow_ChatAllInOne-Yi-34B-200K-V1

Configuration

The following YAML configuration was used to produce this model:

models:
  - model: /home/alpha/Models/Raw/chargoddard_Yi-34B-200K-Llama
    # No parameters necessary for base model
  - model: /home/alpha/Models/Raw/migtissera_Tess-34B-v1.5b
    #Emphasize the beginning of Vicuna format models
    parameters:
      weight: 0.19
      density: 0.59
  - model: /home/alpha/Models/Raw/Nous-Capybara-34B
    parameters:
      weight: 0.19
      density: 0.55
  # Vicuna format
  - model: /home/alpha/Models/Raw/migtissera_Tess-M-Creative-v1.0
    parameters:
      weight: 0.05
      density: 0.55
  - model: /home/alpha/Models/Raw/DrNicefellow_ChatAllInOne-Yi-34B-200K-V1
    parameters:
      weight: 0.19
      density: 0.55
  - model: adamo1139/yi-34b-200k-rawrr-dpo-2+Doctor-Shotgun/limarpv3-yi-llama-34b-lora
    parameters:
      weight: 0.19
      density: 0.48
  - model: /home/alpha/Models/Raw/cgato_Thespis-34b-DPO-v0.7
    parameters:
      weight: 0.19
      density: 0.59


merge_method: dare_ties
tokenizer_source: union
base_model: /home/alpha/Models/Raw/chargoddard_Yi-34B-200K-Llama
parameters:
  int8_mask: true
dtype: bfloat16

Self Promotion

I'm part of an AI startup called Holocene AI!

We're new, busy, and still setting things up. But if you have any business inquiries, want a job, or just want some consultation, feel free to shoot me an email. We have expertise in RAG applications and llama/embeddings model finetuning, and absolutely none of the nonsense of scammy AI startups.

Contact me at: agates.holocene.ai@gmail.com

I also set up a Ko-Fi! I want to run some (personal) training/LASERing as well, at 100K context or so. If you'd like to buy me 10 minutes on an A100 (or 5 seconds on an MI300X), I'd appreciate it: https://ko-fi.com/alphaatlas
