MN-12B-Lyra-v1 - EXL2 8bpw max

This is an 8bpw EXL2 quant of Sao10K/MN-12B-Lyra-v1.

This quant was made using exllamav2-0.1.8 with the default dataset. I used a slightly modified quantization script to force the highest-bpw method for every layer in the model (usually "1:8b_128g s4") to ensure max quality.

I also added a small fix to the config file to set the default max context to 128k, as the original Mistral-Nemo should have.
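
A minimal sketch of what such a config fix could look like. It assumes the relevant key is `max_position_embeddings` in `config.json` and that 128k means 131072 tokens; neither detail is stated above, and the helper name `set_max_context` is hypothetical.

```python
import json
from pathlib import Path


def set_max_context(config_path, max_tokens=131072):
    """Set the context-length field of a Hugging Face style config.json.

    Assumption: the field is `max_position_embeddings`.
    Returns the previous value (or None if the key was missing).
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("max_position_embeddings")
    config["max_position_embeddings"] = max_tokens
    # Write the file back with stable formatting; all other keys are untouched.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous
```

Loaders that read the default context length from the config would then pick up the new value, though most backends also let you override the context size at load time.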

I tested this quant briefly in some random RPs (including ones over 8k context) and it seems to work fine.

Prompt Templates

Uses Mistral or ChatML format, as mentioned below.

Original readme below

Mistral-Nemo-12B-Lyra-v1

Anyway. Experimental general roleplaying model.

It works fine enough? Scored pretty high in EQ-Bench [Nemo-RP-v4 - 77.41] , right below Nemomix v4 [77.92] which was well, a big merge. Not bad.
I wanted to run the Creative Writing benchmark but it was too slow to run, for some reason.
---> EQ-Bench Scores

From my testing the regular 1.2 temp + 0.1 min_p works pretty nice. Or go lower temp, as Nemo is good at < 1 temp too.
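
For readers unfamiliar with these sampler settings, here is a rough, framework-free sketch of temperature scaling followed by min_p filtering. The function names are made up for illustration, and the order of the two steps is an assumption: backends differ on whether temperature is applied before or after the min_p cut.

```python
import math
import random


def temperature_min_p_distribution(logits, temperature=1.2, min_p=0.1):
    """Return {token_index: probability} after temperature and min_p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # min_p keeps tokens whose probability is at least min_p times the top one.
    threshold = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    kept_total = sum(kept.values())
    return {i: p / kept_total for i, p in kept.items()}


def sample_token(logits, temperature=1.2, min_p=0.1, rng=random):
    """Draw one token index from the filtered distribution."""
    dist = temperature_min_p_distribution(logits, temperature, min_p)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]
```

Higher temperature flattens the distribution, while min_p trims the unlikely tail relative to the top token, which is why the two are often paired.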

Prompting Format:

Either [INST] or ChatML works fine. # Why? Merged two differently formatted trains that had some data variation. One on Mistral Instruct, one on ChatML.
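
As a rough illustration of the two formats, the sketch below builds a prompt string in each style from a list of messages. The helper names are hypothetical, and the exact whitespace and BOS/EOS handling are assumptions (spacing around `[INST]` varies between Mistral template versions), so check the model's tokenizer config or your frontend's preset before relying on it.

```python
def build_mistral_prompt(messages):
    """Mistral Instruct style: user turns wrapped in [INST]...[/INST].

    Assumptions: compact spacing, no system role, and the BOS token is
    added by the tokenizer rather than written into the string.
    """
    parts = []
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST]{message['content']}[/INST]")
        elif message["role"] == "assistant":
            parts.append(f"{message['content']}</s>")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)


def build_chatml_prompt(messages):
    """ChatML style: each turn wrapped in <|im_start|>role ... <|im_end|>."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    # Leave an open assistant turn for the model to complete.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

For example, `build_chatml_prompt([{"role": "user", "content": "Hi"}])` yields a user turn followed by an open assistant turn.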

Details

- As I said, this was a merge of two models, of which the dataset is pretty much the same, one actually includes roleplay and creative writing, the other one does not, and is more focused on instruct and smarts.
- Model A and Model B are each trained on different formats individually.
- Tokenizer and all are taken from base Nemo 12B, so there are no token conflicts.
- A merge between these models with separated datasets seems to do better than training on the datasets mixed together. I have tried both shuffled and non-shuffled data mixes.
- Perhaps it would work for Full-Finetunes, but I am limited up to LoRAs for now.
- For merge methods, della_linear method worked best for this run specifically, according to internal self benchmarks and blind-preference tests.
- Best merge methods may be different for different model types and sizes. On a separate Llama 3 experiment, Ties-Rescaled worked best.
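
To give a feel for the general idea behind drop-and-rescale merges, here is a toy sketch on plain lists of floats. It is not the actual merge config used here and not mergekit's implementation: DELLA-style methods, as usually described, vary the drop probability with each delta's magnitude, while this simplified version uses one uniform probability. The function name and parameters are hypothetical.

```python
import random


def merge_drop_rescale_linear(base, finetunes, weights, drop_prob=0.0, seed=0):
    """Merge fine-tuned weights into a base via dropped, rescaled deltas.

    base:      list of floats (base model parameters)
    finetunes: list of lists, each the same length as base
    weights:   one linear mixing weight per fine-tune
    drop_prob: uniform chance of dropping each delta (simplification)
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    merged = list(base)
    for tuned, weight in zip(finetunes, weights):
        for i, (b, t) in enumerate(zip(base, tuned)):
            delta = t - b  # what this fine-tune changed relative to the base
            if rng.random() < drop_prob:
                continue  # drop this delta entirely
            # Rescale survivors so the expected total delta is preserved.
            merged[i] += weight * delta / (1.0 - drop_prob)
    return merged
```

With `drop_prob=0.0` this reduces to a plain weighted sum of deltas on top of the base, which is the "linear" part of the name.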

My Current Findings:
- After tinkering with Nemo, it is kind of clear to me that the base itself is unruly to train on, at least for my datasets. I'd need to SFT first, then use that as a base.
- Nemo may train well, but like Mistral it is kind of... dry. Bland even with unique, creative and varied data. It needs multi-stage fine-tuning. Llama 3 does not have this... issue?
- Nemo's effective context is unfortunately kind of a bummer? Its effective max is 16K. I have tried LoRAs with up to 64K trains on a lot of samples; they just do not work well, unlike on Yi.
- For roleplay, 16K context is plenty enough, so that is fine.

Further Iterations:

Previous version uploaded was a beta. # It had tokenizer issues lol.

This is simply v1. I have a lot more ideas to improve upon this, the data is being cooked right now. Those might come in a bit.

My upcoming plans:
- RL on a specially curated dataset, to target instruction following over multi-turn and creative writing abilities.
- Iterate upon previous versions with more varied data sources and types, on various domains, à la Nitral's Hathor work. He's a cool guy.

Have a good day.

DeusImperator/MN-12B-Lyra-v1_exl2_8bpw_max