I thought 21B is a placebo? Is it?

by UniversalLove333 - opened

Fizzarolli from the Beaver team said 21B is just extra layers and is a placebo? I'm confused.

BeaverAI org

@Fizzarolli ...?

He might be referring to https://huggingface.co/TheSkullery/NeMoria-21b which is essentially Nemo 12B with extra (empty) layers.

However, once you train on top of it (like NeMoist v1a), the layers get filled up. Whether or not that will help is TBD.

BeaverAI org
•
edited Aug 1

TheSkullery/NeMoria-21b, which BeaverAI/NeMoist-21B-v1a was trained on, can be seen as a placebo, not BeaverAI/NeMoist-21B-v1a itself.

SteelSkull took mistralai/Mistral-Nemo-Instruct-2407 and added layers to increase the size to 21B, though the added layers are "zeroed": o_proj and down_proj are set to zero on the extra layers. This hopefully gives a better starting point to train from, since normally adding extra layers decreases quality instead of increasing it (based on benchmarks; some may find it increases creativity, but that could be because the model is becoming less coherent).
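As a rough sketch of what that expansion looks like, assuming a Mistral-style model in Hugging Face transformers (the every-other-layer duplication pattern below is purely illustrative, not the actual NeMoria-21b recipe):

```python
# Rough sketch of "zeroed" layer expansion for a Mistral-style model in
# Hugging Face transformers. The every-other-layer duplication pattern is
# illustrative only, not the actual NeMoria-21b recipe.
import copy

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407", torch_dtype=torch.bfloat16
)

expanded = torch.nn.ModuleList()
for i, layer in enumerate(model.model.layers):
    expanded.append(layer)
    if i % 2 == 1:  # illustrative choice of which layers to duplicate
        dup = copy.deepcopy(layer)
        # Zero the projections that write back into the residual stream,
        # so the duplicated layer contributes nothing until it is trained.
        torch.nn.init.zeros_(dup.self_attn.o_proj.weight)
        torch.nn.init.zeros_(dup.mlp.down_proj.weight)
        expanded.append(dup)

model.model.layers = expanded
model.config.num_hidden_layers = len(expanded)

# Keep the per-layer indices consistent so KV caching still works.
for idx, layer in enumerate(expanded):
    layer.self_attn.layer_idx = idx
```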

There is likely no good reason to use TheSkullery/NeMoria-21b by itself, since it should have the same quality as mistralai/Mistral-Nemo-Instruct-2407 while using more memory and running slower. It should only really be used as a base for training, which is what is being done here.
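To spell out why the zeroed model should match the base exactly: in a pre-norm decoder layer, each sublayer writes into the residual stream through a single projection, so zeroing that projection zeroes the whole contribution:

$$
\begin{aligned}
h &\;\leftarrow\; h + W_{O}\,\mathrm{Attn}\!\left(\mathrm{norm}(h)\right) = h
&&\text{because } W_{O}=0 \text{ (o\_proj)}\\
h &\;\leftarrow\; h + W_{\mathrm{down}}\!\left(\sigma\!\left(W_{\mathrm{gate}}\,\mathrm{norm}(h)\right)\odot W_{\mathrm{up}}\,\mathrm{norm}(h)\right) = h
&&\text{because } W_{\mathrm{down}}=0 \text{ (down\_proj)}
\end{aligned}
$$

Every added layer is therefore an exact identity, and the expanded model reproduces the base model's outputs until training moves those weights.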

The method originally comes from the LLaMa-Pro paper (with the related code being here).


Charles Goddard (creator of MergeKit) then did a test using this method here, expanding Mistral-v0.1-7B to 11B, and found it to work well.

The base model for this came from a variation on Undi's Mistral 11B recipe. The o_proj and down_proj tensors were set to zero in the added layers, making the output exactly identical to Mistral 7B before training.

Benchmarks look good locally but still evaluating actual usefulness.
Update: this turned out great! 10/10 would recommend as a training approach.


After that, Elinas took a shot at it here & here, using the same method with LLaMa-3-8B to increase the size to 15B.

This is a QLoRA finetune of a merge of pre-trained language models created using mergekit.

The model is based on a "zeroed" passthrough merge of Llama-3-15B-Instruct-zeroed.

This was primarily an experiment to see how a passthrough merge will respond to further finetuning, though this was done on a small dataset.

This is a QLoRA model and all of the LoRA modules were targeted this time to ensure sufficient training before moving on to larger datasets. The first version of this model only targeted o_proj and up_proj.
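In peft terms, the difference between the two versions comes down to the target_modules list. A minimal sketch (the rank/alpha values are placeholders, not Elinas's actual settings):

```python
# Sketch of narrow vs. full LoRA targeting with the peft library.
# The r / lora_alpha values are placeholders, not the actual hyperparameters.
from peft import LoraConfig

# First version: only two projection types were adapted.
narrow = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["o_proj", "up_proj"],
    task_type="CAUSAL_LM",
)

# Later version: the usual "all linear" set for Llama-style models, so the
# zeroed o_proj/down_proj in the added layers pick up a trainable low-rank delta.
full = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```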


Elinas and SteelSkull then teamed up to do a larger training attempt here using the same "zeroed" 15B to start with.
