|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- CorticalStack/pastiche-crown-clown-7b-dare-dpo |
|
library_name: transformers |
|
tags: |
|
- mergekit |
|
- merge |
|
- CorticalStack/pastiche-crown-clown-7b-dare-dpo |
|
|
|
--- |
|
# Extended Model |
|
|
|
This is an extension of a pre-trained language model, created using [mergekit](https://github.com/arcee-ai/mergekit).
|
|
|
![image/webp](https://cdn.prod.arcee.ai/images/clown-image.jpeg) |
|
|
|
# Merge Details |
|
### Merge Method |
|
|
|
This model uses mergekit's `passthrough` method to expand the blocks of [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/CorticalStack/pastiche-crown-clown-7b-dare-dpo). After every fourth original layer, a copy of that layer is inserted, so every fifth layer of the expanded model is new; the `o_proj` and `down_proj` weights of each inserted layer are initialized to zero, mirroring the block-expansion approach of LLaMA Pro.
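
The intuition behind the zero initialization is sketched below with a hypothetical toy module (the names `o_proj` and `down_proj` are chosen to match the merge config, not taken from the actual Mistral implementation): zeroing the projection that closes each residual branch turns the inserted block into an identity function, so the expanded model initially behaves exactly like the original.

```python
import torch
import torch.nn as nn


class ToyResidualBlock(nn.Module):
    """Simplified stand-in for a transformer decoder layer (hypothetical)."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.o_proj = nn.Linear(dim, dim, bias=False)     # attention output projection
        self.down_proj = nn.Linear(dim, dim, bias=False)  # MLP down projection
        # LLaMA Pro-style initialization: zero the projections that close each branch.
        nn.init.zeros_(self.o_proj.weight)
        nn.init.zeros_(self.down_proj.weight)

    def forward(self, x):
        x = x + self.o_proj(x)     # "attention" branch contributes nothing at init
        x = x + self.down_proj(x)  # "MLP" branch contributes nothing at init
        return x


x = torch.randn(2, 8)
assert torch.allclose(ToyResidualBlock()(x), x)  # the inserted layer is a no-op before fine-tuning
```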
|
|
|
**Note:** this configuration has not been fine-tuned. When fine-tuning, make only the newly inserted layers (every 5th layer of the expanded model) trainable and keep all other layers frozen; the helper function at the end of this card sets this up.
|
|
|
|
|
### Models Merged |
|
|
|
The following models were included in the merge: |
|
* [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/CorticalStack/pastiche-crown-clown-7b-dare-dpo) |
|
|
|
## 🏆 Evaluation |
|
|
|
### OpenLLM |
|
|
|
Results for CorticalStack/pastiche-crown-clown-7b-dare-dpo on the OpenLLM benchmark suite:
|
|
|
| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---:|---:|---:|---:|---:|---:|---:|
| [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/CorticalStack/pastiche-crown-clown-7b-dare-dpo) | 76.93 | 72.18 | 88.90 | 63.45 | 79.15 | 85.71 | 72.18 |
|
|
|
### Configuration |
|
|
|
The following YAML configuration was used to produce this model: |
|
|
|
```yaml |
|
|
|
slices:
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [0, 4]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [3, 4]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [4, 8]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [7, 8]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [8, 12]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [11, 12]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [12, 16]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [15, 16]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [16, 20]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [19, 20]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [20, 24]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [23, 24]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [24, 28]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [27, 28]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [28, 32]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [31, 32]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
merge_method: passthrough
dtype: bfloat16
|
|
|
``` |
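
To reproduce the merge, the config above can be passed to mergekit. Below is a minimal sketch, assuming mergekit is installed, the config is saved as `config.yaml` (a hypothetical path), and mergekit's documented Python entry points (`MergeConfiguration`, `run_merge`, `MergeOptions`); the `mergekit-yaml` CLI provides an equivalent one-liner.

```python
import yaml
import torch
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# Load the YAML config shown above (assumed saved as config.yaml).
with open("config.yaml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

# Write the expanded model to a local directory (hypothetical output path).
run_merge(
    merge_config,
    "clown-dpo-extended",
    options=MergeOptions(
        cuda=torch.cuda.is_available(),
        copy_tokenizer=True,
    ),
)
```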
|
|
|
# Function to freeze layers |
|
|
|
```python
|
from transformers import AutoModelForCausalLM


def enable_grad_only_every_nth(model, n):
    """
    Enable gradient updates only for every nth decoder layer, i.e. the layers at
    0-based indices n-1, 2n-1, ..., which are the newly inserted blocks in this
    expanded model. All other layers, the token embeddings, and the LM head are
    frozen. This mirrors the LLaMA Pro recipe: fine-tune only the expansion
    blocks while preserving the behaviour of the pre-trained components.
    """
    # Freeze embeddings.
    for param in model.model.embed_tokens.parameters():
        param.requires_grad = False

    # Freeze lm_head.
    for param in model.lm_head.parameters():
        param.requires_grad = False

    # Enable gradients only for every nth layer.
    layers = model.model.layers  # ModuleList containing the decoder layers
    for index, layer in enumerate(layers):
        requires_grad = (index + 1) % n == 0  # True only for the inserted blocks
        for param in layer.parameters():
            param.requires_grad = requires_grad


model = AutoModelForCausalLM.from_pretrained("arcee-ai/Clown-DPO-Extended")

# Every 5th layer of the expanded model is a newly inserted block.
n = 5
enable_grad_only_every_nth(model, n)
|
``` |
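
As a quick sanity check after freezing (a sketch assuming the expanded model loaded above, which has 40 decoder layers), you can confirm that only the inserted blocks remain trainable:

```python
# Indices of decoder layers that still receive gradients.
trainable_layers = [
    i for i, layer in enumerate(model.model.layers)
    if any(p.requires_grad for p in layer.parameters())
]
print(trainable_layers)  # expected: [4, 9, 14, 19, 24, 29, 34, 39]

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable_params:,} / {total_params:,}")
```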
|
|
|
|