---
license: apache-2.0
base_model:
- CorticalStack/pastiche-crown-clown-7b-dare-dpo
library_name: transformers
tags:
- mergekit
- merge
- CorticalStack/pastiche-crown-clown-7b-dare-dpo
---
# Extended Model
This is an extension of a pre-trained language model, created using [mergekit](https://github.com/arcee-ai/mergekit).
![image/webp](https://cdn.prod.arcee.ai/images/clown-image.jpeg)
# Merge Details
### Merge Method
This model expands the blocks of [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/CorticalStack/pastiche-crown-clown-7b-dare-dpo) using mergekit's passthrough method. After every block of four layers, a copy of the block's last layer is inserted, so every 5th layer of the expanded model is new; the `o_proj` and `down_proj` weights of these inserted layers are scaled to zero, mirroring the zero-initialization used in LLaMA Pro.

**Note:** this configuration has not been fine-tuned. When fine-tuning, make only every 5th layer trainable and keep all other layers frozen.
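As a sanity check, the zero-initialization can be verified by loading the expanded model and inspecting every 5th layer. A minimal sketch, assuming this card's repo id and Mistral-style module names (`self_attn.o_proj`, `mlp.down_proj`):

```python
# Sketch: verify that every 5th layer's o_proj / down_proj weights are zero.
# The repo id below is an assumption; substitute the actual checkpoint.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Clown-DPO-Extended", torch_dtype=torch.bfloat16
)

for index, layer in enumerate(model.model.layers):
    if (index + 1) % 5 == 0:  # inserted layers sit at indices 4, 9, 14, ...
        assert torch.all(layer.self_attn.o_proj.weight == 0)
        assert torch.all(layer.mlp.down_proj.weight == 0)
print("All inserted layers have zeroed o_proj/down_proj.")
```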
### Models Merged
The following models were included in the merge:
* [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/CorticalStack/pastiche-crown-clown-7b-dare-dpo)
## 🏆 Evaluation
### OpenLLM
Results for CorticalStack/pastiche-crown-clown-7b-dare-dpo on the OpenLLM benchmark suite:

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---:|---:|---:|---:|---:|---:|---:|
| [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/CorticalStack/pastiche-crown-clown-7b-dare-dpo) | 76.93 | 72.18 | 88.90 | 63.45 | 79.15 | 85.71 | 72.18 |
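These numbers can be re-run with EleutherAI's lm-evaluation-harness. A minimal sketch (assumes harness v0.4+; the official leaderboard pins specific few-shot settings and a specific harness version, so exact scores may differ):

```python
# Sketch: evaluating the OpenLLM tasks with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=CorticalStack/pastiche-crown-clown-7b-dare-dpo,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "mmlu",
           "truthfulqa_mc2", "winogrande", "gsm8k"],
    batch_size=8,
)
print(results["results"])
```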
### Configuration
The following YAML configuration was used to produce this model:
```yaml
slices:
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [0, 4]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [3, 4]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [4, 8]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [7, 8]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [8, 12]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [11, 12]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [12, 16]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [15, 16]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [16, 20]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [19, 20]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [20, 24]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [23, 24]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [24, 28]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [27, 28]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [28, 32]
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [31, 32]
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
merge_method: passthrough
dtype: bfloat16
```
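To reproduce the merge, save this configuration to a file and pass it to mergekit, either via the `mergekit-yaml` CLI or the Python API. A minimal sketch following the usage shown in the mergekit README (file and output paths are placeholders):

```python
# Sketch: running the merge with mergekit's Python API.
import torch
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# Load the YAML configuration above (path is a placeholder).
with open("expand-config.yaml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    out_path="./clown-dpo-extended",  # output directory for the merged model
    options=MergeOptions(
        cuda=torch.cuda.is_available(),
        copy_tokenizer=True,
    ),
)
```

The equivalent CLI invocation is `mergekit-yaml expand-config.yaml ./clown-dpo-extended`.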
# Function to freeze layers
```python
from transformers import AutoModelForCausalLM

def enable_grad_only_every_nth(model, n):
    """
    Enable gradient computation only for every nth decoder layer (indices
    n-1, 2n-1, ... when 0-indexed), i.e. the newly inserted blocks, while
    freezing the embeddings, the lm_head, and all other layers. This keeps
    fine-tuning focused on the new layers and preserves the pre-trained
    behavior of the rest of the model.
    """
    # Freeze the token embeddings.
    for param in model.model.embed_tokens.parameters():
        param.requires_grad = False

    # Freeze the LM head.
    for param in model.lm_head.parameters():
        param.requires_grad = False

    # Enable gradients only for every nth layer; freeze the rest.
    layers = model.model.layers  # ModuleList of decoder layers
    for index, layer in enumerate(layers):
        requires_grad = (index + 1) % n == 0
        for param in layer.parameters():
            param.requires_grad = requires_grad

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Mistral-7B-Instruct-v0.2-expanded"
)

# Every 5th layer of the expanded model is a newly inserted block, so n = 5.
n = 5
enable_grad_only_every_nth(model, n)
```
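After freezing, it is worth confirming that only the inserted layers remain trainable, for example by counting trainable parameters:

```python
# Quick check: report how many parameters remain trainable after freezing.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} "
      f"({100 * trainable / total:.1f}%)")
```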