---
license: apache-2.0
base_model:
- CorticalStack/pastiche-crown-clown-7b-dare-dpo
library_name: transformers
tags:
- mergekit
- merge
- CorticalStack/pastiche-crown-clown-7b-dare-dpo

---
# Extended Model

This model is an extension of a pre-trained language model, created using [mergekit](https://github.com/arcee-ai/mergekit).

![image/webp](https://cdn.prod.arcee.ai/images/clown-image.jpeg)

# Merge Details
### Merge Method

This model uses mergekit's `passthrough` merge method to expand the blocks of [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/CorticalStack/pastiche-crown-clown-7b-dare-dpo). After every fourth layer of the base model, a copy of the preceding layer is inserted, so every fifth layer of the expanded model is new. The `o_proj` and `down_proj` projections of these added layers are scaled to zero, mirroring the block-expansion approach used in LLaMA Pro.

**Note:** this configuration has not undergone any fine-tuning. When fine-tuning, ensure that only every 5th layer (the newly added blocks) is trainable, while all other layers remain frozen.
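
The layer layout implied by the configuration below can be sketched as follows. This is an illustrative, hypothetical helper (not part of the released code) that assumes the base model has 32 layers and lists which indices of the 40-layer expanded model are the newly inserted blocks:

```python
# Illustrative sketch: map the original 32 layers to the 40-layer expanded
# model by inserting one duplicated block after every 4 layers. The duplicated
# blocks have their o_proj / down_proj outputs scaled to zero in the merge.
def expanded_layer_plan(n_original=32, block=4):
    plan, new_layers = [], []
    for start in range(0, n_original, block):
        plan.extend(range(start, start + block))  # copy 4 original layers
        plan.append(start + block - 1)            # duplicate the last one
        new_layers.append(len(plan) - 1)          # its index in the expanded model
    return plan, new_layers

plan, new_layers = expanded_layer_plan()
print(len(plan))    # 40 layers in the expanded model
print(new_layers)   # [4, 9, 14, 19, 24, 29, 34, 39] -> the blocks to fine-tune
```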


### Models Merged

The following models were included in the merge:
* [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/CorticalStack/pastiche-crown-clown-7b-dare-dpo)

## 🏆 Evaluation

### OpenLLM

OpenLLM benchmark suite results for CorticalStack/pastiche-crown-clown-7b-dare-dpo:

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---:|---:|---:|---:|---:|---:|---:|
| [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/arcee-ai/Clown-DPO-Extended/) | 76.93 | 72.18 | 88.90 | 63.45 | 79.15 | 85.71 | 72.18 |

### Configuration

The following YAML configuration was used to produce this model:

```yaml

  slices:
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 0
          - 4
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 3
          - 4
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 4
          - 8
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 7
          - 8
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 8
          - 12
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 11
          - 12
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 12
          - 16
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 15
          - 16
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 16
          - 20
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 19
          - 20
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 20
          - 24
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 23
          - 24
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 24
          - 28
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 27
          - 28
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 28
          - 32
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range:
          - 31
          - 32
        parameters:
          scale:
            - filter: o_proj
              value: 0
            - filter: down_proj
              value: 0
            - value: 1
  merge_method: passthrough
  dtype: bfloat16

```
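
For reference, a merge like this can also be run programmatically. The sketch below assumes mergekit's Python API (`MergeConfiguration`, `MergeOptions`, `run_merge`) as shown in the mergekit README, with the YAML above saved as `config.yaml`; adjust the output path and options to your environment:

```python
# Sketch only: reproduce the merge with mergekit's documented Python API.
import yaml

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

with open("config.yaml", "r", encoding="utf-8") as fp:  # the YAML shown above
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    out_path="./clown-dpo-extended",  # hypothetical output directory
    options=MergeOptions(cuda=False, copy_tokenizer=True),  # set cuda=True if a GPU is available
)
```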

# Function to freeze layers

```python
from transformers import AutoModelForCausalLM

def enable_grad_only_every_nth(model, n):
    """
    Enable gradient computation only for every nth decoder layer (indices
    n-1, 2n-1, ...), which correspond to the newly added blocks in the
    expanded model, and freeze everything else, including the token
    embeddings and the LM head. Fine-tuning then updates only the
    zero-initialized blocks while the pre-trained layers stay intact.
    """

    # Freeze the token embeddings.
    for param in model.model.embed_tokens.parameters():
        param.requires_grad = False

    # Freeze the LM head.
    for param in model.lm_head.parameters():
        param.requires_grad = False

    # Enable gradients only for every nth decoder layer.
    layers = model.model.layers  # ModuleList containing the decoder layers
    for index, layer in enumerate(layers):
        trainable = (index + 1) % n == 0  # True for layers 4, 9, 14, ... when n == 5
        for param in layer.parameters():
            param.requires_grad = trainable

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Clown-DPO-Extended"
)

# Make only every 5th layer (the newly added blocks) trainable.
n = 5
enable_grad_only_every_nth(model, n)
```
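
After freezing, it is worth double-checking which parameters will actually be updated before starting a fine-tuning run. The snippet below is a small sanity check using standard PyTorch:

```python
# Sanity check: report how many parameters remain trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.1f}%)")
```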