Mark-Arcee committed • Commit a6c74a9 • 1 Parent(s): 71bac2b
Update README.md

README.md CHANGED
@@ -18,7 +18,11 @@ This is an extension of a pre-trained language model created using [mergekit](ht

# Merge Details
### Merge Method

This merge uses mergekit's passthrough method to expand blocks within the "CorticalStack/pastiche-crown-clown-7b-dare-dpo" model: for every 5th layer, a new layer is added, with the `o_proj` and `down_proj` parameters of these added layers initialized to zero, mirroring the approach used in LLaMA Pro.

### It's important to note that this configuration has not undergone fine-tuning. When fine-tuning, ensure that only every 5th layer is trainable while all other layers remain frozen (see the freeze helper in the code block further down).
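
The zero-initialization of the added blocks can be spot-checked directly. The sketch below is an illustration rather than part of the original card: it assumes the expanded checkpoint id used in the fine-tuning example further down (`arcee-ai/Mistral-7B-Instruct-v0.2-expanded`) and assumes the inserted blocks sit at every 5th position (indices 4, 9, 14, ...), matching the freeze helper; adjust both if your layout differs.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative check (assumed model id and layer positions, see note above):
# the o_proj and down_proj weights of the newly inserted blocks should be all zeros.
model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Mistral-7B-Instruct-v0.2-expanded",
    torch_dtype=torch.bfloat16,
)

n = 5  # assumed expansion interval
for index, layer in enumerate(model.model.layers):
    if (index + 1) % n == 0:  # positions assumed to hold the new, zero-initialized blocks
        o_proj_is_zero = bool(torch.all(layer.self_attn.o_proj.weight == 0))
        down_proj_is_zero = bool(torch.all(layer.mlp.down_proj.weight == 0))
        print(f"layer {index}: o_proj zero={o_proj_is_zero}, down_proj zero={down_proj_is_zero}")
```
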
### Models Merged

@@ -181,7 +185,47 @@ The following YAML configuration was used to produce this model:

merge_method: passthrough
dtype: bfloat16
```

# Function to freeze layers

```python
from transformers import AutoModelForCausalLM


def enable_grad_only_every_nth(model, n):
    """
    Enable gradient computation only for every nth decoder layer (indices n-1, 2n-1, ...),
    i.e. the newly added blocks, and freeze all other components of the model, including
    the embedding layer and the lm_head. This keeps fine-tuning focused on the newly
    integrated layers while preserving the pre-trained behavior of the rest of the model.
    """

    # Freeze embeddings.
    for param in model.model.embed_tokens.parameters():
        param.requires_grad = False

    # Freeze lm_head.
    for param in model.lm_head.parameters():
        param.requires_grad = False

    # Enable gradients for every nth layer only.
    layers = model.model.layers  # Access the ModuleList containing the decoder layers

    for index, layer in enumerate(layers):
        if (index + 1) % n == 0:  # enables gradients for every nth layer (indices n-1, 2n-1, ...)
            for param in layer.parameters():
                param.requires_grad = True
        else:
            for param in layer.parameters():
                param.requires_grad = False


model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Mistral-7B-Instruct-v0.2-expanded"
)

# Update layer gradients; specify the correct value of n for your model's architecture (5 here).
n = 5
enable_grad_only_every_nth(model, n)
```
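
To verify that the freeze took effect, a short follow-up sketch (not part of the original card; it assumes the `model`, `n`, and `enable_grad_only_every_nth` defined in the block above) can print which decoder layers remain trainable and the resulting trainable-parameter count. For `n = 5`, the trainable indices should be 4, 9, 14, and so on.

```python
# Sanity check after enable_grad_only_every_nth(model, n): which layers are
# still trainable, and how many parameters will actually receive gradients?
trainable_layers = [
    index
    for index, layer in enumerate(model.model.layers)
    if any(param.requires_grad for param in layer.parameters())
]
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())

print(f"Trainable layer indices: {trainable_layers}")
print(f"Trainable parameters: {trainable_params:,} / {total_params:,}")
```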