---
license: apache-2.0
base_model:
- CorticalStack/pastiche-crown-clown-7b-dare-dpo
library_name: transformers
tags:
- mergekit
- merge
- CorticalStack/pastiche-crown-clown-7b-dare-dpo
---
# Extended Model
This is an extension of a pre-trained language model created using [mergekit](https://github.com/arcee-ai/mergekit).
![image/webp](https://cdn.prod.arcee.ai/images/clown-image.jpeg)
# Merge Details
### Merge Method
This model uses mergekit's passthrough method to expand the blocks of [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/CorticalStack/pastiche-crown-clown-7b-dare-dpo): after every block of four original layers, a copy of the block's last layer is inserted, so every 5th layer of the expanded model is new. The `o_proj` and `down_proj` parameters of each inserted layer are initialized to zero, mirroring the block-expansion approach of LLaMA Pro; as the sketch below illustrates, each inserted block therefore starts out as an identity function.
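Both sub-layers of a decoder block write back into the residual stream through exactly these two projections, `o_proj` (attention) and `down_proj` (MLP), so zeroing them zeroes the block's entire contribution. A minimal, self-contained sketch of that residual arithmetic (plain PyTorch with simplified shapes, not the actual model code):

```python
import torch
import torch.nn as nn

hidden = 16
h = torch.randn(1, hidden)  # residual stream entering the inserted block

# The two write-back projections, zero-initialized as in LLaMA Pro
# (shapes simplified; in the real model down_proj maps from the MLP's
# intermediate size back to the hidden size).
o_proj = nn.Linear(hidden, hidden, bias=False)
down_proj = nn.Linear(hidden, hidden, bias=False)
nn.init.zeros_(o_proj.weight)
nn.init.zeros_(down_proj.weight)

attn_out = torch.randn(1, hidden)  # stand-in for attention(norm(h))
mlp_mid = torch.randn(1, hidden)   # stand-in for act(up_proj(norm(h)))

out = h + o_proj(attn_out)      # attention sub-layer contributes zero
out = out + down_proj(mlp_mid)  # MLP sub-layer contributes zero
assert torch.equal(out, h)      # the new block starts as an identity
```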
**Note:** this configuration has not been fine-tuned. When fine-tuning, make sure that only every 5th layer (the newly inserted blocks) is trainable while all other layers remain frozen; a helper function that sets this up is provided at the end of this card.
### Models Merged
The following models were included in the merge:
* [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/CorticalStack/pastiche-crown-clown-7b-dare-dpo)
## 🏆 Evaluation
### OpenLLM
Scores for the base model, [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/CorticalStack/pastiche-crown-clown-7b-dare-dpo), on the OpenLLM benchmark suite. (Because the inserted layers are zero-initialized identity blocks, the unfinetuned expanded model computes the same function as the base model.)

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---:|---:|---:|---:|---:|---:|---:|
| [CorticalStack/pastiche-crown-clown-7b-dare-dpo](https://huggingface.co/CorticalStack/pastiche-crown-clown-7b-dare-dpo) | 76.93 | 72.18 | 88.90 | 63.45 | 79.15 | 85.71 | 72.18 |
### Configuration
The following YAML configuration was used to produce this model:
```yaml
slices:
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 0
- 4
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 3
- 4
parameters:
scale:
- filter: o_proj
value: 0
- filter: down_proj
value: 0
- value: 1
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 4
- 8
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 7
- 8
parameters:
scale:
- filter: o_proj
value: 0
- filter: down_proj
value: 0
- value: 1
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 8
- 12
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 11
- 12
parameters:
scale:
- filter: o_proj
value: 0
- filter: down_proj
value: 0
- value: 1
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 12
- 16
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 15
- 16
parameters:
scale:
- filter: o_proj
value: 0
- filter: down_proj
value: 0
- value: 1
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 16
- 20
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 19
- 20
parameters:
scale:
- filter: o_proj
value: 0
- filter: down_proj
value: 0
- value: 1
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 20
- 24
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 23
- 24
parameters:
scale:
- filter: o_proj
value: 0
- filter: down_proj
value: 0
- value: 1
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 24
- 28
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 27
- 28
parameters:
scale:
- filter: o_proj
value: 0
- filter: down_proj
value: 0
- value: 1
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 28
- 32
- sources:
- model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
layer_range:
- 31
- 32
parameters:
scale:
- filter: o_proj
value: 0
- filter: down_proj
value: 0
- value: 1
merge_method: passthrough
dtype: bfloat16
```
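The merge itself can be reproduced with mergekit's `mergekit-yaml` CLI (e.g. `mergekit-yaml config.yaml ./merged`). As a sanity check after merging (a minimal sketch; the attribute names assume the Mistral architecture used by the base model), the inserted layers' write-back projections should be exactly zero:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Clown-DPO-Extended", torch_dtype=torch.bfloat16
)

# Every 5th layer (indices 4, 9, ..., 39) is a duplicated block whose
# o_proj and down_proj were zeroed by the `scale` filters in the config.
for index, layer in enumerate(model.model.layers):
    if (index + 1) % 5 == 0:
        assert torch.all(layer.self_attn.o_proj.weight == 0)
        assert torch.all(layer.mlp.down_proj.weight == 0)
```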
# Function to freeze layers
```python
from transformers import AutoModelForCausalLM


def enable_grad_only_every_nth(model, n):
    """
    Enable gradient computation only for every nth decoder layer (indices
    n-1, 2n-1, ...), i.e. the newly inserted blocks, and freeze everything
    else, including the embeddings and the LM head. This restricts
    fine-tuning to the added layers while keeping the pre-trained weights
    of the rest of the model fixed.
    """
    # Freeze the input embeddings.
    for param in model.model.embed_tokens.parameters():
        param.requires_grad = False

    # Freeze the LM head.
    for param in model.lm_head.parameters():
        param.requires_grad = False

    # Unfreeze every nth layer and freeze all others.
    for index, layer in enumerate(model.model.layers):
        requires_grad = (index + 1) % n == 0  # True at indices 4, 9, 14, ... for n == 5
        for param in layer.parameters():
            param.requires_grad = requires_grad


model = AutoModelForCausalLM.from_pretrained("arcee-ai/Clown-DPO-Extended")

# A new layer was inserted after every four original layers, so every
# 5th layer of the expanded model is new: use n = 5.
n = 5
enable_grad_only_every_nth(model, n)
```
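With the model and helper from the block above, a quick check that freezing behaved as intended (assuming the 40-layer expanded model): the trainable layer indices should be exactly the inserted ones.

```python
trainable = [
    index
    for index, layer in enumerate(model.model.layers)
    if all(param.requires_grad for param in layer.parameters())
]
print(trainable)  # expected: [4, 9, 14, 19, 24, 29, 34, 39]
```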