Update README.md

23a2710 verified 3 months ago

16.6 kB

	---
	license: llama3.1
	base_model: meta-llama/Meta-Llama-3.1-8B
	tags:
	- generated_from_trainer
	model-index:
	- name: workspace/axolotl/dolphin-2.9.4-llama3.1-8b
	results: []
	---

	# warning is not working yet, recommend hold off on downloading

	<details><summary>Evals</summary>

	```
	hf (pretrained=/workspace/axolotl/dolphin-2.9.4-llama3.1-8b-hf,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (4)
	\| Tasks \|Version\|Filter\|n-shot\| Metric \| \|Value \| \|Stderr\|
	\|-----------------------------------------------------------\|-------\|------\|-----:\|-----------------------\|---\|-----:\|---\|------\|
	\|leaderboard \|N/A \|none \| 0\|acc \|↑ \|0.2926\|± \|0.0041\|
	\| \| \|none \| 0\|acc_norm \|↑ \|0.4513\|± \|0.0053\|
	\| \| \|none \| 0\|exact_match \|↑ \|0.0982\|± \|0.0079\|
	\| \| \|none \| 0\|inst_level_loose_acc \|↑ \|0.3825\|± \|N/A \|
	\| \| \|none \| 0\|inst_level_strict_acc \|↑ \|0.3597\|± \|N/A \|
	\| \| \|none \| 0\|prompt_level_loose_acc \|↑ \|0.2421\|± \|0.0184\|
	\| \| \|none \| 0\|prompt_level_strict_acc\|↑ \|0.2181\|± \|0.0178\|
	\| - leaderboard_bbh \|N/A \|none \| 3\|acc_norm \|↑ \|0.4931\|± \|0.0061\|
	\| - leaderboard_bbh_boolean_expressions \| 0\|none \| 3\|acc_norm \|↑ \|0.8000\|± \|0.0253\|
	\| - leaderboard_bbh_causal_judgement \| 0\|none \| 3\|acc_norm \|↑ \|0.5615\|± \|0.0364\|
	\| - leaderboard_bbh_date_understanding \| 0\|none \| 3\|acc_norm \|↑ \|0.4520\|± \|0.0315\|
	\| - leaderboard_bbh_disambiguation_qa \| 0\|none \| 3\|acc_norm \|↑ \|0.6640\|± \|0.0299\|
	\| - leaderboard_bbh_formal_fallacies \| 0\|none \| 3\|acc_norm \|↑ \|0.5600\|± \|0.0315\|
	\| - leaderboard_bbh_geometric_shapes \| 0\|none \| 3\|acc_norm \|↑ \|0.3640\|± \|0.0305\|
	\| - leaderboard_bbh_hyperbaton \| 0\|none \| 3\|acc_norm \|↑ \|0.6320\|± \|0.0306\|
	\| - leaderboard_bbh_logical_deduction_five_objects \| 0\|none \| 3\|acc_norm \|↑ \|0.4600\|± \|0.0316\|
	\| - leaderboard_bbh_logical_deduction_seven_objects \| 0\|none \| 3\|acc_norm \|↑ \|0.4360\|± \|0.0314\|
	\| - leaderboard_bbh_logical_deduction_three_objects \| 0\|none \| 3\|acc_norm \|↑ \|0.6160\|± \|0.0308\|
	\| - leaderboard_bbh_movie_recommendation \| 0\|none \| 3\|acc_norm \|↑ \|0.7880\|± \|0.0259\|
	\| - leaderboard_bbh_navigate \| 0\|none \| 3\|acc_norm \|↑ \|0.5200\|± \|0.0317\|
	\| - leaderboard_bbh_object_counting \| 0\|none \| 3\|acc_norm \|↑ \|0.4520\|± \|0.0315\|
	\| - leaderboard_bbh_penguins_in_a_table \| 0\|none \| 3\|acc_norm \|↑ \|0.5205\|± \|0.0415\|
	\| - leaderboard_bbh_reasoning_about_colored_objects \| 0\|none \| 3\|acc_norm \|↑ \|0.5120\|± \|0.0317\|
	\| - leaderboard_bbh_ruin_names \| 0\|none \| 3\|acc_norm \|↑ \|0.6320\|± \|0.0306\|
	\| - leaderboard_bbh_salient_translation_error_detection \| 0\|none \| 3\|acc_norm \|↑ \|0.4320\|± \|0.0314\|
	\| - leaderboard_bbh_snarks \| 0\|none \| 3\|acc_norm \|↑ \|0.5843\|± \|0.0370\|
	\| - leaderboard_bbh_sports_understanding \| 0\|none \| 3\|acc_norm \|↑ \|0.7040\|± \|0.0289\|
	\| - leaderboard_bbh_temporal_sequences \| 0\|none \| 3\|acc_norm \|↑ \|0.1440\|± \|0.0222\|
	\| - leaderboard_bbh_tracking_shuffled_objects_five_objects \| 0\|none \| 3\|acc_norm \|↑ \|0.1560\|± \|0.0230\|
	\| - leaderboard_bbh_tracking_shuffled_objects_seven_objects\| 0\|none \| 3\|acc_norm \|↑ \|0.1320\|± \|0.0215\|
	\| - leaderboard_bbh_tracking_shuffled_objects_three_objects\| 0\|none \| 3\|acc_norm \|↑ \|0.2840\|± \|0.0286\|
	\| - leaderboard_bbh_web_of_lies \| 0\|none \| 3\|acc_norm \|↑ \|0.4840\|± \|0.0317\|
	\| - leaderboard_gpqa \|N/A \|none \| 0\|acc_norm \|↑ \|0.2903\|± \|0.0132\|
	\| - leaderboard_gpqa_diamond \| 1\|none \| 0\|acc_norm \|↑ \|0.2980\|± \|0.0326\|
	\| - leaderboard_gpqa_extended \| 1\|none \| 0\|acc_norm \|↑ \|0.2839\|± \|0.0193\|
	\| - leaderboard_gpqa_main \| 1\|none \| 0\|acc_norm \|↑ \|0.2946\|± \|0.0216\|
	\| - leaderboard_ifeval \| 2\|none \| 0\|inst_level_loose_acc \|↑ \|0.3825\|± \|N/A \|
	\| \| \|none \| 0\|inst_level_strict_acc \|↑ \|0.3597\|± \|N/A \|
	\| \| \|none \| 0\|prompt_level_loose_acc \|↑ \|0.2421\|± \|0.0184\|
	\| \| \|none \| 0\|prompt_level_strict_acc\|↑ \|0.2181\|± \|0.0178\|
	\| - leaderboard_math_algebra_hard \| 1\|none \| 4\|exact_match \|↑ \|0.1596\|± \|0.0209\|
	\| - leaderboard_math_counting_and_prob_hard \| 1\|none \| 4\|exact_match \|↑ \|0.0488\|± \|0.0195\|
	\| - leaderboard_math_geometry_hard \| 1\|none \| 4\|exact_match \|↑ \|0.0530\|± \|0.0196\|
	\| - leaderboard_math_hard \|N/A \|none \| 4\|exact_match \|↑ \|0.0982\|± \|0.0079\|
	\| - leaderboard_math_intermediate_algebra_hard \| 1\|none \| 4\|exact_match \|↑ \|0.0143\|± \|0.0071\|
	\| - leaderboard_math_num_theory_hard \| 1\|none \| 4\|exact_match \|↑ \|0.0455\|± \|0.0168\|
	\| - leaderboard_math_prealgebra_hard \| 1\|none \| 4\|exact_match \|↑ \|0.2591\|± \|0.0316\|
	\| - leaderboard_math_precalculus_hard \| 1\|none \| 4\|exact_match \|↑ \|0.0519\|± \|0.0192\|
	\| - leaderboard_mmlu_pro \| 0.1\|none \| 5\|acc \|↑ \|0.2926\|± \|0.0041\|
	\| - leaderboard_musr \|N/A \|none \| 0\|acc_norm \|↑ \|0.3862\|± \|0.0173\|
	\| - leaderboard_musr_murder_mysteries \| 1\|none \| 0\|acc_norm \|↑ \|0.5280\|± \|0.0316\|
	\| - leaderboard_musr_object_placements \| 1\|none \| 0\|acc_norm \|↑ \|0.3594\|± \|0.0300\|
	\| - leaderboard_musr_team_allocation \| 1\|none \| 0\|acc_norm \|↑ \|0.2720\|± \|0.0282\|

	\| Groups \|Version\|Filter\|n-shot\| Metric \| \|Value \| \|Stderr\|
	\|------------------------\|-------\|------\|-----:\|-----------------------\|---\|-----:\|---\|------\|
	\|leaderboard \|N/A \|none \| 0\|acc \|↑ \|0.2926\|± \|0.0041\|
	\| \| \|none \| 0\|acc_norm \|↑ \|0.4513\|± \|0.0053\|
	\| \| \|none \| 0\|exact_match \|↑ \|0.0982\|± \|0.0079\|
	\| \| \|none \| 0\|inst_level_loose_acc \|↑ \|0.3825\|± \|N/A \|
	\| \| \|none \| 0\|inst_level_strict_acc \|↑ \|0.3597\|± \|N/A \|
	\| \| \|none \| 0\|prompt_level_loose_acc \|↑ \|0.2421\|± \|0.0184\|
	\| \| \|none \| 0\|prompt_level_strict_acc\|↑ \|0.2181\|± \|0.0178\|
	\| - leaderboard_bbh \|N/A \|none \| 3\|acc_norm \|↑ \|0.4931\|± \|0.0061\|
	\| - leaderboard_gpqa \|N/A \|none \| 0\|acc_norm \|↑ \|0.2903\|± \|0.0132\|
	\| - leaderboard_math_hard\|N/A \|none \| 4\|exact_match \|↑ \|0.0982\|± \|0.0079\|
	\| - leaderboard_musr \|N/A \|none \| 0\|acc_norm \|↑ \|0.3862\|± \|0.0173\|
	```

	</details>

	[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
	<details><summary>See axolotl config</summary>

	axolotl version: `0.4.1`
	```yaml
	base_model: meta-llama/Meta-Llama-3.1-8B
	model_type: LlamaForCausalLM
	tokenizer_type: AutoTokenizer

	load_in_8bit: false
	# load_in_4bit: true
	strict: false

	datasets:
	- path: /workspace/datasets/dolphin-2.9.4/dolphin201-sharegpt2.jsonl
	type: sharegpt
	conversation: chatml

	chat_template: chatml
	# adapter: qlora
	# lora_r: 128
	# lora_alpha: 16
	# lora_modules_to_save: [embed_tokens, lm_head]
	# lora_dropout: 0.05
	# lora_target_linear: true

	unfrozen_parameters:
	- input_layernorm
	- model.norm
	- post_attention_layernorm
	- self_attn.rotary_emb
	- ^lm_head.weight$
	- ^model.embed_tokens.weight$
	# mlp.down_proj layers
	- model.layers.1.mlp.down_proj
	- model.layers.0.mlp.down_proj
	- model.layers.30.mlp.down_proj
	- model.layers.2.mlp.down_proj
	- model.layers.21.mlp.down_proj
	- model.layers.22.mlp.down_proj
	- model.layers.29.mlp.down_proj
	- model.layers.5.mlp.down_proj
	- model.layers.4.mlp.down_proj
	- model.layers.20.mlp.down_proj
	- model.layers.23.mlp.down_proj
	- model.layers.19.mlp.down_proj
	- model.layers.3.mlp.down_proj
	- model.layers.17.mlp.down_proj
	- model.layers.6.mlp.down_proj
	- model.layers.31.mlp.down_proj
	# mlp.up_proj layers
	- model.layers.4.mlp.up_proj
	- model.layers.3.mlp.up_proj
	- model.layers.0.mlp.up_proj
	- model.layers.5.mlp.up_proj
	- model.layers.7.mlp.up_proj
	- model.layers.6.mlp.up_proj
	- model.layers.2.mlp.up_proj
	- model.layers.1.mlp.up_proj
	- model.layers.8.mlp.up_proj
	- model.layers.12.mlp.up_proj
	- model.layers.14.mlp.up_proj
	- model.layers.9.mlp.up_proj
	- model.layers.15.mlp.up_proj
	- model.layers.17.mlp.up_proj
	- model.layers.13.mlp.up_proj
	- model.layers.19.mlp.up_proj
	# self_attn.k_proj layers
	- model.layers.29.self_attn.k_proj
	- model.layers.25.self_attn.k_proj
	- model.layers.23.self_attn.k_proj
	- model.layers.28.self_attn.k_proj
	- model.layers.21.self_attn.k_proj
	- model.layers.19.self_attn.k_proj
	- model.layers.22.self_attn.k_proj
	- model.layers.20.self_attn.k_proj
	- model.layers.24.self_attn.k_proj
	- model.layers.31.self_attn.k_proj
	- model.layers.27.self_attn.k_proj
	- model.layers.26.self_attn.k_proj
	- model.layers.17.self_attn.k_proj
	- model.layers.11.self_attn.k_proj
	- model.layers.18.self_attn.k_proj
	- model.layers.14.self_attn.k_proj
	# self_attn.o_proj layers
	- model.layers.14.self_attn.o_proj
	- model.layers.7.self_attn.o_proj
	- model.layers.5.self_attn.o_proj
	- model.layers.11.self_attn.o_proj
	- model.layers.6.self_attn.o_proj
	- model.layers.24.self_attn.o_proj
	- model.layers.9.self_attn.o_proj
	- model.layers.13.self_attn.o_proj
	- model.layers.10.self_attn.o_proj
	- model.layers.12.self_attn.o_proj
	- model.layers.8.self_attn.o_proj
	- model.layers.25.self_attn.o_proj
	- model.layers.21.self_attn.o_proj
	- model.layers.23.self_attn.o_proj
	- model.layers.15.self_attn.o_proj
	- model.layers.16.self_attn.o_proj
	# self_attn.q_proj layers
	- model.layers.8.self_attn.q_proj
	- model.layers.13.self_attn.q_proj
	- model.layers.9.self_attn.q_proj
	- model.layers.14.self_attn.q_proj
	- model.layers.10.self_attn.q_proj
	- model.layers.11.self_attn.q_proj
	- model.layers.0.self_attn.q_proj
	- model.layers.15.self_attn.q_proj
	- model.layers.1.self_attn.q_proj
	- model.layers.6.self_attn.q_proj
	- model.layers.5.self_attn.q_proj
	- model.layers.7.self_attn.q_proj
	- model.layers.12.self_attn.q_proj
	- model.layers.16.self_attn.q_proj
	- model.layers.17.self_attn.q_proj
	- model.layers.26.self_attn.q_proj
	# self_attn.v_proj layers
	- model.layers.26.self_attn.v_proj
	- model.layers.17.self_attn.v_proj
	- model.layers.3.self_attn.v_proj
	- model.layers.28.self_attn.v_proj
	- model.layers.29.self_attn.v_proj
	- model.layers.21.self_attn.v_proj
	- model.layers.15.self_attn.v_proj
	- model.layers.16.self_attn.v_proj
	- model.layers.20.self_attn.v_proj
	- model.layers.25.self_attn.v_proj
	- model.layers.6.self_attn.v_proj
	- model.layers.23.self_attn.v_proj
	- model.layers.4.self_attn.v_proj
	- model.layers.1.self_attn.v_proj
	- model.layers.22.self_attn.v_proj
	- model.layers.14.self_attn.v_proj
	# mlp.gate_proj layers
	- model.layers.1.mlp.gate_proj
	- model.layers.2.mlp.gate_proj
	- model.layers.3.mlp.gate_proj
	- model.layers.4.mlp.gate_proj
	- model.layers.0.mlp.gate_proj
	- model.layers.25.mlp.gate_proj
	- model.layers.26.mlp.gate_proj
	- model.layers.5.mlp.gate_proj
	- model.layers.24.mlp.gate_proj
	- model.layers.28.mlp.gate_proj
	- model.layers.23.mlp.gate_proj
	- model.layers.27.mlp.gate_proj
	- model.layers.21.mlp.gate_proj
	- model.layers.22.mlp.gate_proj
	- model.layers.29.mlp.gate_proj
	- model.layers.20.mlp.gate_proj




	dataset_prepared_path: /workspace/axolotl/dolph-2.9.4-nemo-prepared
	val_set_size: 0.01
	output_dir: /workspace/axolotl/dolphin-2.9.4-llama3.1-8b

	sequence_len: 8192
	sample_packing: true
	pad_to_sequence_len: true

	wandb_project: dolphin-2.9.4-llama3.1-8b
	wandb_watch:
	wandb_run_id:
	wandb_log_model:

	gradient_accumulation_steps: 16
	micro_batch_size: 2
	num_epochs: 3
	optimizer: adamw_torch
	lr_scheduler: cosine
	learning_rate: 5e-6
	train_on_inputs: false
	group_by_length: false
	bf16: auto
	fp16:
	tf32:

	gradient_checkpointing: true
	gradient_checkpointing_kwargs:
	use_reentrant: false
	early_stopping_patience:
	resume_from_checkpoint:
	logging_steps: 1
	xformers_attention:
	flash_attention: true

	warmup_steps: 100
	# evals_per_epoch: 4
	eval_table_size:
	saves_per_epoch: 1
	save_total_limit: 2
	save_steps:
	debug:
	deepspeed: deepspeed_configs/zero3_bf16.json
	weight_decay: 0.1
	special_tokens:
	eos_token: "<\|im_end\|>"
	bos_token: "<\|begin_of_text\|>"
	pad_token: "<\|finetune_right_pad_id\|>"
	tokens:
	- "<\|im_start\|>"


	# fsdp:
	# - full_shard
	# - auto_wrap
	# fsdp_config:
	# fsdp_limit_all_gathers: true
	# fsdp_sync_module_states: true
	# fsdp_offload_params: true
	# fsdp_use_orig_params: false
	# fsdp_cpu_ram_efficient_loading: true
	# fsdp_transformer_layer_cls_to_wrap: MixtralSparseMoeBlock
	# fsdp_state_dict_type: FULL_STATE_DICT
	# fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
	# fsdp_sharding_strategy: FULL_SHARD
	# fsdp_forward_prefetch: false
	# fsdp_backward_prefetch: BACKWARD_PRE
	```

	</details><br>

	# workspace/axolotl/dolphin-2.9.4-llama3.1-8b

	This model is a fine-tuned version of [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) on the None dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.5655

	## Model description

	More information needed

	## Intended uses & limitations

	More information needed

	## Training and evaluation data

	More information needed

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-06
	- train_batch_size: 2
	- eval_batch_size: 2
	- seed: 42
	- distributed_type: multi-GPU
	- num_devices: 8
	- gradient_accumulation_steps: 16
	- total_train_batch_size: 256
	- total_eval_batch_size: 16
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_steps: 100
	- num_epochs: 3

	### Training results

	\| Training Loss \| Epoch \| Step \| Validation Loss \|
	\|:-------------:\|:------:\|:----:\|:---------------:\|
	\| 0.5837 \| 1.0180 \| 1161 \| 0.5814 \|
	\| 0.5525 \| 2.0179 \| 2322 \| 0.5671 \|
	\| 0.5514 \| 2.9624 \| 3420 \| 0.5655 \|


	### Framework versions

	- Transformers 4.44.0.dev0
	- Pytorch 2.4.0+cu121
	- Datasets 2.19.1
	- Tokenizers 0.19.1