Teleut-7b-GGUF / README.md

aashish1904

Upload README.md with huggingface_hub

ff0d765 verified 24 days ago

5.26 kB


	---

	library_name: transformers
	license: apache-2.0
	base_model: Qwen/Qwen2.5-7B
	datasets:
	- allenai/tulu-3-sft-mixture

	---

	[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)


	# QuantFactory/Teleut-7b-GGUF
	This is quantized version of [allura-org/Teleut-7b](https://huggingface.co/allura-org/Teleut-7b) created using llama.cpp

	# Original Model Card


	# Teleut 7b

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/UqIi8eztdptvt52Mak_1K.png)

	A replication attempt of Tulu 3 on the Qwen 2.5 base models.

	## Evals (so far)
	\| \| Teleut 7B (measured) \| Tülu 3 SFT 8B (reported) \| Qwen 2.5 7B Instruct (reported) \| Ministral 8B (reported) \| Mistral 7B v0.3 (reported)
	\|-------------------------\|----------------------\|--------------------------\|---------------------------------\|-------------------------\|---------------------------
	\|BBH (3 shot, CoT) \|64.4% \|67.9% \|21.7% \|56.2% \|47.0%<sup>NLL</sup>
	\|GSM8K (8 shot, CoT) \|78.5% \|76.2% \|83.8% \|80.0% \|xx.x%
	\|IFEval (prompt loose) \|66.3% \|72.8% \|74.7% \|56.4% \|53.0%
	\|MMLU (0 shot, CoT) \|73.2% \|65.9% \|76.6% \|68.5% \|30.7%<sup>5-shot</sup>
	\|MMLU Pro (0 shot, CoT) \|48.3% \|44.3% \|56.3%<sup>Unknown</sup> \|32.9%<sup>5-shot</sup> \|30.7%<sup>5-shot</sup>
	\|PopQA (15 shot) \|18.9% \|29.3% \|18.1% \|20.2% \|xx.x%
	\|TruthfulQA \|47.2% \|46.8% \|63.1% \|55.5% \|xx.x%

	## Credits
	Big thanks to Retis Labs for being providing my 8xH100 polycule used to train and test this model!
	Another big thanks to AllenAI for publishing the Tülu 3 data and model series (as well as the paper and details on training), as well as Alibaba for training the original Qwen 2.5 base model series!

	```
	@article{lambert2024tulu3,
	title = {Tülu 3: Pushing Frontiers in Open Language Model Post-Training},
	author = {
	Nathan Lambert and
	Jacob Morrison and
	Valentina Pyatkin and
	Shengyi Huang and
	Hamish Ivison and
	Faeze Brahman and
	Lester James V. Miranda and
	Alisa Liu and
	Nouha Dziri and
	Shane Lyu and
	Yuling Gu and
	Saumya Malik and
	Victoria Graf and
	Jena D. Hwang and
	Jiangjiang Yang and
	Ronan Le Bras and
	Oyvind Tafjord and
	Chris Wilhelm and
	Luca Soldaini and
	Noah A. Smith and
	Yizhong Wang and
	Pradeep Dasigi and
	Hannaneh Hajishirzi
	},
	year = {2024},
	email = {tulu@allenai.org}
	}
	```

	## Training procedure

	[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 3.5e-06
	- train_batch_size: 8
	- eval_batch_size: 8
	- seed: 42
	- distributed_type: multi-GPU
	- num_devices: 8
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 128
	- total_eval_batch_size: 64
	- optimizer: Use paged_ademamix_8bit and the args are:
	No additional optimizer arguments
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_steps: 370
	- num_epochs: 1

	### Framework versions

	- Transformers 4.46.3
	- Pytorch 2.5.1+cu124
	- Datasets 3.1.0
	- Tokenizers 0.20.3

	### Configuration
	<details><summary>See axolotl config</summary>

	axolotl version: `0.5.2`
	```yaml
	base_model: Qwen/Qwen2.5-7B

	plugins:
	- axolotl.integrations.liger.LigerPlugin
	liger_rope: true
	liger_rms_norm: true
	liger_glu_activation: true
	liger_fused_linear_cross_entropy: true

	strict: false

	chat_template: chatml
	datasets:
	- path: allenai/tulu-3-sft-mixture
	type: chat_template
	split: train
	field_messages: messages

	dataset_prepared_path: last_run_prepared
	#val_set_size: 0.02
	output_dir: ./ckpts

	sequence_len: 8192
	#sample_packing: true
	pad_to_sequence_len: true

	wandb_project: qwen-2.5-7b-sft
	wandb_entity:
	wandb_watch:
	wandb_name:
	wandb_log_model:

	gradient_accumulation_steps: 2
	micro_batch_size: 8
	num_epochs: 1
	optimizer: paged_ademamix_8bit
	lr_scheduler: cosine
	learning_rate: 3.5e-6

	train_on_inputs: false
	group_by_length: false
	bf16: auto
	fp16:
	tf32: false

	gradient_checkpointing: true
	gradient_checkpointing_kwargs:
	use_reentrant: false
	early_stopping_patience:
	resume_from_checkpoint:
	logging_steps: 1
	xformers_attention:
	flash_attention: true

	deepspeed: deepspeed_configs/zero3_bf16.json

	warmup_steps: 370
	#evals_per_epoch: 4
	eval_table_size:
	saves_per_epoch: 2
	debug:
	weight_decay: 0.0

	```

	</details><br>