	---
	title: "RLHF (Beta)"
	description: "Reinforcement Learning from Human Feedback is a method whereby a language model is optimized from data using human feedback."
	---

	### Overview

	Reinforcement Learning from Human Feedback (RLHF) is a method whereby a language model is optimized from data using human
	feedback. Methods include, but are not limited to:

	- Proximal Policy Optimization (PPO) (not yet supported in axolotl)
	- Direct Preference Optimization (DPO)
	- Identity Preference Optimization (IPO)
	- Odds Ratio Preference Optimization (ORPO)


	### RLHF using Axolotl

	>[!IMPORTANT]
	>This is a BETA feature, and many capabilities are not fully implemented. You are encouraged to open new PRs to improve the integration and functionality.

	The RL training methods are implemented in TRL and wrapped by axolotl. The examples below show how to use preference datasets to train models that use the ChatML prompt format.

	#### DPO
	```yaml
	rl: dpo
	datasets:
	  # each entry pairs a preference dataset with the prompt strategy that parses it
	  - path: Intel/orca_dpo_pairs
	    split: train
	    type: chatml.intel
	  - path: argilla/ultrafeedback-binarized-preferences
	    split: train
	    type: chatml.argilla
	```

	#### IPO
	```yaml
	rl: ipo
	```
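
	The snippet above only switches the method. As a fuller sketch, an IPO run would presumably reuse the same preference-pair datasets as DPO. The dataset entry below is copied from the DPO example; that IPO accepts it unchanged is an assumption, not something this page states.

	```yaml
	rl: ipo
	datasets:
	  # assumption: same preference-pair format as in the DPO example
	  - path: Intel/orca_dpo_pairs
	    split: train
	    type: chatml.intel
	```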

	#### ORPO

	Paper: https://arxiv.org/abs/2403.07691

	```yaml
	rl: orpo
	# weight of the odds-ratio term relative to the SFT loss (lambda in the paper)
	orpo_alpha: 0.1
	remove_unused_columns: false

	chat_template: chatml
	datasets:
	  - path: argilla/ultrafeedback-binarized-preferences-cleaned
	    type: chat_template.argilla
	```

	#### Using local dataset files
	```yaml
	datasets:
	  - ds_type: json
	    data_files:
	      - orca_rlhf.jsonl
	    split: train
	    type: chatml.intel
	```
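
	Combined with the DPO setting shown earlier, a run on a local file might look like the sketch below. The field names in the trailing comment mirror the columns of `Intel/orca_dpo_pairs` (`system`, `question`, `chosen`, `rejected`); that a local file must follow the same layout for `chatml.intel` is an assumption here, and the values are placeholders.

	```yaml
	rl: dpo
	datasets:
	  - ds_type: json
	    data_files:
	      - orca_rlhf.jsonl
	    split: train
	    type: chatml.intel
	# Assumed shape of one line in orca_rlhf.jsonl (placeholder values):
	# {"system": "...", "question": "...", "chosen": "...", "rejected": "..."}
	```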

	#### TRL auto-unwrapping for PEFT

	TRL supports auto-unwrapping PEFT models, so a separate reference model does not need to be loaded, which reduces VRAM usage. This is on by default. To turn it off, pass the following config:

	```yaml
	# load a separate reference model when training an adapter
	rl_adapter_ref_model: true
	```
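
	For contrast, here is a sketch of an adapter-training config that relies on the default behaviour. `adapter`, `lora_r`, `lora_alpha`, and `lora_target_linear` are general axolotl LoRA options that this page does not cover, and the values are illustrative only.

	```yaml
	rl: dpo
	adapter: lora
	lora_r: 16             # illustrative value
	lora_alpha: 32         # illustrative value
	lora_target_linear: true
	# rl_adapter_ref_model is left unset, so no separate reference model is loaded
	datasets:
	  - path: Intel/orca_dpo_pairs
	    split: train
	    type: chatml.intel
	```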