OREO: Offline REasoning Optimization
Source code for Offline Reinforcement Learning for LLM Multi-Step Reasoning
Installation
This repo is based on OpenRLHF, and installation follows a similar process. We recommend using Docker to set up the environment.
First, build the Docker image
cd dockerfile
docker build -t [IMAGE_NAME] .
Start a Docker container
docker run -itd --ipc host --gpus all [IMAGE_NAME] bash
Attach to the container
docker exec -it [CONTAINER_ID] /bin/bash
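Because docker run in detached mode (-d) prints the new container's ID on stdout, the ID can be captured in a variable instead of being copied by hand. The following is a minimal sketch under that assumption; start_container is a hypothetical helper and the image name oreo is a placeholder, neither is part of this repo.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): start a detached container
# from the given image and print its ID. The flags mirror the command above;
# `docker run -d` writes the new container's ID to stdout.
start_container() {
    image="$1"
    docker run -itd --ipc host --gpus all "$image" bash
}

# Example usage (the image name "oreo" is an assumption):
#   CONTAINER_ID="$(start_container oreo)"
#   docker exec -it "$CONTAINER_ID" /bin/bash
```

Capturing the ID this way avoids a separate docker ps lookup before attaching.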
Install the current repo
cd [PATH_TO_THIS_REPO]
pip install -e .
As the data collection process involves randomness, we will publish the training data used in our experiments in the near future.
Reproduction
Training
You may need to change the following command-line options in the training scripts below:
--train_file specifies the path of training data in OREO experiments.
--dataset specifies the path of training data in SFT experiments.
--save_path specifies the path to save the model.
--pretrain specifies the path to load the pretrained model. In OREO experiments, this should be the path to the SFT model.
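Before launching a run, it can help to see where these four options are set in a given script. The sketch below assumes the options appear literally in the script text (for example as --pretrain followed by a path); show_options is a hypothetical helper, not something shipped with this repo.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): print, with line numbers,
# every line of a training script that sets one of the four path options.
# `-e` keeps grep from treating the leading "--" of the pattern as a flag.
show_options() {
    script="$1"
    grep -n -E -e '--(train_file|dataset|save_path|pretrain)([[:space:]=]|$)' "$script"
}

# Example usage, run from example/scripts:
#   show_options train_oreo.sh
```

Each reported line is a place where a local path may need to be substituted.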
Math Reasoning
Supervised fine-tuning
cd example/scripts
bash train_oreo_sft.sh
OREO training
cd example/scripts
bash train_oreo.sh
To train the DeepSeekMath-7B-Instruct model:
cd example/scripts
bash train_oreo_deepseek-math.sh
Note that DeepSeekMath-7B-Instruct has already undergone supervised fine-tuning, so there is no SFT phase here.
ALFWorld
Supervised fine-tuning
cd example/scripts
bash train_oreo_alfworld_sft.sh
OREO training
cd example/scripts
bash train_oreo_alfworld.sh
Evaluation
Math Reasoning
Make sure you have antlr4-python3-runtime==4.11.0 installed.
For Qwen-based models
cd example/scripts
python ../scratch/run_qwen.py --model [PATH_TO_YOUR_MODEL] --save [SAVE_GENERATED_RESULTS_JSONL]
For DeepSeekMath-based models
cd example/scripts
python ../scratch/run_qwen.py --model [PATH_TO_YOUR_MODEL] --no_bos --save [SAVE_GENERATED_RESULTS_JSONL]
Note the --no_bos option here.
ALFWorld
This part requires ALFWorld to be installed.
First, start a vLLM server
python -m vllm.entrypoints.openai.api_server --model [PATH_TO_YOUR_MODEL]
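The server needs some time to load the model before it accepts requests, so an evaluation started immediately afterwards may fail to connect. The following is a minimal sketch of a readiness check, assuming the server listens on vLLM's default port 8000 (no --port was passed) and that curl is available; wait_for_server is a hypothetical helper, not part of this repo.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): poll a URL once per second
# until it responds successfully, or give up after a timeout in seconds.
wait_for_server() {
    url="$1"
    timeout="${2:-300}"
    elapsed=0
    # -s silences progress output, -f makes curl fail on HTTP errors.
    until curl -sf -o /dev/null "$url"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "server at $url not ready after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Example usage (assumes the default port 8000):
#   wait_for_server http://localhost:8000/v1/models 600
```

The /v1/models route is part of the OpenAI-compatible API the server exposes, so a successful response indicates the model has been loaded.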
Then run evaluation with
cd example/scripts
python ../scratch/run_alfworld_async.py --model [PATH_TO_YOUR_MODEL] --save_dir [SAVE_GENERATED_TRAJS]
You can use --split eval_in_distribution for seen environments.
Reference
@inproceedings{Wang2024OfflineRL,
title={Offline Reinforcement Learning for LLM Multi-Step Reasoning},
author={Huaijie Wang and Shibo Hao and Hanze Dong and Shenao Zhang and Yilin Bao and Ziran Yang and Yi Wu},
year={2024},
url={https://api.semanticscholar.org/CorpusID:274965107}
}