# OREO: Offline REasoning Optimization
Source code for [Offline Reinforcement Learning for LLM Multi-Step Reasoning](https://arxiv.org/abs/2412.16145)
Models: [Policy](https://huggingface.co/jwhj/Qwen2.5-Math-1.5B-OREO) | [Value](https://huggingface.co/jwhj/Qwen2.5-Math-1.5B-OREO-Value)
<img src="https://raw.githubusercontent.com/jwhj/OREO/refs/heads/main/OREO.png" alt="OREO overview" width="50%" />
# Installation
This repo is based on [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), and installation follows a similar process. We recommend using Docker to set up the environment.
First, build the Docker image
```bash
cd dockerfile
docker build -t [IMAGE_NAME] .
```
Start a docker container
```bash
docker run -itd --ipc host --gpus all [IMAGE_NAME] bash
```
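If you need the container ID for the next step, list the running containers:
```bash
docker ps   # shows the CONTAINER ID of the container started above
```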
Attach to the container
```bash
docker exec -it [CONTAINER_ID] /bin/bash
```
Install the current repo
```bash
cd [PATH_TO_THIS_REPO]
pip install -e .
```
Because the data collection process involves randomness, we will publish the exact training data used in our experiments in the near future.
# Reproduction
## Training
Depending on your setup, you may need to change the following command-line options in the scripts below (a short editing sketch follows this list):
- `--train_file` specifies the path to the training data in OREO experiments.
- `--dataset` specifies the path to the training data in SFT experiments.
- `--save_path` specifies the path where the model will be saved.
- `--pretrain` specifies the path of the pretrained model to load. In OREO experiments, this should be the path to the SFT model.
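For example, a minimal sketch of pointing these flags at your own paths (all paths below are placeholders, and the actual contents of the scripts may differ):
```bash
# Placeholder paths -- substitute locations that exist on your machine.
# This only rewrites the flags listed above inside the script; inspect the
# script first, since its actual contents may differ from this sketch.
cd example/scripts
grep -n -- '--train_file\|--pretrain\|--save_path' train_oreo.sh   # locate the flags
sed -i 's|--pretrain [^ ]*|--pretrain /checkpoints/qwen2.5-math-1.5b-sft|' train_oreo.sh
sed -i 's|--train_file [^ ]*|--train_file /data/oreo/math_train.json|' train_oreo.sh
sed -i 's|--save_path [^ ]*|--save_path /checkpoints/qwen2.5-math-1.5b-oreo|' train_oreo.sh
```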
### Math Reasoning
Supervised fine-tuning
```bash
cd example/scripts
bash train_oreo_sft.sh
```
OREO training
```bash
cd example/scripts
bash train_oreo.sh
```
To train the `DeepSeekMath-7B-Instruct` model,
```bash
cd example/scripts
bash train_oreo_deepseek-math.sh
```
Note that `DeepSeekMath-7B-Instruct` has already been supervised fine-tuned, so there is no SFT phase here.
### ALFWorld
Supervised fine-tuning
```bash
cd example/scripts
bash train_oreo_alfworld_sft.sh
```
OREO training
```bash
cd example/scripts
bash train_oreo_alfworld.sh
```
## Evaluation
### Math Reasoning
Make sure you have `antlr4-python3-runtime==4.11.0` installed.
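If it is not already present, it can be installed with pip:
```bash
pip install antlr4-python3-runtime==4.11.0
```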
For Qwen-based models
```bash
cd example/scripts
python ../scratch/run_qwen.py --model [PATH_TO_YOUR_MODEL] --save [SAVE_GENERATED_RESULTS_JSONL]
```
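For example, assuming `run_qwen.py` also accepts a Hugging Face model ID (not just a local path), the released policy model linked above could be evaluated with:
```bash
cd example/scripts
python ../scratch/run_qwen.py --model jwhj/Qwen2.5-Math-1.5B-OREO --save results/oreo_math.jsonl
```
Here `results/oreo_math.jsonl` is just an example output path.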
For DeepSeekMath-based models
```bash
cd example/scripts
python ../scratch/run_qwen.py --model [PATH_TO_YOUR_MODEL] --no_bos --save [SAVE_GENERATED_RESULTS_JSONL]
```
Note the `--no_bos` option here.
### ALFWorld
This part requires [ALFWorld](https://github.com/alfworld/alfworld) to be installed.
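A minimal install sketch, based on the ALFWorld repository (check the linked repo for the authoritative, up-to-date steps):
```bash
# Assumption: ALFWorld is installed from PyPI and its game data is fetched with
# its `alfworld-download` helper; follow https://github.com/alfworld/alfworld if this differs.
pip install 'alfworld[full]'
export ALFWORLD_DATA=~/alfworld-data   # where the game files will be stored
alfworld-download
```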
First, start a vLLM server
```bash
python -m vllm.entrypoints.openai.api_server --model [PATH_TO_YOUR_MODEL]
```
Then run evaluation with
```bash
cd example/scripts
python ../scratch/run_alfworld_async.py --model [PATH_TO_YOUR_MODEL] --save_dir [SAVE_GENERATED_TRAJS]
```
You can pass `--split eval_in_distribution` to evaluate on seen environments.
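For example:
```bash
cd example/scripts
python ../scratch/run_alfworld_async.py --model [PATH_TO_YOUR_MODEL] --save_dir [SAVE_GENERATED_TRAJS] --split eval_in_distribution
```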
## Reference
```bibtex
@inproceedings{Wang2024OfflineRL,
  title={Offline Reinforcement Learning for LLM Multi-Step Reasoning},
  author={Huaijie Wang and Shibo Hao and Hanze Dong and Shenao Zhang and Yilin Bao and Ziran Yang and Yi Wu},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:274965107}
}
```