
✨ Agentic Reinforced Policy Optimization (ARPO)
Agentic Reinforced Policy Optimization (ARPO) is a novel agentic RL algorithm tailored for training multi-turn Large Language Model (LLM)-based agents. It addresses the challenge of balancing LLMs' intrinsic long-horizon reasoning capabilities with their proficiency in multi-turn tool interactions.
💡 Overview
We propose Agentic Reinforced Policy Optimization (ARPO), an agentic RL algorithm tailored for training multi-turn LLM-based agents. The core principle of ARPO is to encourage the policy model to adaptively branch sampling during high-entropy tool-call rounds, thereby efficiently aligning step-level tool-use behaviors.
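As a rough illustration of this principle, the sketch below duplicates a partial rollout whenever the tokens following a tool-call round are high-entropy, so that later sampling explores several continuations from that point. All names and hyperparameters here (rollout_step_stub, entropy_threshold, num_branches) are hypothetical placeholders, not the released training code; see the GitHub repository for the actual implementation.
import random

def rollout_step_stub(trajectory):
    # Stand-in for one reasoning round that ends in a tool call and its feedback:
    # returns (appended text, finished flag, entropy of its initial tokens).
    return " <round>", random.random() < 0.3, random.uniform(0.0, 2.0)

def collect_trajectories(rollout_step, prompt, max_rounds=8,
                         entropy_threshold=1.0, num_branches=2, max_partial=16):
    # Keep a pool of partial trajectories; when a round's initial tokens are
    # high-entropy, duplicate that partial trajectory so subsequent sampling
    # explores several continuations from the same branching point.
    frontier, finished = [prompt], []
    for _ in range(max_rounds):
        next_frontier = []
        for traj in frontier:
            step, done, entropy = rollout_step(traj)
            extended = traj + step
            if done:
                finished.append(extended)
            elif entropy > entropy_threshold and len(next_frontier) < max_partial:
                next_frontier.extend([extended] * num_branches)
            else:
                next_frontier.append(extended)
        frontier = next_frontier
    return finished + frontier  # completed rollouts used for the policy update

print(len(collect_trajectories(rollout_step_stub, "Q: ...")))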
In the figure (left), the initial tokens generated by the LLM after receiving each round of tool-call feedback consistently exhibit high entropy. This indicates that external tool calls introduce significant uncertainty into the LLM's reasoning process.
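For intuition, this behavior can be probed with standard transformers generation utilities by measuring the entropy of the first few token distributions produced after a tool response is appended to the context. The helper below is only an illustrative measurement (the function name, the choice of k, and the nat-based entropy are our assumptions), not part of the ARPO training code.
import torch
import torch.nn.functional as F

def initial_token_entropy(model, tokenizer, prompt_with_tool_feedback, k=8):
    # Average Shannon entropy (in nats) of the distributions over the first k
    # tokens the model generates after the tool response has been appended.
    input_ids = tokenizer(prompt_with_tool_feedback,
                          return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        out = model.generate(input_ids, max_new_tokens=k, do_sample=True,
                             output_scores=True, return_dict_in_generate=True)
    entropies = []
    for scores in out.scores:  # one (batch, vocab) logit tensor per generated token
        probs = F.softmax(scores[0].float(), dim=-1)
        entropies.append(-(probs * probs.clamp_min(1e-12).log()).sum())
    return torch.stack(entropies).mean().item()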
In the figure (right), we validate ARPO's performance across 13 datasets. Notably, Qwen3-14B with ARPO excelled in Pass@5, achieving 61.2% on GAIA and 24.0% on HLE, while requiring only about half the tool calls compared to GRPO during training.
🚀 Quick Start
This section provides a basic example of how to perform inference with an ARPO-trained model using the transformers library. For more detailed instructions on training and evaluation, please refer to the official GitHub repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Load the model and tokenizer
# Replace "dongguanting/Llama3.1-8B-ARPO" with the specific ARPO checkpoint you want to use.
model_name = "dongguanting/Llama3.1-8B-ARPO"  # Example ARPO model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Use bfloat16 for better performance on compatible hardware
    device_map="auto",
    trust_remote_code=True,  # Required for custom modeling code, if applicable
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Set the generation configuration based on the model's generation_config.json
model.generation_config = GenerationConfig.from_pretrained(
    model_name,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
    eos_token_id=[128001, 128008, 128009],  # From special_tokens_map.json and generation_config.json
    pad_token_id=tokenizer.eos_token_id,  # Common practice for LLMs
)

# Prepare messages using the chat template (e.g., Llama 3.1 or similar)
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Apply the chat template and tokenize the input
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

# Generate a response
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256)

# Decode and print the generated text, excluding the input prompt
response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
print(f"Assistant: {response}")
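Because ARPO-trained models are intended for multi-turn tool use, practical inference typically wraps generate in a loop that detects tool calls, executes them, and feeds the results back into the conversation. The following is a minimal sketch under assumed conventions: the <tool_call> marker, the execute_tool stub, and the use of a plain user turn for tool feedback are placeholders; the exact tags and tool interface are defined by the training template in the official repository.
def execute_tool(assistant_reply):
    # Placeholder: parse the tool request from the reply and run the real tool
    # (search engine, Python interpreter, etc.). Returns the tool's output text.
    return "TOOL OUTPUT (stub)"

def agent_loop(model, tokenizer, messages, max_rounds=5):
    reply = ""
    for _ in range(max_rounds):
        text = tokenizer.apply_chat_template(messages, tokenize=False,
                                             add_generation_prompt=True)
        input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            output_ids = model.generate(input_ids, max_new_tokens=512)
        reply = tokenizer.decode(output_ids[0][input_ids.shape[1]:],
                                 skip_special_tokens=True).strip()
        messages.append({"role": "assistant", "content": reply})
        if "<tool_call>" not in reply:  # no tool requested; final answer reached
            break
        tool_result = execute_tool(reply)
        # Feed the tool output back as the next turn (role/format depends on the template).
        messages.append({"role": "user", "content": f"Tool result: {tool_result}"})
    return reply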
📖 Citation
If you find this work helpful, please cite our paper:
@misc{dong2025arpo,
      title={Agentic Reinforced Policy Optimization},
      author={Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji-Rong Wen and Zhicheng Dou},
      year={2025},
      eprint={2507.19849},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.19849},
}
🤝 Acknowledgments
This training implementation builds upon Tool-Star, Llama Factory, verl and ReCall. For evaluation, we rely on WebThinker, HIRA, WebSailor, Search-o1, and FlashRAG. The Python interpreter design references ToRA and ToRL, while our models are trained using Qwen2.5. We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.
📬 Contact
For any questions or feedback, please reach out to us at dongguanting@ruc.edu.cn.