Model Card for PokeeResearch
Model Details
Model Description
PokeeResearch-7B is a 7-billion-parameter deep research agent developed by Pokee AI to advance reliable, aligned, and scalable research-grade reasoning in tool-augmented LLMs.
The model integrates Reinforcement Learning from AI Feedback (RLAIF) with a robust reasoning scaffold, enabling it to conduct complex, multi-step research workflows that include self-correction, verification, and synthesis across multiple independent research threads.
- Developed by: Pokee AI
- Model type: Tool-augmented large language model (LLM) research agent
- Language(s): English, Chinese and many more
- License: Apache 2.0
- Finetuned from model: Qwen2.5-7B-Instruct
Model Sources
- Repository: https://github.com/Pokee-AI/PokeeResearchOSS
- Paper: PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold, Pokee AI, October 2025
- Project Page: https://pokee.ai/deepresearch-preview
Uses
Direct Use
PokeeResearch-7B is designed for deep research automation, where the model autonomously:
- Decomposes complex user queries
- Retrieves and reads from external sources
- Synthesizes factual, verifiable, and grounded answers
It can be used as a standalone research assistant or integrated into multi-agent systems to support academic, enterprise, or product-level research tasks.
Downstream Use
PokeeResearch-7B can be fine-tuned or extended for:
- Domain-specific scientific discovery
- Autonomous document retrieval and synthesis
- Multi-source verification and summarization pipelines
- Integration into reinforcement learning research agents (RLHF/RLAIF frameworks)
Out-of-Scope Use
The model should not be used for:
- Generating unverified or speculative claims
- Automated decision-making in high-stakes domains (medical, legal, or financial)
- Applications requiring strict factual precision without external verification
- Generating content without citation or evidence tracing
Bias, Risks, and Limitations
PokeeResearch-7B is optimized for factual grounding and robustness, but limitations include:
- Dependence on external data quality and retrieval accuracy
- Potential semantic bias introduced by AI-based feedback signals
- Limited coverage for non-English or multi-modal reasoning tasks
- Risk of hallucinated synthesis when sources conflict or lack clarity
Recommendations
Users should:
- Cross-verify answers, especially in multi-hop reasoning cases
- Monitor output for citation accuracy and alignment with source data
- Refrain from using outputs as sole evidence in decision-critical contexts
How to Get Started with the Model
Please refer to the repository README for instructions on using PokeeResearch-7B: https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md
Training Details
Training Data
- Dataset: MiroRL-GenQA dataset (MiroMind AI, 2025)
- Data characteristics: Complex, multi-turn question–answer pairs requiring multi-step reasoning
- Data filtering: No data from the benchmarks used for testing was included in training; the model was trained only on open-domain text Q&A samples
Training Procedure
Preprocessing
- Normalization and tokenization aligned with Qwen2.5 tokenizer
- Structured prompt–response pairs in research/verification format (<tool_call>, <answer>, <verification>)
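As a rough illustration of the tagged format, the sketch below pulls the contents of each block out of a response string. The tag names come from this card; the paired `<tag>...</tag>` layout, the `extract_blocks` helper, and the sample payloads are assumptions made for the example, not the project's actual parser.

```python
import re

# Tag names are taken from the card; the closing-tag layout is an assumption.
TAGS = ("tool_call", "answer", "verification")


def extract_blocks(text: str) -> dict:
    """Return the stripped contents of every tagged block, keyed by tag name."""
    blocks = {}
    for tag in TAGS:
        # Non-greedy match so that repeated blocks are captured separately.
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        blocks[tag] = [match.strip() for match in pattern.findall(text)]
    return blocks


# Hypothetical response text, for illustration only.
sample = (
    "<tool_call>search: REINFORCE Leave-One-Out</tool_call>\n"
    "<answer> RLOO is a policy gradient variant. </answer>\n"
    "<verification>CORRECT</verification>"
)
parsed = extract_blocks(sample)
print(parsed["answer"])
```

A tag that never appears maps to an empty list, which makes it easy to detect a response that skipped the answer or verification step.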
Training Hyperparameters
- Algorithm: RLOO (REINFORCE Leave-One-Out)
- Batch size: 64
- Research threads per prompt: 8
- Learning rate: 3e-6
- Context limit: 32,768 tokens
- Steps: 140 fine-tuning iterations
- Regularization: None (no entropy or KL regularization)
- Precision regime: bf16 mixed precision
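The leave-one-out baseline behind RLOO can be sketched in a few lines: each thread's advantage is its reward minus the mean reward of the other threads sampled for the same prompt. This is a minimal sketch of the standard estimator, not the project's training code; the function name and the reward values are made up for illustration.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k threads sampled from one prompt.

    A_i = r_i - mean(r_j for j != i)
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # (total - r) / (k - 1) is the mean reward of the other k - 1 threads.
    return [r - (total - r) / (k - 1) for r in rewards]


# Eight threads per prompt, matching the hyperparameters above; rewards are illustrative.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = rloo_advantages(rewards)
print(advantages)
```

Because each baseline excludes the thread's own reward, the advantages for one prompt always sum to zero, and no learned value function is needed.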
Reward Design
The reward signal combines:
- AI feedback (semantic equivalence via external LLM judge)
- Format adherence reward (ensures correct agent behavior)
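The card does not say how the two signals are combined, so the following is only one plausible scheme, assumed for illustration: a malformed response earns zero, and a well-formed one is scored by the judge. The `combined_reward` function and both toy callables are hypothetical stand-ins; a real judge would be a call to an external LLM.

```python
from typing import Callable


def combined_reward(
    response: str,
    reference: str,
    judge: Callable[[str, str], bool],
    format_ok: Callable[[str], bool],
) -> float:
    """Assumed gating scheme: the format check runs first, then the judge decides."""
    if not format_ok(response):
        return 0.0
    return 1.0 if judge(response, reference) else 0.0


# Toy stand-in for an LLM judge: substring match instead of semantic equivalence.
def toy_judge(response: str, reference: str) -> bool:
    return reference.lower() in response.lower()


# Toy stand-in for the format check: requires a paired <answer> block.
def toy_format_ok(response: str) -> bool:
    return "<answer>" in response and "</answer>" in response


print(combined_reward("<answer>Paris</answer>", "paris", toy_judge, toy_format_ok))
```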
Speeds, Sizes, Times
- Model size: 7 billion parameters
- Training duration: ~5 days on 8 × A100 80GB GPUs
- Checkpoint size: ~13 GB
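The checkpoint size is consistent with storing the weights in bf16, as a quick back-of-the-envelope check shows. This assumes the nominal 7 billion parameter count from the card and 2 bytes per weight; the exact parameter count of the backbone may differ slightly.

```python
params = 7e9         # nominal parameter count stated in the card
bytes_per_param = 2  # bf16 stores each weight in 2 bytes

size_bytes = params * bytes_per_param
size_gb = size_bytes / 1e9     # decimal gigabytes
size_gib = size_bytes / 2**30  # binary gibibytes

print(f"{size_gb:.1f} GB, {size_gib:.1f} GiB")  # prints "14.0 GB, 13.0 GiB"
```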
Evaluation
Testing Data, Factors & Metrics
Testing Data
10 open-domain research and QA benchmarks:
- NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, Humanity’s Last Exam
Factors
- Benchmarks differ by reasoning depth, retrieval dependence, and factual precision requirements.
- Evaluations disaggregate by dataset difficulty and task type (single-hop vs multi-hop).
Metrics
- Mean accuracy (mean@4 across independent research threads) based on
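A minimal sketch of how a mean@4 score can be computed: each question is attempted by four independent threads, per-question accuracy is the fraction of correct threads, and the benchmark score is the average over questions. The function name and the correctness judgments below are made up for illustration.

```python
def mean_at_k(correct):
    """correct[q][t] is True if thread t answered question q correctly."""
    per_question = [sum(threads) / len(threads) for threads in correct]
    return sum(per_question) / len(per_question)


# Three questions, four independent threads each; judgments are illustrative.
judgments = [
    [True, True, False, True],
    [False, False, False, False],
    [True, True, True, True],
]
score = mean_at_k(judgments)
print(f"mean@4 = {score:.3f}")
```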
Results
Both PokeeResearch-7B and its RTS variant outperform all baselines at 7B scale across the 10 benchmarks.
Highlights (mean@4 accuracy; PR denotes PokeeResearch-7B and PR+ denotes its RTS variant):
Method | HLE | GAIA | BrowseComp | Bamboogle | 2WikiMultiHopQA | TriviaQA | NQ | PopQA | Musique | HotpotQA |
---|---|---|---|---|---|---|---|---|---|---|
R1searcher | 5.4 | 8.3 | 1.0 | 63.2 | 61.4 | 77.2 | 59.6 | 51.8 | 35.8 | 62.4 |
SearchR1 | 13.0 | 18.7 | 0.4 | 67.8 | 62.8 | 81.0 | 67.6 | 59.6 | 33.2 | 63.2 |
ZeroSearch | 8.6 | 9.9 | 1.4 | 51.4 | 33.6 | 61.6 | 48.2 | 38.0 | 19.0 | 32.4 |
ASearcher | 13.8 | 22.1 | 3.2 | 68.8 | 69.2 | 85.2 | 71.2 | 58.2 | 35.8 | 71.0 |
DeepResearcher | 6.0 | 24.03 | 1.8 | 71.0 | 58.8 | 82.2 | 60.2 | 55.2 | 26.8 | 56.6 |
PR | 15.2 | 36.9 | 5.4 | 74.5 | 74.0 | 91.3 | 75.1 | 59.8 | 39.8 | 71.2 |
PR+ | 17.6 | 41.3 | 8.4 | 75.0 | 75.0 | 91.8 | 75.0 | 60.0 | 41.4 | 71.6 |
Summary
The PokeeResearch-7B variants achieve state-of-the-art performance among 7B-scale open deep research agents, supporting the RLAIF and reasoning scaffold design for robust, verifiable research workflows.
Technical Specifications
Model Architecture and Objective
- Base Architecture: Transformer decoder (Qwen2.5-7B-Instruct backbone)
- Objective: Reinforcement learning with AI feedback to maximize semantic correctness and alignment with human-style reasoning
Compute Infrastructure
Hardware
- NVIDIA A100 80GB GPUs: ×8 for training and ×1 for inference
Citation
BibTeX:
@article{pokee2025deepresearch,
title={PokeeResearch: Effective Deep Research via
Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold},
author={Yi Wan* and Jiuqi Wang* and Liam Li
and Jinsong Liu and Ruihao Zhu and Zheqing Zhu},
journal={Pokee AI Technical Report},
year={2025},
url={https://arxiv.org/pdf/2510.15862}
}
APA: Wan, Y., Wang, J., Li, L., Liu, J., Zhu, R., & Zhu, Z. (2025). PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold. Pokee AI.
Glossary
- RLAIF: Reinforcement Learning from AI Feedback – optimization using LLM-based reward signals.
- RLOO: REINFORCE Leave-One-Out – unbiased policy gradient variant for on-policy learning.
- RTS: Research Threads Synthesis – synthesis of multiple independent reasoning threads at inference time.
More Information
For technical details, visit: https://github.com/Pokee-AI/PokeeResearchOSS
For inquiries, contact: hello@pokee.ai
Model Card Authors
Yi Wan, Jiuqi Wang, Liam Li, Jinsong Liu, Ruihao Zhu, and Zheqing Zhu — Pokee AI Research Team
Model Card Contact
Pokee AI Team — hello@pokee.ai