---
license: apache-2.0
datasets:
- Gen-Verse/ReasonFlux-V2-Reasoner-DPO
language:
- en
- zh
base_model:
- Qwen/Qwen3-1.7B
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
- code
- trl
- DPO
---
# ReasonFlux-Qwen3-dpo

ReasonFlux-Qwen3-dpo is a fine-tuned version of Qwen3-1.7B, trained on the Gen-Verse/ReasonFlux-V2-Reasoner-DPO dataset. It adopts a template-augmented reasoning paradigm, internalizing structured thought templates through iterative hierarchical reinforcement learning and direct preference optimization (DPO). This design enables the model to reason more transparently, consistently, and adaptively across multi-domain scientific and mathematical tasks.
GGUF: https://huggingface.co/prithivMLmods/ReasonFlux-Qwen3-dpo-GGUF
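
The GGUF weights are intended for llama.cpp-compatible runtimes. As a hedged sketch (not taken from the upstream card), they can be run locally with the `llama-cpp-python` bindings; the filename below is a hypothetical placeholder for whichever quantization you download from the GGUF repository:

```python
# Sketch: run the GGUF build locally with llama-cpp-python (pip install llama-cpp-python).
# The model_path filename is a placeholder; use an actual quant file from the GGUF repo.
from llama_cpp import Llama

llm = Llama(
    model_path="ReasonFlux-Qwen3-dpo.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,                                     # context window for the session
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a reasoning tutor skilled in science, math, and coding."},
        {"role": "user", "content": "Derive the quadratic formula step by step."},
    ],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```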
## Key Features

- **Template-Augmented Reasoning**: Incorporates structured reasoning templates that guide step-by-step thinking, improving coherence and reducing hallucinations.
- **DPO Fine-Tuning with Hierarchical Reinforcement**: Leverages direct preference optimization alongside iterative reinforcement learning to internalize high-quality reasoning behaviors.
- **Scientific & Mathematical Expertise**: Excels at symbolic derivations, step-by-step proofs, and multi-domain STEM reasoning (physics, chemistry, biology, mathematics).
- **Code Understanding & Generation**: Provides detailed coding explanations, debugging support, and optimization hints across multiple programming languages.
- **Structured Output Mastery**: Fluent in producing LaTeX, Markdown, JSON, CSV, and YAML output for seamless integration into research and technical workflows.
- **Efficient Deployment**: Lightweight yet powerful, designed for mid-range GPUs, research clusters, and edge AI environments (see the quantized-loading sketch after this list).
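
As a hedged illustration of the efficient-deployment point, the model can be loaded in 4-bit precision with `bitsandbytes` on memory-constrained GPUs. The settings below are illustrative assumptions, not an officially recommended configuration:

```python
# Sketch: 4-bit quantized loading for memory-constrained GPUs
# (assumes `bitsandbytes` and `accelerate` are installed; values are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "prithivMLmods/ReasonFlux-Qwen3-dpo"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```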
## Quickstart with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "prithivMLmods/ReasonFlux-Qwen3-dpo"

# Load the model and tokenizer; device_map="auto" places weights on available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Explain how reinforcement learning differs from supervised learning with real-world examples."
messages = [
    {"role": "system", "content": "You are a reasoning tutor skilled in science, math, and coding."},
    {"role": "user", "content": prompt}
]

# Build the model input string from the chat messages.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
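
Because the base model is Qwen3, its chat template typically exposes an `enable_thinking` switch for toggling the explicit reasoning trace. Whether this fine-tune preserves that behavior is an assumption, so treat the snippet below as a sketch to verify against the checkpoint's own template:

```python
# Sketch (assumption): Qwen3-style chat templates accept `enable_thinking` to toggle
# the explicit <think>...</think> reasoning block; verify with this checkpoint's template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # request a direct answer without the reasoning trace
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
```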
## Intended Use

- Advanced reasoning tutor for mathematics, coding, and scientific research
- Research assistant capable of structured problem-solving with template-guided reasoning
- Technical documentation and structured data generation
- STEM-focused chatbot or API for research and education workflows
- Deployment in environments requiring transparent reasoning with efficient compute use
## Limitations

- Not optimized for casual or creative writing
- Context limitations may restrict multi-document or full codebase comprehension
- Specializes in structured reasoning; performance on general chit-chat may be weaker
- Optimized for clarity of reasoning rather than natural conversational tone