---
license: apache-2.0
datasets:
- Gen-Verse/ReasonFlux-V2-Reasoner-DPO
language:
- en
- zh
base_model:
- Qwen/Qwen3-1.7B
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
- code
- trl
- DPO
---
# ReasonFlux-Qwen3-dpo

ReasonFlux-Qwen3-dpo is a fine-tuned version of Qwen3-1.7B, trained on the Gen-Verse/ReasonFlux-V2-Reasoner-DPO dataset. It adopts a template-augmented reasoning paradigm, internalizing structured thought templates through iterative hierarchical reinforcement learning and direct preference optimization (DPO). This design enables the model to reason more transparently, consistently, and adaptively across multi-domain scientific and mathematical tasks.
GGUF: https://huggingface.co/prithivMLmods/ReasonFlux-Qwen3-dpo-GGUF
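
The GGUF weights are intended for llama.cpp-compatible runtimes. As a hedged sketch (not taken from the upstream card), they can be run locally with the `llama-cpp-python` bindings; the filename below is a hypothetical placeholder for whichever quantization you download from the GGUF repository:

```python
# Sketch: run the GGUF build locally with llama-cpp-python (pip install llama-cpp-python).
# The model_path filename is a placeholder; use an actual quant file from the GGUF repo.
from llama_cpp import Llama

llm = Llama(
    model_path="ReasonFlux-Qwen3-dpo.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,                                     # context window for the session
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a reasoning tutor skilled in science, math, and coding."},
        {"role": "user", "content": "Derive the quadratic formula step by step."},
    ],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```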
## Key Features

- **Template-Augmented Reasoning**: Incorporates structured reasoning templates that guide step-by-step thinking, improving coherence and reducing hallucinations.
- **DPO Fine-Tuning with Hierarchical Reinforcement**: Leverages direct preference optimization alongside iterative reinforcement learning to internalize high-quality reasoning behaviors.
- **Scientific & Mathematical Expertise**: Excels at symbolic derivations, step-by-step proofs, and multi-domain STEM reasoning (physics, chemistry, biology, mathematics).
- **Code Understanding & Generation**: Provides detailed coding explanations, debugging support, and optimization hints across multiple programming languages.
- **Structured Output Mastery**: Fluent in producing LaTeX, Markdown, JSON, CSV, and YAML output for seamless integration into research and technical workflows.
- **Efficient Deployment**: Lightweight yet powerful, designed for mid-range GPUs, research clusters, and edge AI environments (see the quantized-loading sketch after this list).
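
As a hedged illustration of the efficient-deployment point, the model can be loaded in 4-bit precision with `bitsandbytes` on memory-constrained GPUs. The settings below are illustrative assumptions, not an officially recommended configuration:

```python
# Sketch: 4-bit quantized loading for memory-constrained GPUs
# (assumes `bitsandbytes` and `accelerate` are installed; values are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "prithivMLmods/ReasonFlux-Qwen3-dpo"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```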
## Quickstart with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "prithivMLmods/ReasonFlux-Qwen3-dpo"

# Load the model and tokenizer; device_map="auto" places weights on available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Explain how reinforcement learning differs from supervised learning with real-world examples."
messages = [
    {"role": "system", "content": "You are a reasoning tutor skilled in science, math, and coding."},
    {"role": "user", "content": prompt}
]

# Build the model input string from the chat messages.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
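
Because the base model is Qwen3, its chat template typically exposes an `enable_thinking` switch for toggling the explicit reasoning trace. Whether this fine-tune preserves that behavior is an assumption, so treat the snippet below as a sketch to verify against the checkpoint's own template:

```python
# Sketch (assumption): Qwen3-style chat templates accept `enable_thinking` to toggle
# the explicit <think>...</think> reasoning block; verify with this checkpoint's template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # request a direct answer without the reasoning trace
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
```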
## Intended Use

- Advanced reasoning tutor for mathematics, coding, and scientific research
- Research assistant capable of structured problem-solving with template-guided reasoning
- Technical documentation and structured data generation
- STEM-focused chatbot or API for research and education workflows
- Deployment in environments requiring transparent reasoning with efficient compute use
## Limitations

- Not optimized for casual or creative writing
- Context limitations may restrict multi-document or full codebase comprehension
- Specializes in structured reasoning; performance on general chit-chat may be weaker
- Optimized for clarity of reasoning rather than natural conversational tone