ConsultEnv Qwen 3B: GRPO LoRA Adapter

A LoRA adapter for Qwen/Qwen2.5-3B-Instruct, fine-tuned to plan consulting engagements in ConsultEnv, a long-horizon RL environment built for the OpenEnv Hackathon.

Training

Base model        Qwen/Qwen2.5-3B-Instruct (3.09B params)
Method            SFT + GRPO (Group Relative Policy Optimization)
Framework         HuggingFace TRL + PEFT
LoRA config       r=8, alpha=16, dropout=0.1
Targets           q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable params  ~15M (0.5% of model)
GPU               NVIDIA RTX A4000 (16.9 GB)
SFT               lr=2e-5, 1 epoch, final loss=0.133
GRPO              lr=5e-6, 1 epoch, 8 generations/step, KL beta=0.1
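
The trainable-parameter figure can be sanity-checked from the LoRA rank and the shapes of the targeted projections. The sketch below assumes the layer dimensions of the public Qwen2.5-3B configuration (hidden size 2048, MLP size 11008, 36 layers, 2 key/value heads of dimension 128). These dimensions are not stated in this card, so treat them as assumptions.

```python
# Rough LoRA parameter count for r=8 on all seven targeted projections.
# ASSUMPTION: the layer shapes below come from the public Qwen2.5-3B config,
# not from this card.
HIDDEN = 2048          # hidden_size
INTERMEDIATE = 11008   # intermediate_size of the MLP
KV_DIM = 2 * 128       # num_key_value_heads * head_dim
NUM_LAYERS = 36
RANK = 8

# (in_features, out_features) of each targeted linear layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Each wrapped layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


total = lora_param_count(SHAPES, RANK, NUM_LAYERS)
print(f"{total:,} trainable parameters")
print(f"{100 * total / 3.09e9:.2f}% of a 3.09B-parameter base model")
```

Under these assumed shapes the count comes out at roughly 15.0M parameters, in line with the "~15M (0.5% of model)" figure above.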

Pipeline

  1. SFT phase: Supervised fine-tuning on near-optimal consulting engagement demonstrations. This teaches the JSON action schema and baseline strategies, while leaving specific strategy gaps for GRPO to close.

  2. GRPO phase: 8 rollouts per scenario per training step. Per-module intermediate rewards provide credit assignment: each module's gradient comes from its own step reward rather than from the noisy terminal signal. LoRA dropout is kept active during rollouts to add exploration diversity.
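
The per-module credit assignment described above can be illustrated with a group-relative advantage computed separately for each module. This is a minimal sketch, not the training code: the card does not state the exact normalization, so the mean/standard-deviation scaling across the rollout group, the function name, and the reward values are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(step_rewards, eps=1e-6):
    """step_rewards[g][m] is the reward of rollout g at module m.

    All rollouts are assumed to come from the same scenario. Returns
    advantages[g][m], normalized across the group for each module, so a
    module's learning signal depends only on that module's step rewards.
    """
    num_modules = len(step_rewards[0])
    advantages = [[0.0] * num_modules for _ in step_rewards]
    for m in range(num_modules):
        column = [rollout[m] for rollout in step_rewards]
        mu, sigma = mean(column), pstdev(column)
        for g, reward in enumerate(column):
            advantages[g][m] = (reward - mu) / (sigma + eps)
    return advantages


# Hypothetical rewards: 4 rollouts x 3 modules, kept small for readability
# (the card describes 8 rollouts and 7 modules).
rewards = [
    [0.9, 0.2, 0.5],
    [0.1, 0.2, 0.7],
    [0.5, 0.2, 0.3],
    [0.5, 0.2, 0.5],
]
adv = group_relative_advantages(rewards)
for row in adv:
    print([round(a, 2) for a in row])
```

A module on which every rollout scores the same (the middle column here) yields zero advantage for all rollouts, so it contributes no gradient under this scheme.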

Results

Scenario                     Pre-SFT  Post-SFT  Post-GRPO
Benchmarking Study (Easy)    0.587    0.688     0.695
Cost Optimization (Medium)   0.167    0.316     0.496
Ops Transformation (Hard)    0.094    0.335     0.394
Commercial DD (Expert)       0.148    0.231     0.312
Mean                         0.249    0.392     0.474

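Total lift: +90% in mean score over the untrained baseline. GRPO adds +21% over SFT alone. The GRPO gain is largest on Medium (+0.180) and Expert (+0.081), smaller on Hard (+0.059), and negligible on Easy (+0.007), so almost all of it comes from the scenarios where strategic decisions matter most.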
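
The headline percentages follow from the column means of the results table. The short sketch below recomputes them from the per-scenario scores and ranks the scenarios by absolute GRPO gain; the variable names are illustrative only.

```python
# (pre-SFT, post-SFT, post-GRPO) scores copied from the results table.
scores = {
    "Benchmarking Study (Easy)": (0.587, 0.688, 0.695),
    "Cost Optimization (Medium)": (0.167, 0.316, 0.496),
    "Ops Transformation (Hard)": (0.094, 0.335, 0.394),
    "Commercial DD (Expert)": (0.148, 0.231, 0.312),
}


def column_mean(index):
    """Mean score across scenarios for one training stage."""
    return sum(row[index] for row in scores.values()) / len(scores)


pre, sft, grpo = (column_mean(i) for i in range(3))

total_lift = grpo / pre - 1   # post-GRPO relative to the untrained baseline
grpo_lift = grpo / sft - 1    # post-GRPO relative to SFT alone

# Absolute GRPO gain per scenario, largest first.
gains = sorted(
    ((after - before, name) for name, (_, before, after) in scores.items()),
    reverse=True,
)

print(f"means: pre={pre:.3f} sft={sft:.3f} grpo={grpo:.3f}")
print(f"total lift: {total_lift:+.0%}, GRPO over SFT: {grpo_lift:+.0%}")
for gain, name in gains:
    print(f"{name}: {gain:+.3f}")
```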

Environment

ConsultEnv simulates end-to-end consulting engagement management:

  • 4 scenarios from Easy to Expert (pass rates as low as 1/11)
  • 22-turn episodes: staff a team, then execute 7 modules with 3 sub-tasks each
  • Action space: ~400 options per step, ~10^13 total strategy paths
  • Reward: per-step (sequencing + quality + efficiency) and terminal (profit + timeline + quality)
  • 13 novel mechanics: cascading quality, workshop isolation, tool traps, budget nuclear, discovery breakpoints, and more
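
The 22-turn episode length follows from one staffing turn plus 7 modules of 3 sub-tasks each. The sketch below lays out that schedule. The `Turn` class, the `build_schedule` function, and the strictly sequential ordering of sub-tasks are hypothetical illustrations; ConsultEnv's actual API is not shown in this card.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Turn:
    """One decision point in an episode (hypothetical structure)."""
    index: int
    phase: str                     # "staffing" or "module"
    module: Optional[int] = None   # 0-based module number
    subtask: Optional[int] = None  # 0-based sub-task within the module


def build_schedule(num_modules=7, subtasks_per_module=3):
    """Return the turn list: one staffing turn, then every module sub-task."""
    turns = [Turn(0, "staffing")]
    for module in range(num_modules):
        for subtask in range(subtasks_per_module):
            turns.append(Turn(len(turns), "module", module, subtask))
    return turns


schedule = build_schedule()
print(len(schedule), "turns")
print(schedule[0])
print(schedule[-1])
```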

Usage

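from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen2.5-3B-Instruct"

# Load the base model, then attach the LoRA adapter on top of it.
# The adapter ID is the one given by the author; this card is published as
# munish0838/consultenv-qwen3b-grpo-lora, so try that ID if this one does not resolve.
base = AutoModelForCausalLM.from_pretrained(BASE_ID)
model = PeftModel.from_pretrained(base, "munish0838/cenv-trl-grpo-v14")
model.eval()  # inference mode: disables LoRA dropout
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

prompt = """You are a consulting engagement planner. Given the scenario, output a JSON action.

Scenario: HealthFirst hospital chain benchmarking study. Budget: $380,250. Timeline: 15 days.
Available action: staff_team

Output the action as JSON:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))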
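
Because the model is prompted to emit a JSON action, the decoded text usually needs to be parsed before it can be passed to an environment. The helper below is a minimal sketch of one way to pull the first JSON object out of free-form generated text. The function name and the keys in the sample action are hypothetical, since the card does not document the action schema.

```python
import json


def extract_first_json(text):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trails it.
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON at this brace; try the next one
        start = text.find("{", start + 1)
    return None


# Hypothetical model output; the real action keys may differ.
sample = 'Output the action as JSON: {"action": "staff_team", "team": ["analyst"]} Done.'
action = extract_first_json(sample)
print(action)
```

Returning `None` instead of raising lets the caller decide how to handle a malformed generation, for example by retrying or by scoring it as an invalid action.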

Links

Environment        ConsultEnv on HF Spaces
Training notebook  train_sft_grpo_trl_v14.ipynb
Blog               BLOG.md
Training plots     training_overview.png

License

MIT
