ConsultEnv Qwen 3B — GRPO LoRA Adapter

A LoRA adapter for Qwen/Qwen2.5-3B-Instruct, fine-tuned to plan consulting engagements in ConsultEnv, a long-horizon RL environment built for the OpenEnv Hackathon.

Training


Base model	Qwen/Qwen2.5-3B-Instruct (3.09B params)
Method	SFT + GRPO (Group-Relative Policy Optimization)
Framework	HuggingFace TRL + PEFT
LoRA config	r=8, alpha=16, dropout=0.1
Targets	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable params	~15M (0.5% of model)
GPU	NVIDIA RTX A4000 (16.9 GB)
SFT	lr=2e-5, 1 epoch, final loss=0.133
GRPO	lr=5e-6, 1 epoch, 8 generations/step, KL beta=0.1
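
As a rough sanity check on the trainable-parameter figure, the sketch below counts the LoRA weights implied by r=8 on the seven target modules. The layer dimensions are assumptions taken from the public Qwen2.5-3B configuration (hidden size 2048, MLP width 11008, 36 layers, 2 key/value heads of dimension 128), not from this card, so treat the result as an estimate.

```python
# Assumed Qwen2.5-3B dimensions (from the public model config, not from this card).
HIDDEN = 2048          # hidden_size
INTERMEDIATE = 11008   # intermediate_size (MLP width)
LAYERS = 36            # num_hidden_layers
KV_DIM = 2 * 128       # num_key_value_heads * head_dim (grouped-query attention)
RANK = 8               # LoRA r from the table above

# (in_features, out_features) of each targeted linear layer in one decoder block.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds two matrices per wrapped layer: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * LAYERS

print(f"{total:,} trainable LoRA parameters")
print(f"{100 * total / 3.09e9:.2f}% of a 3.09B-parameter base model")
```

Under these assumptions the count lands just under 15M, which agrees with the ~15M (0.5%) reported in the table.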

Pipeline

SFT phase: Fine-tuned on near-optimal consulting engagement demonstrations. This teaches the JSON action schema and baseline strategies, while leaving specific gaps for GRPO to fill.
GRPO phase: 8 rollouts per scenario per training step. Per-module intermediate rewards provide credit assignment: each module's gradient comes from its own step reward, not the noisy terminal signal. LoRA dropout stays active during rollouts to diversify exploration.
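
The per-module credit assignment described above can be illustrated with a small sketch. It assumes the common GRPO normalization, (reward - group mean) / (group std + eps), applied separately to each module across the rollouts of one scenario; the training notebook may normalize differently, and the reward values below are made up for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(step_rewards, eps=1e-6):
    """Normalize each module's reward across the rollout group.

    step_rewards[i][m] is the step reward of rollout i at module m.
    Returns a matrix of the same shape holding per-module advantages.
    """
    n_modules = len(step_rewards[0])
    advantages = [[0.0] * n_modules for _ in step_rewards]
    for m in range(n_modules):
        column = [rollout[m] for rollout in step_rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, reward in enumerate(column):
            advantages[i][m] = (reward - mu) / (sigma + eps)
    return advantages


# Hypothetical step rewards: 4 rollouts x 3 modules (the card uses 8 rollouts).
rewards = [
    [0.2, 0.5, 0.1],
    [0.4, 0.5, 0.3],
    [0.6, 0.5, 0.7],
    [0.8, 0.5, 0.9],
]
adv = group_relative_advantages(rewards)
for row in adv:
    print([round(a, 2) for a in row])
```

In this sketch a module where every rollout scores the same (the middle column) gets zero advantage, so it contributes no gradient, while modules with spread in their step rewards are pushed toward the better rollouts.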

Results

Scenario	Pre-SFT	Post-SFT	Post-GRPO
Benchmarking Study (Easy)	0.587	0.688	0.695
Cost Optimization (Medium)	0.167	0.316	0.496
Ops Transformation (Hard)	0.094	0.335	0.394
Commercial DD (Expert)	0.148	0.231	0.312
Mean	0.249	0.392	0.474

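Total lift: +90% in mean score over the untrained baseline (0.249 to 0.474). GRPO adds +21% over SFT alone (0.392 to 0.474), with the biggest gains on the harder scenarios, where strategic decisions matter most: Medium (+0.180), Expert (+0.081), and Hard (+0.059), versus +0.007 on Easy.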
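
The summary percentages can be re-derived from the table. The short script below is only a convenience check: the scores are copied from the rows above, and the variable names are illustrative.

```python
# Scores copied from the results table: (Pre-SFT, Post-SFT, Post-GRPO).
scores = {
    "Benchmarking Study (Easy)": (0.587, 0.688, 0.695),
    "Cost Optimization (Medium)": (0.167, 0.316, 0.496),
    "Ops Transformation (Hard)": (0.094, 0.335, 0.394),
    "Commercial DD (Expert)": (0.148, 0.231, 0.312),
}


def column_mean(k: int) -> float:
    """Mean of column k across the four scenarios."""
    return sum(row[k] for row in scores.values()) / len(scores)


pre, sft, grpo = (column_mean(k) for k in range(3))

total_lift = grpo / pre - 1   # relative lift over the untrained baseline
grpo_lift = grpo / sft - 1    # relative lift of GRPO over SFT alone

# Absolute gain contributed by the GRPO phase, per scenario, largest first.
grpo_gain = sorted(
    ((name, round(row[2] - row[1], 3)) for name, row in scores.items()),
    key=lambda item: item[1],
    reverse=True,
)

print(f"Total lift: {total_lift:+.0%}")
print(f"GRPO over SFT: {grpo_lift:+.0%}")
for name, gain in grpo_gain:
    print(f"{name}: {gain:+.3f}")
```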

Environment

ConsultEnv simulates end-to-end consulting engagement management:

4 scenarios from Easy to Expert (pass rates as low as 1/11)
22-turn episodes: staff a team, then execute 7 modules with 3 sub-tasks each
Action space: ~400 options per step, ~10^13 total strategy paths
Reward: per-step (sequencing + quality + efficiency) and terminal (profit + timeline + quality)
13 novel mechanics: cascading quality, workshop isolation, tool traps, budget nuclear, discovery breakpoints, and more
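
To make the 22-turn figure concrete, the sketch below enumerates one episode as a staffing turn followed by 7 modules of 3 sub-tasks each. The `staff_team` name comes from the usage prompt in this card; the `execute` label and the tuple layout are hypothetical and are not the environment's actual action schema.

```python
N_MODULES = 7
SUBTASKS_PER_MODULE = 3


def episode_turns():
    """Return the turn sequence of one episode as (kind, module, subtask) tuples."""
    turns = [("staff_team", None, None)]  # turn 1: staff the team
    for module in range(1, N_MODULES + 1):
        for subtask in range(1, SUBTASKS_PER_MODULE + 1):
            # "execute" is a placeholder label for a module sub-task turn.
            turns.append(("execute", module, subtask))
    return turns


turns = episode_turns()
print(len(turns))  # 1 staffing turn + 7 * 3 sub-task turns
```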

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter weights on top of it.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
model = PeftModel.from_pretrained(base, "munish0838/cenv-trl-grpo-v14")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

prompt = """You are a consulting engagement planner. Given the scenario, output a JSON action.

Scenario: HealthFirst hospital chain benchmarking study. Budget: $380,250. Timeline: 15 days.
Available action: staff_team

Output the action as JSON:"""

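# Tokenize the prompt, generate up to 128 new tokens, and decode the result.
# The decoded text contains the prompt followed by the generated action.
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))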
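
Because the decoded text contains the prompt and possibly trailing tokens, it can help to pull the JSON action out before passing it to the environment. The helper below is a minimal sketch using only the standard library; the function name and the fields in the sample string are hypothetical and do not reflect the real ConsultEnv action schema.

```python
import json


def extract_first_json_object(text: str):
    """Return the first parseable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


# Hypothetical model output; the field names are made up for illustration.
sample = (
    "Output the action as JSON: "
    '{"action": "staff_team", "team": ["partner", "analyst"]}\nDone.'
)
action = extract_first_json_object(sample)
print(action)
```

Returning `None` on failure lets the caller decide how to treat a malformed action, for example by retrying generation.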

Links


Environment	ConsultEnv on HF Spaces
Training notebook	train_sft_grpo_trl_v14.ipynb
Blog	BLOG.md
Training plots	training_overview.png

License

MIT
