ConsultEnv Qwen 3B: GRPO LoRA Adapter
A LoRA adapter for Qwen/Qwen2.5-3B-Instruct, fine-tuned to plan consulting engagements in ConsultEnv, a long-horizon RL environment built for the OpenEnv Hackathon.
Training
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B-Instruct (3.09B params) |
| Method | SFT + GRPO (Group-Relative Policy Optimization) |
| Framework | HuggingFace TRL + PEFT |
| LoRA config | r=8, alpha=16, dropout=0.1 |
| Targets | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | ~15M (0.5% of model) |
| GPU | NVIDIA RTX A4000 (16.9 GB) |
| SFT | lr=2e-5, 1 epoch, final loss=0.133 |
| GRPO | lr=5e-6, 1 epoch, 8 generations/step, KL beta=0.1 |
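The "~15M (0.5% of model)" figure can be sanity-checked by hand. The sketch below counts the LoRA parameters for r=8 on the seven target projections. The layer dimensions are an assumption taken from the published Qwen2.5-3B-Instruct configuration (hidden size 2048, MLP width 11008, 36 layers, 2 key/value heads of dimension 128), not values stated in this card.

```python
# Dimensions assumed from the base model's published config.json (not from this card).
HIDDEN = 2048          # hidden_size
INTERMEDIATE = 11008   # intermediate_size (MLP width)
LAYERS = 36            # num_hidden_layers
KV_DIM = 2 * 128       # num_key_value_heads * head_dim (grouped-query attention)
RANK = 8               # LoRA r from the training table

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank=RANK):
    # LoRA adds A (rank x in_features) and B (out_features x rank) per adapted matrix.
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o) for i, o in TARGETS.values())
total = per_layer * LAYERS
print(f"{total:,} trainable parameters ({total / 3.09e9:.2%} of 3.09B)")
```

Under these assumptions the count is 14,966,784, which matches the rounded figures in the table.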
Pipeline
SFT phase: Fine-tuned on near-optimal consulting engagement demonstrations. This teaches the JSON action schema and baseline strategies while leaving specific gaps for GRPO to discover.
GRPO phase: 8 rollouts per scenario per training step. Per-module intermediate rewards provide credit assignment: each module's gradient comes from its own step reward, not the noisy terminal signal. LoRA dropout stays active during rollouts to keep exploration diverse.
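As a rough illustration of per-step credit assignment, the sketch below applies the standard GRPO normalization (reward minus group mean, divided by group standard deviation) separately at each step across a group of 8 rollouts. This is an assumption about how step rewards could be turned into advantages, not necessarily how the training notebook implements it. The function name and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(step_rewards, eps=1e-6):
    """Normalize rewards within the rollout group, one step at a time.

    step_rewards[i][t] is the reward of rollout i at step t; the result has
    the same shape and holds the group-relative advantage for each entry.
    """
    num_steps = len(step_rewards[0])
    advantages = [[0.0] * num_steps for _ in step_rewards]
    for t in range(num_steps):
        column = [rollout[t] for rollout in step_rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, reward in enumerate(column):
            # eps avoids division by zero when every rollout scored the same.
            advantages[i][t] = (reward - mu) / (sigma + eps)
    return advantages


# Hypothetical step rewards: 8 rollouts of one scenario, 2 module steps each.
demo_rewards = [
    [0.2, 0.6], [0.4, 0.6], [0.9, 0.1], [0.4, 0.6],
    [0.1, 0.8], [0.5, 0.3], [0.7, 0.6], [0.4, 0.4],
]
for row in group_relative_advantages(demo_rewards):
    print([round(a, 2) for a in row])
```

With this scheme a rollout that did well on one module and poorly on another gets a positive advantage for the first step and a negative one for the second, instead of a single blended signal.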
Results
| Scenario | Pre-SFT | Post-SFT | Post-GRPO |
|---|---|---|---|
| Benchmarking Study (Easy) | 0.587 | 0.688 | 0.695 |
| Cost Optimization (Medium) | 0.167 | 0.316 | 0.496 |
| Ops Transformation (Hard) | 0.094 | 0.335 | 0.394 |
| Commercial DD (Expert) | 0.148 | 0.231 | 0.312 |
| Mean | 0.249 | 0.392 | 0.474 |
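Total lift: +90% in mean score over the untrained baseline (0.249 to 0.474). GRPO adds +21% over SFT alone (0.392 to 0.474). Gains concentrate on the harder scenarios, where strategic decisions matter most: Medium +0.180, Expert +0.081, and Hard +0.059, versus +0.007 on Easy.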
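The headline percentages are relative lifts in the mean score. A small sketch that recomputes them from the results table (scores copied from the table; the helper name is illustrative):

```python
# (Pre-SFT, Post-SFT, Post-GRPO) scores copied from the results table.
scores = {
    "Benchmarking Study (Easy)": (0.587, 0.688, 0.695),
    "Cost Optimization (Medium)": (0.167, 0.316, 0.496),
    "Ops Transformation (Hard)": (0.094, 0.335, 0.394),
    "Commercial DD (Expert)": (0.148, 0.231, 0.312),
}


def relative_lift(before, after):
    # Fractional improvement of `after` over `before`.
    return (after - before) / before


# Column-wise means across the four scenarios.
pre, sft, grpo = (sum(col) / len(scores) for col in zip(*scores.values()))

print(f"Post-GRPO vs Pre-SFT:  {relative_lift(pre, grpo):+.0%}")
print(f"Post-GRPO vs Post-SFT: {relative_lift(sft, grpo):+.0%}")

# Absolute gain that GRPO adds on top of SFT, per scenario.
for name, (_, post_sft, post_grpo) in scores.items():
    print(f"{name}: {post_grpo - post_sft:+.3f}")
```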
Environment
ConsultEnv simulates end-to-end consulting engagement management:
- 4 scenarios from Easy to Expert (pass rates as low as 1/11)
- 22-turn episodes: staff a team, then execute 7 modules with 3 sub-tasks each
- Action space: ~400 options per step, ~10^13 total strategy paths
- Reward: per-step (sequencing + quality + efficiency) and terminal (profit + timeline + quality)
- 13 novel mechanics: cascading quality, workshop isolation, tool traps, budget nuclear, discovery breakpoints, and more
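The 22-turn episode length is consistent with one staffing turn plus 7 modules of 3 sub-tasks each. A minimal sketch of that schedule, assuming each sub-task takes exactly one turn. The `execute` label and the tuple layout are hypothetical and not the environment's actual API.

```python
NUM_MODULES = 7
SUBTASKS_PER_MODULE = 3


def episode_schedule():
    """Return one (action, module, subtask) tuple per turn of an episode."""
    # Turn 1 staffs the team; every later turn runs one sub-task of one module.
    turns = [("staff_team", None, None)]
    for module in range(1, NUM_MODULES + 1):
        for subtask in range(1, SUBTASKS_PER_MODULE + 1):
            turns.append(("execute", module, subtask))  # hypothetical label
    return turns


schedule = episode_schedule()
print(len(schedule), "turns")
print(schedule[:4])
```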
Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter weights on top of it.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
model = PeftModel.from_pretrained(base, "munish0838/consultenv-qwen3b-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

prompt = """You are a consulting engagement planner. Given the scenario, output a JSON action.
Scenario: HealthFirst hospital chain benchmarking study. Budget: $380,250. Timeline: 15 days.
Available action: staff_team
Output the action as JSON:"""

# Tokenize, generate up to 128 new tokens, and decode (the output includes the prompt).
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
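The decoded string contains the prompt followed by the completion, so the JSON action still has to be pulled out before it can be sent to the environment. The helper below is one possible way to do that with the standard library. The function name is illustrative, and the action in the example is a made-up placeholder, not ConsultEnv's real schema.

```python
import json


def extract_action(generated_text, prompt=""):
    """Return the first JSON object found after the prompt, or None."""
    # Drop the echoed prompt so braces inside it are never mistaken for the action.
    if prompt and generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text

    decoder = json.JSONDecoder()
    start = completion.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value at `start` and ignores trailing text.
            action, _ = decoder.raw_decode(completion, start)
            return action
        except json.JSONDecodeError:
            start = completion.find("{", start + 1)
    return None


# Placeholder completion; the keys are invented for illustration only.
sample = 'Output the action as JSON: {"action": "staff_team", "team": ["analyst"]} Done.'
print(extract_action(sample))
```

Returning `None` on failure lets the caller decide whether to retry generation or fall back to a default action.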
Links
| Resource | Link |
|---|---|
| Environment | ConsultEnv on HF Spaces |
| Training notebook | train_sft_grpo_trl_v14.ipynb |
| Blog | BLOG.md |
| Training plots | training_overview.png |
License
MIT