TextWorld GRPO, BehR-only (facts=0, len_penalty=0). Base: Textworld-Qwen2.5-7B. Exponential reward, lr=5e-6, KL=0.001, n=5, T=1.3, 8xA100.
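
For readers unfamiliar with the shorthand, the settings above can be collected into a small config object, sketched below together with the group-relative advantage step that GRPO is generally built around. This is an illustrative sketch, not the training code behind this run. The field names are hypothetical, and it assumes that KL is the KL-penalty coefficient, n is the number of rollouts sampled per prompt, and T is the sampling temperature. The reward values are made up, and the shape of the exponential reward is not specified in the description, so it is not modelled here.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class RunConfig:
    """Hyperparameters from the description; field names are hypothetical."""
    base_model: str = "Textworld-Qwen2.5-7B"
    facts_weight: float = 0.0    # facts=0
    len_penalty: float = 0.0     # len_penalty=0
    learning_rate: float = 5e-6  # lr=5e-6
    kl_coef: float = 0.001       # KL=0.001 (assumed: KL-penalty coefficient)
    group_size: int = 5          # n=5 (assumed: rollouts per prompt)
    temperature: float = 1.3     # T=1.3 (assumed: sampling temperature)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


cfg = RunConfig()
rewards = [0.1, 0.4, 0.4, 0.9, 0.2]  # made-up rewards for one group of n=5 rollouts
advantages = group_relative_advantages(rewards)
```

Rollouts that score above their group's mean get a positive advantage and those below get a negative one, so no separate value model is needed. A group whose rollouts all receive the same reward yields all-zero advantages.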
YOULING HUANG
Ricardo-H
Recent Activity
updated a collection about 1 hour ago: tw-wm-0301
updated a model about 1 hour ago: Ricardo-H/ws-wm-0301-step-160
published a model about 1 hour ago: Ricardo-H/ws-wm-0301-step-160