Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math (Salesforce; submitted by taesiri)
Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels (Salesforce; submitted by weirayao)
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG (Salesforce; submitted by canqin001)