TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons Paper • 2504.19982 • Published Apr 28, 2025
PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents Paper • 2505.01592 • Published May 2, 2025
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models Paper • 2311.07022 • Published Nov 13, 2023
Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare Paper • 2404.16621 • Published Apr 25, 2024
ReSpAct: Harmonizing Reasoning, Speaking, and Acting Towards Building Large Language Model-Based Conversational AI Agents Paper • 2411.00927 • Published Nov 1, 2024