MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks Paper • 2410.10563 • Published Oct 14, 2024 • 38
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark Paper • 2406.01574 • Published Jun 3, 2024 • 44
MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation Paper • 2406.15252 • Published Jun 21, 2024 • 15
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs Paper • 2406.15319 • Published Jun 21, 2024 • 63
Investigating Answerability of LLMs for Long-Form Question Answering Paper • 2309.08210 • Published Sep 15, 2023 • 12
DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI Paper • 2307.10172 • Published Jul 19, 2023 • 12