WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines Paper • 2410.12705 • Published Oct 16 • 29
ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation Paper • 2410.01731 • Published Oct 2 • 15
BordIRlines: A Dataset for Evaluating Cross-lingual Retrieval-Augmented Generation Paper • 2410.01171 • Published Oct 2 • 5
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning Paper • 2409.12183 • Published Sep 18 • 36
Challenges and Responses in the Practice of Large Language Models Paper • 2408.09416 • Published Aug 18 • 1
Characterizing Prompt Compression Methods for Long Context Inference Paper • 2407.08892 • Published Jul 11 • 9
GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering Paper • 2409.06595 • Published Sep 10 • 37
LiveBench: A Challenging, Contamination-Free LLM Benchmark Paper • 2406.19314 • Published Jun 27 • 19
Efficient Detection of Toxic Prompts in Large Language Models Paper • 2408.11727 • Published Aug 21 • 12
BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval Paper • 2407.12883 • Published Jul 16 • 8
Writing in the Margins: Better Inference Pattern for Long Context Retrieval Paper • 2408.14906 • Published Aug 27 • 138
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets Paper • 2406.18518 • Published Jun 26 • 23
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models Paper • 2309.03883 • Published Sep 7, 2023 • 33