EMMA: End-to-End Multimodal Model for Autonomous Driving Paper • 2410.23262 • Published 10 days ago • 2
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models Paper • 2411.00836 • Published 11 days ago • 14
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models Paper • 2410.23266 • Published 10 days ago • 19
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents Paper • 2410.23218 • Published 10 days ago • 43
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays Paper • 2410.21969 • Published 12 days ago • 8
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark Paper • 2410.19168 • Published 16 days ago • 19
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data Paper • 2410.18558 • Published 17 days ago • 17
CLEAR: Character Unlearning in Textual and Visual Modalities Paper • 2410.18057 • Published 17 days ago • 197
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks Paper • 2410.19100 • Published 16 days ago • 6
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction Paper • 2410.21169 • Published 12 days ago • 29
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples Paper • 2410.14669 • Published 22 days ago • 35
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts Paper • 2410.18071 • Published 17 days ago • 6
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding Paper • 2410.17434 • Published 18 days ago • 24
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models Paper • 2410.17637 • Published 18 days ago • 34
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning Paper • 2410.17779 • Published 18 days ago • 7
WAFFLE: Multi-Modal Model for Automated Front-End Development Paper • 2410.18362 • Published 17 days ago • 11
Distill Visual Chart Reasoning Ability from LLMs to MLLMs Paper • 2410.18798 • Published 17 days ago • 19
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities Paper • 2410.11190 • Published 26 days ago • 20