M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models • Paper • 2411.04075 • Published 18 days ago • 14
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding • Paper • 2411.04952 • Published 17 days ago • 27
Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models • Paper • 2411.05005 • Published 17 days ago • 13
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning • Paper • 2411.05003 • Published 17 days ago • 69
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models • Paper • 2411.04996 • Published 17 days ago • 48
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? • Paper • 2411.05000 • Published 17 days ago • 21
Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model • Paper • 2411.04496 • Published 18 days ago • 22
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion • Paper • 2411.04928 • Published 17 days ago • 48
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models • Paper • 2411.04905 • Published 17 days ago • 109
Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models • Paper • 2408.15518 • Published Aug 28 • 42