Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models Paper • 2409.17146 • Published Sep 25 • 103
CLEAR: Character Unlearning in Textual and Visual Modalities Paper • 2410.18057 • Published Oct 23 • 200
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior Paper • 2410.21264 • Published 26 days ago • 8
Visual Context Window Extension: A New Perspective for Long Video Understanding Paper • 2409.20018 • Published Sep 30 • 9
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution Paper • 2409.12961 • Published Sep 19 • 24
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing Paper • 2409.01322 • Published Sep 2 • 94
CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis Paper • 2407.13301 • Published Jul 18 • 54
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding Paper • 2407.15754 • Published Jul 22 • 19
AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents Paper • 2407.04363 • Published Jul 5 • 26
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output Paper • 2407.03320 • Published Jul 3 • 92