Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs Paper • 2404.15406 • Published Apr 23, 2024
The (R)Evolution of Multimodal Large Language Models: A Survey Paper • 2402.12451 • Published Feb 19, 2024
Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization Paper • 2408.14547 • Published Aug 26, 2024
Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval Paper • 2503.01980 • Published Mar 3, 2025
LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning Paper • 2503.15621 • Published Mar 19, 2025
Recurrence Meets Transformers for Universal Multimodal Retrieval Paper • 2509.08897 • Published Sep 10, 2025
Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization Paper • 2508.20181 • Published Aug 27, 2025
Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models Paper • 2512.15885 • Published Dec 17, 2025
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering Paper • 2511.22715 • Published Mar 31