SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models • Paper • 2510.06917 • Published 10 days ago • 34
EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing • Paper • 2509.13399 • Published Sep 16 • 4
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action • Paper • 2303.11381 • Published Mar 20, 2023 • 2
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation • Paper • 2303.12346 • Published Mar 22, 2023 • 1
Equivariant Similarity for Vision-Language Foundation Models • Paper • 2303.14465 • Published Mar 25, 2023
Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation • Paper • 2304.06671 • Published Apr 13, 2023
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) • Paper • 2309.17421 • Published Sep 29, 2023 • 4
OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation • Paper • 2310.07749 • Published Oct 11, 2023 • 5
MultiSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos • Paper • 2306.04216 • Published Jun 7, 2023
InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models • Paper • 2312.13503 • Published Dec 21, 2023
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training • Paper • 2401.00849 • Published Jan 1, 2024 • 17
DisCo: Disentangled Control for Referring Human Dance Generation in Real World • Paper • 2307.00040 • Published Jun 30, 2023 • 25
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA • Paper • 2109.05014 • Published Sep 10, 2021 • 1
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling • Paper • 2111.12085 • Published Nov 23, 2021
GIT: A Generative Image-to-text Transformer for Vision and Language • Paper • 2205.14100 • Published May 27, 2022 • 1
Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition • Paper • 2403.12339 • Published Mar 19, 2024
PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3 • Paper • 2211.09699 • Published Nov 15, 2022 • 2