toread - a CCMat Collection

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Paper • 2311.17049 • Published Nov 28, 2023 • 1

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Paper • 2405.04434 • Published May 7, 2024 • 14

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Paper • 2401.10774 • Published Jan 19, 2024 • 54

InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation

Paper • 2404.19427 • Published Apr 30, 2024 • 72

CogVLM: Visual Expert for Pretrained Language Models

Paper • 2311.03079 • Published Nov 6, 2023 • 23

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Paper • 2404.06512 • Published Apr 9, 2024 • 30

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Paper • 2401.16420 • Published Jan 29, 2024 • 55

InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation

Paper • 2404.02733 • Published Apr 3, 2024 • 21

Demonstration-Regularized RL

Paper • 2310.17303 • Published Oct 26, 2023

Vision Transformers Need Registers

Paper • 2309.16588 • Published Sep 28, 2023 • 78

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

Paper • 2405.01434 • Published May 2, 2024 • 54

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Paper • 2404.19752 • Published Apr 30, 2024 • 24

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Paper • 2405.01535 • Published May 2, 2024 • 121

LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

Paper • 2405.00732 • Published Apr 29, 2024 • 120

RLHF Workflow: From Reward Modeling to Online RLHF

Paper • 2405.07863 • Published May 13, 2024 • 67

What matters when building vision-language models?

Paper • 2405.02246 • Published May 3, 2024 • 102

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Paper • 2405.08748 • Published May 14, 2024 • 22

Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

Paper • 2405.10300 • Published May 16, 2024 • 28

Many-Shot In-Context Learning in Multimodal Foundation Models

Paper • 2405.09798 • Published May 16, 2024 • 29

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Paper • 2405.10314 • Published May 16, 2024 • 46

LoRA Learns Less and Forgets Less

Paper • 2405.09673 • Published May 15, 2024 • 88

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Paper • 2405.09818 • Published May 16, 2024 • 129

Layer-Condensed KV Cache for Efficient Inference of Large Language Models

Paper • 2405.10637 • Published May 17, 2024 • 21

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Paper • 2405.11143 • Published May 20, 2024 • 36

MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning

Paper • 2405.12130 • Published May 20, 2024 • 47

FIFO-Diffusion: Generating Infinite Videos from Text without Training

Paper • 2405.11473 • Published May 19, 2024 • 54

Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control

Paper • 2405.12970 • Published May 21, 2024 • 24

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

Paper • 2405.12981 • Published May 21, 2024 • 29

Diffusion for World Modeling: Visual Details Matter in Atari

Paper • 2405.12399 • Published May 20, 2024 • 29

Your Transformer is Secretly Linear

Paper • 2405.12250 • Published May 19, 2024 • 151

ReVideo: Remake a Video with Motion and Content Control

Paper • 2405.13865 • Published May 22, 2024 • 23

Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27, 2024 • 31

An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27, 2024 • 87

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

Paper • 2405.15738 • Published May 24, 2024 • 44

Improving the Training of Rectified Flows

Paper • 2405.20320 • Published May 30, 2024 • 1

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Paper • 2403.03206 • Published Mar 5, 2024 • 61

BitsFusion: 1.99 bits Weight Quantization of Diffusion Model

Paper • 2406.04333 • Published Jun 6, 2024 • 37

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Paper • 2406.04325 • Published Jun 6, 2024 • 73

Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step

Paper • 2406.04314 • Published Jun 6, 2024 • 28

Block Transformer: Global-to-Local Language Modeling for Fast Inference

Paper • 2406.02657 • Published Jun 4, 2024 • 38

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

Paper • 2307.06304 • Published Jul 12, 2023 • 29

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

Paper • 2404.14619 • Published Apr 22, 2024 • 127

Multi-Head Mixture-of-Experts

Paper • 2404.15045 • Published Apr 23, 2024 • 60

Pegasus-v1 Technical Report

Paper • 2404.14687 • Published Apr 23, 2024 • 31

Towards Modular LLMs by Building and Reusing a Library of LoRAs

Paper • 2405.11157 • Published May 18, 2024 • 28

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24, 2024 • 53

Phased Consistency Model

Paper • 2405.18407 • Published May 28, 2024 • 46

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Paper • 2405.21060 • Published May 31, 2024 • 64

CRAG -- Comprehensive RAG Benchmark

Paper • 2406.04744 • Published Jun 7, 2024 • 45

DiTFastAttn: Attention Compression for Diffusion Transformer Models

Paper • 2406.08552 • Published Jun 12, 2024 • 24

Depth Anything V2

Paper • 2406.09414 • Published Jun 13, 2024 • 96

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Paper • 2406.09415 • Published Jun 13, 2024 • 51

The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing

Paper • 2406.10601 • Published Jun 15, 2024 • 66

EvTexture: Event-driven Texture Enhancement for Video Super-Resolution

Paper • 2406.13457 • Published Jun 19, 2024 • 17

Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation

Paper • 2406.12849 • Published Jun 18, 2024 • 50

Adam-mini: Use Fewer Learning Rates To Gain More

Paper • 2406.16793 • Published Jun 24, 2024 • 68

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Paper • 2406.16855 • Published Jun 24, 2024 • 55

Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

Paper • 2407.01392 • Published Jul 1, 2024 • 40

No Training, No Problem: Rethinking Classifier-Free Guidance for Diffusion Models

Paper • 2407.02687 • Published Jul 2, 2024 • 22

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Paper • 2407.03320 • Published Jul 3, 2024 • 93

Video Diffusion Alignment via Reward Gradients

Paper • 2407.08737 • Published Jul 11, 2024 • 48

Qwen2 Technical Report

Paper • 2407.10671 • Published Jul 15, 2024 • 161

Theia: Distilling Diverse Vision Foundation Models for Robot Learning

Paper • 2407.20179 • Published Jul 29, 2024 • 47

Gemma 2: Improving Open Language Models at a Practical Size

Paper • 2408.00118 • Published Jul 31, 2024 • 76

The Llama 3 Herd of Models

Paper • 2407.21783 • Published Jul 31, 2024 • 110

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Paper • 2408.00653 • Published Aug 1, 2024 • 29

SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1, 2024 • 112

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Paper • 2408.01800 • Published Aug 3, 2024 • 79

Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

Paper • 2408.02657 • Published Aug 5, 2024 • 33

IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts

Paper • 2408.03209 • Published Aug 6, 2024 • 22

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Paper • 2408.02718 • Published Aug 5, 2024 • 61

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

Paper • 2408.03361 • Published Aug 6, 2024 • 86

An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion

Paper • 2408.03178 • Published Aug 6, 2024 • 38

LLaVA-OneVision: Easy Visual Task Transfer

Paper • 2408.03326 • Published Aug 6, 2024 • 60

Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Paper • 2408.04567 • Published Aug 8, 2024 • 25

Transformer Explainer: Interactive Learning of Text-Generative Models

Paper • 2408.04619 • Published Aug 8, 2024 • 156

ControlNeXt: Powerful and Efficient Control for Image and Video Generation

Paper • 2408.06070 • Published Aug 12, 2024 • 53

Qwen2-Audio Technical Report

Paper • 2407.10759 • Published Jul 15, 2024 • 56

GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression

Paper • 2407.12077 • Published Jul 16, 2024 • 55

Compact Language Models via Pruning and Knowledge Distillation

Paper • 2407.14679 • Published Jul 19, 2024 • 39

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Paper • 2407.15841 • Published Jul 22, 2024 • 40

KAN or MLP: A Fairer Comparison

Paper • 2407.16674 • Published Jul 23, 2024 • 42

MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence

Paper • 2407.16655 • Published Jul 23, 2024 • 30

OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

Paper • 2407.16224 • Published Jul 23, 2024 • 27

MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization

Paper • 2408.02555 • Published Aug 5, 2024 • 29

Mixture of Nested Experts: Adaptive Processing of Visual Tokens

Paper • 2407.19985 • Published Jul 29, 2024 • 36

Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

Paper • 2407.16982 • Published Jul 24, 2024 • 41

VILA^2: VILA Augmented VILA

Paper • 2407.17453 • Published Jul 24, 2024 • 40

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Paper • 2408.06072 • Published Aug 12, 2024 • 37

Imagen 3

Paper • 2408.07009 • Published Aug 13, 2024 • 61

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Paper • 2408.08872 • Published Aug 16, 2024 • 98

MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model

Paper • 2408.10198 • Published Aug 19, 2024 • 33

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Paper • 2408.11039 • Published Aug 20, 2024 • 58

MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

Paper • 2408.11001 • Published Aug 20, 2024 • 12

Sapiens: Foundation for Human Vision Models

Paper • 2408.12569 • Published Aug 22, 2024 • 90

DreamCinema: Cinematic Transfer with Free Camera and 3D Character

Paper • 2408.12601 • Published Aug 22, 2024 • 29

Scalable Autoregressive Image Generation with Mamba

Paper • 2408.12245 • Published Aug 22, 2024 • 26

Building and better understanding vision-language models: insights and future directions

Paper • 2408.12637 • Published Aug 22, 2024 • 124

LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

Paper • 2408.13252 • Published Aug 23, 2024 • 25

SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher

Paper • 2408.14176 • Published Aug 26, 2024 • 61

Foundation Models for Music: A Survey

Paper • 2408.14340 • Published Aug 26, 2024 • 44

Diffusion Models Are Real-Time Game Engines

Paper • 2408.14837 • Published Aug 27, 2024 • 123

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Paper • 2408.15998 • Published Aug 28, 2024 • 85

CogVLM2: Visual Language Models for Image and Video Understanding

Paper • 2408.16500 • Published Aug 29, 2024 • 57

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Paper • 2408.16532 • Published Aug 29, 2024 • 48

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

Paper • 2408.16767 • Published Aug 29, 2024 • 30

CSGO: Content-Style Composition in Text-to-Image Generation

Paper • 2408.16766 • Published Aug 29, 2024 • 18

CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization

Paper • 2408.15914 • Published Aug 28, 2024 • 23

LinFusion: 1 GPU, 1 Minute, 16K Image

Paper • 2409.02097 • Published Sep 3, 2024 • 33

Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

Paper • 2409.02634 • Published Sep 4, 2024 • 92

Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing

Paper • 2409.01322 • Published Sep 2, 2024 • 95

Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation

Paper • 2409.03718 • Published Sep 5, 2024 • 26

Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

Paper • 2404.12387 • Published Apr 18, 2024 • 39

Dynamic Typography: Bringing Words to Life

Paper • 2404.11614 • Published Apr 17, 2024 • 45

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Paper • 2404.14219 • Published Apr 22, 2024 • 255

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

Paper • 2404.16710 • Published Apr 25, 2024 • 77

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Paper • 2404.16821 • Published Apr 25, 2024 • 56

Iterative Reasoning Preference Optimization

Paper • 2404.19733 • Published Apr 30, 2024 • 48

KAN: Kolmogorov-Arnold Networks

Paper • 2404.19756 • Published Apr 30, 2024 • 109

OmniGen: Unified Image Generation

Paper • 2409.11340 • Published Sep 17, 2024 • 109

IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation

Paper • 2409.08240 • Published Sep 12, 2024 • 19

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

Paper • 2409.07452 • Published Sep 11, 2024 • 20

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Paper • 2409.02795 • Published Sep 4, 2024 • 72

Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

Paper • 2409.11355 • Published Sep 17, 2024 • 29

Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Paper • 2409.11406 • Published Sep 17, 2024 • 26

Qwen2.5-Coder Technical Report

Paper • 2409.12186 • Published Sep 18, 2024 • 140

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Paper • 2409.12191 • Published Sep 18, 2024 • 76

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Paper • 2312.14125 • Published Dec 21, 2023 • 45

Training Language Models to Self-Correct via Reinforcement Learning

Paper • 2409.12917 • Published Sep 19, 2024 • 136

Imagine yourself: Tuning-Free Personalized Image Generation

Paper • 2409.13346 • Published Sep 20, 2024 • 68

Colorful Diffuse Intrinsic Image Decomposition in the Wild

Paper • 2409.13690 • Published Sep 20, 2024 • 13

Emu3: Next-Token Prediction is All You Need

Paper • 2409.18869 • Published Sep 27, 2024 • 94

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Paper • 2409.20566 • Published Sep 30, 2024 • 55

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Paper • 2409.19603 • Published Sep 29, 2024 • 19

EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis

Paper • 2410.01804 • Published Oct 2, 2024 • 5

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Paper • 2410.02740 • Published Oct 3, 2024 • 52

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Paper • 2410.02757 • Published Oct 3, 2024 • 36

RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

Paper • 2410.01044 • Published Oct 1, 2024 • 34

PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation

Paper • 2410.01680 • Published Oct 2, 2024 • 33

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Paper • 2410.02073 • Published Oct 2, 2024 • 40

Baichuan-Omni Technical Report

Paper • 2410.08565 • Published Oct 11, 2024 • 85

DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

Paper • 2410.08159 • Published Oct 10, 2024 • 25

Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Paper • 2410.10306 • Published Oct 14, 2024 • 54

Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices

Paper • 2410.11795 • Published Oct 15, 2024 • 17

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Paper • 2410.16268 • Published Oct 21, 2024 • 66

SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes

Paper • 2410.17249 • Published Oct 22, 2024 • 41

Movie Gen: A Cast of Media Foundation Models

Paper • 2410.13720 • Published Oct 17, 2024 • 91

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Paper • 2410.13863 • Published Oct 17, 2024 • 37

FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors

Paper • 2410.16271 • Published Oct 21, 2024 • 81

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Paper • 2410.13861 • Published Oct 17, 2024 • 53

Unbounded: A Generative Infinite Game of Character Life Simulation

Paper • 2410.18975 • Published Oct 24, 2024 • 35

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

Paper • 2410.17243 • Published Oct 22, 2024 • 89

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Paper • 2410.06940 • Published Oct 9, 2024 • 7

Addition is All You Need for Energy-efficient Language Models

Paper • 2410.00907 • Published Oct 1, 2024 • 145

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Paper • 2410.13848 • Published Oct 17, 2024 • 32

Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations

Paper • 2410.10792 • Published Oct 14, 2024 • 29

CLEAR: Character Unlearning in Textual and Visual Modalities

Paper • 2410.18057 • Published Oct 23, 2024 • 200

In-Context LoRA for Diffusion Transformers

Paper • 2410.23775 • Published Oct 31, 2024 • 11

LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation

Paper • 2411.04997 • Published Nov 7, 2024 • 37

Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models

Paper • 2411.07232 • Published Nov 11, 2024 • 63

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Paper • 2411.07199 • Published Nov 11, 2024 • 47

Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models

Paper • 2411.07126 • Published Nov 11, 2024 • 28

SAMPart3D: Segment Any Part in 3D Objects

Paper • 2411.07184 • Published Nov 11, 2024 • 26

Large Language Models Can Self-Improve in Long-context Reasoning

Paper • 2411.08147 • Published Nov 12, 2024 • 63

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

Paper • 2411.08380 • Published Nov 13, 2024 • 25

LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

Paper • 2411.09595 • Published Nov 14, 2024 • 72

MagicQuill: An Intelligent Interactive Image Editing System

Paper • 2411.09703 • Published Nov 14, 2024 • 64

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Paper • 2411.10440 • Published Nov 15, 2024 • 113

GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation

Paper • 2411.08033 • Published Nov 12, 2024 • 22

Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement

Paper • 2411.06558 • Published Nov 10, 2024 • 34

Generative World Explorer

Paper • 2411.11844 • Published Nov 18, 2024 • 75

AnimateAnything: Consistent and Controllable Animation for Video Generation

Paper • 2411.10836 • Published Nov 16, 2024 • 23

RedPajama: an Open Dataset for Training Large Language Models

Paper • 2411.12372 • Published Nov 19, 2024 • 48

FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations

Paper • 2411.10818 • Published Nov 16, 2024 • 24

SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration

Paper • 2411.10958 • Published Nov 17, 2024 • 52

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

Paper • 2411.11922 • Published Nov 18, 2024 • 18

Stylecodes: Encoding Stylistic Information For Image Generation

Paper • 2411.12811 • Published Nov 19, 2024 • 11

Multimodal Autoregressive Pre-training of Large Vision Encoders

Paper • 2411.14402 • Published Nov 21, 2024 • 43

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Paper • 2411.14432 • Published Nov 21, 2024 • 23

Stable Flow: Vital Layers for Training-Free Image Editing

Paper • 2411.14430 • Published Nov 21, 2024 • 14

Style-Friendly SNR Sampler for Style-Driven Generation

Paper • 2411.14793 • Published Nov 22, 2024 • 36

Star Attention: Efficient LLM Inference over Long Sequences

Paper • 2411.17116 • Published Nov 26, 2024 • 49

Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

Paper • 2411.15466 • Published Nov 23, 2024 • 35

Material Anything: Generating Materials for Any 3D Object via Diffusion

Paper • 2411.15138 • Published Nov 22, 2024 • 42

OminiControl: Minimal and Universal Control for Diffusion Transformer

Paper • 2411.15098 • Published Nov 22, 2024 • 55

World-consistent Video Diffusion with Explicit 3D Modeling

Paper • 2412.01821 • Published Dec 2, 2024 • 4

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

Paper • 2412.01800 • Published Dec 2, 2024 • 6

WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

Paper • 2411.17459 • Published Nov 26, 2024 • 10

Art-Free Generative Models: Art Creation Without Graphic Art Knowledge

Paper • 2412.00176 • Published Nov 29, 2024 • 8

VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation

Paper • 2412.02259 • Published Dec 3, 2024 • 59

Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability

Paper • 2411.19943 • Published Nov 29, 2024 • 57

PaliGemma 2: A Family of Versatile VLMs for Transfer

Paper • 2412.03555 • Published Dec 4, 2024 • 124

SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance

Paper • 2412.02687 • Published Dec 3, 2024 • 108

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Paper • 2412.03069 • Published Dec 4, 2024 • 30

Imagine360: Immersive 360 Video Generation from Perspective Anchor

Paper • 2412.03552 • Published Dec 4, 2024 • 26

Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion

Paper • 2412.03515 • Published Dec 4, 2024 • 25

FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Paper • 2412.01064 • Published Dec 2, 2024 • 26

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Paper • 2412.00927 • Published Dec 1, 2024 • 26

Open-Sora Plan: Open-Source Large Video Generation Model

Paper • 2412.00131 • Published Nov 28, 2024 • 33

SpotLight: Shadow-Guided Object Relighting via Diffusion

Paper • 2411.18665 • Published Nov 27, 2024 • 3

Video Depth without Video Models

Paper • 2411.19189 • Published Nov 28, 2024 • 34

TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models

Paper • 2411.18350 • Published Nov 27, 2024 • 24

CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

Paper • 2411.18613 • Published Nov 27, 2024 • 50

FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

Paper • 2411.15411 • Published Nov 23, 2024 • 7

SketchAgent: Language-Driven Sequential Sketch Generation

Paper • 2411.17673 • Published Nov 26, 2024 • 18

Pathways on the Image Manifold: Image Editing via Video Generation

Paper • 2411.16819 • Published Nov 25, 2024 • 31

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Paper • 2411.17440 • Published Nov 26, 2024 • 35

ROICtrl: Boosting Instance Control for Visual Generation

Paper • 2411.17949 • Published Nov 27, 2024 • 82

CleanDIFT: Diffusion Features without Noise

Paper • 2412.03439 • Published Dec 4, 2024 • 12

LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

Paper • 2412.00177 • Published Nov 29, 2024 • 7

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Paper • 2412.04467 • Published Dec 5, 2024 • 105

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Paper • 2412.04424 • Published Dec 5, 2024 • 59

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Paper • 2412.04454 • Published Dec 5, 2024 • 59

Structured 3D Latents for Scalable and Versatile 3D Generation

Paper • 2412.01506 • Published Dec 2, 2024 • 55

A Noise is Worth Diffusion Guidance

Paper • 2412.03895 • Published Dec 5, 2024 • 28

AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

Paper • 2412.04146 • Published Dec 5, 2024 • 22

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 128

SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion

Paper • 2412.04301 • Published Dec 5, 2024 • 34

APOLLO: SGD-like Memory, AdamW-level Performance

Paper • 2412.05270 • Published Dec 6, 2024 • 38

STIV: Scalable Text and Image Conditioned Video Generation

Paper • 2412.07730 • Published Dec 10, 2024 • 71

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

Paper • 2412.07774 • Published Dec 10, 2024 • 26

Mobile Video Diffusion

Paper • 2412.07583 • Published Dec 10, 2024 • 20

Video Motion Transfer with Diffusion Transformers

Paper • 2412.07776 • Published Dec 10, 2024 • 17

LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation

Paper • 2412.05148 • Published Dec 6, 2024 • 11

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

Paper • 2412.07760 • Published Dec 10, 2024 • 50

StyleMaster: Stylize Your Video with Artistic Generation and Translation

Paper • 2412.07744 • Published Dec 10, 2024 • 19

Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation

Paper • 2412.06016 • Published Dec 8, 2024 • 20

Learning Flow Fields in Attention for Controllable Person Image Generation

Paper • 2412.08486 • Published Dec 11, 2024 • 32

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Paper • 2412.09596 • Published Dec 12, 2024 • 93

Phi-4 Technical Report

Paper • 2412.08905 • Published Dec 12, 2024 • 104

Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion

Paper • 2412.09593 • Published Dec 12, 2024 • 18

Arbitrary-steps Image Super-resolution via Diffusion Inversion

Paper • 2412.09013 • Published Dec 12, 2024 • 11

DisPose: Disentangling Pose Guidance for Controllable Human Image Animation

Paper • 2412.09349 • Published Dec 12, 2024 • 8

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Paper • 2412.15213 • Published Dec 19, 2024 • 26

Parallelized Autoregressive Visual Generation

Paper • 2412.15119 • Published Dec 19, 2024 • 50

SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

Paper • 2412.13649 • Published Dec 18, 2024 • 20

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

Paper • 2412.17256 • Published 29 days ago • 45

Qwen2.5 Technical Report

Paper • 2412.15115 • Published Dec 19, 2024 • 340

LoRACLR: Contrastive Adaptation for Customization of Diffusion Models

Paper • 2412.09622 • Published Dec 12, 2024 • 8

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Paper • 2412.10360 • Published Dec 13, 2024 • 139

GenEx: Generating an Explorable World

Paper • 2412.09624 • Published Dec 12, 2024 • 88

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Paper • 2412.09604 • Published Dec 12, 2024 • 35

FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

Paper • 2412.09626 • Published Dec 12, 2024 • 20

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Paper • 2412.09283 • Published Dec 12, 2024 • 19

FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers

Paper • 2412.09611 • Published Dec 12, 2024 • 9

Byte Latent Transformer: Patches Scale Better Than Tokens

Paper • 2412.09871 • Published Dec 13, 2024 • 89

BrushEdit: All-In-One Image Inpainting and Editing

Paper • 2412.10316 • Published Dec 13, 2024 • 33

ColorFlow: Retrieval-Augmented Image Sequence Colorization

Paper • 2412.11815 • Published Dec 16, 2024 • 26

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Paper • 2412.14171 • Published Dec 18, 2024 • 24

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Paper • 2412.14015 • Published Dec 18, 2024 • 12

Diffusion360: Seamless 360 Degree Panoramic Image Generation based on Diffusion Models

Paper • 2311.13141 • Published Nov 22, 2023 • 13

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Paper • 2501.00958 • Published 19 days ago • 97

VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

Paper • 2501.01427 • Published 18 days ago • 49

LTX-Video: Realtime Video Latent Diffusion

Paper • 2501.00103 • Published 21 days ago • 41

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Paper • 2501.01423 • Published 18 days ago • 36

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

Paper • 2412.19723 • Published 25 days ago • 80

1.58-bit FLUX

Paper • 2412.18653 • Published 27 days ago • 72

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

Paper • 2412.18605 • Published 27 days ago • 20

DepthLab: From Partial to Complete

Paper • 2412.18153 • Published 28 days ago • 34

Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

Paper • 2412.17739 • Published 29 days ago • 39

In Case You Missed It: ARC 'Challenge' Is Not That Challenging

Paper • 2412.17758 • Published 28 days ago • 16

GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs

Paper • 2412.11258 • Published Dec 15, 2024 • 13

DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes

Paper • 2412.11100 • Published Dec 15, 2024 • 6

ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation

Paper • 2412.08645 • Published Dec 11, 2024 • 11

FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing

Paper • 2412.07517 • Published Dec 10, 2024 • 11

SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

Paper • 2412.09619 • Published Dec 12, 2024 • 22

PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations

Paper • 2412.05994 • Published Dec 8, 2024 • 18

LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations

Paper • 2412.08580 • Published Dec 11, 2024 • 45

StreamChat: Chatting with Streaming Video

Paper • 2412.08646 • Published Dec 11, 2024 • 18

Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction

Paper • 2412.06234 • Published Dec 9, 2024 • 19

Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation

Paper • 2412.07797 • Published Dec 5, 2024 • 11

StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements

Paper • 2412.08503 • Published Dec 11, 2024 • 8

ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

Paper • 2412.07720 • Published Dec 10, 2024 • 30

Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Paper • 2412.06781 • Published Dec 9, 2024 • 19

3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes

Paper • 2411.14974 • Published Nov 22, 2024 • 17

TEXGen: a Generative Diffusion Model for Mesh Textures

Paper • 2411.14740 • Published Nov 22, 2024 • 15

SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis

Paper • 2411.16443 • Published Nov 25, 2024 • 9

Scaling Properties of Diffusion Models for Perceptual Tasks

Paper • 2411.08034 • Published Nov 12, 2024 • 13

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Paper • 2411.04996 • Published Nov 7, 2024 • 50

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

Paper • 2411.04928 • Published Nov 7, 2024 • 49

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

Paper • 2411.05003 • Published Nov 7, 2024 • 70

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Paper • 2411.02355 • Published Nov 4, 2024 • 47

How Far is Video Generation from World Model: A Physical Law Perspective

Paper • 2411.02385 • Published Nov 4, 2024 • 33

Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Paper • 2411.02265 • Published Nov 4, 2024 • 24

Adaptive Caching for Faster Video Generation with Diffusion Transformers

Paper • 2411.02397 • Published Nov 4, 2024 • 23

MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D

Paper • 2411.02336 • Published Nov 4, 2024 • 23

AutoVFX: Physically Realistic Video Editing from Natural Language Instructions

Paper • 2411.02394 • Published Nov 4, 2024 • 17

GenXD: Generating Any 3D and 4D Scenes

Paper • 2411.02319 • Published Nov 4, 2024 • 20

Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

Paper • 2410.22366 • Published Oct 28, 2024 • 77

One Shot, One Talk: Whole-body Talking Avatar from a Single Image

Paper • 2412.01106 • Published Dec 2, 2024 • 18

Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling

Paper • 2411.18664 • Published Nov 27, 2024 • 24

FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Paper • 2411.18552 • Published Nov 27, 2024 • 17

HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing

Paper • 2412.04280 • Published Dec 5, 2024 • 13

MV-Adapter: Multi-view Consistent Image Generation Made Easy

Paper • 2412.03632 • Published Dec 4, 2024 • 23

PanoDreamer: 3D Panorama Synthesis from a Single Image

Paper • 2412.04827 • Published Dec 6, 2024 • 10

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Paper • 2412.04440 • Published Dec 5, 2024 • 19

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Paper • 2412.06699 • Published Dec 9, 2024 • 12

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published 13 days ago • 40

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Paper • 2501.02955 • Published 15 days ago • 40

Cosmos World Foundation Model Platform for Physical AI

Paper • 2501.03575 • Published 14 days ago • 64

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Paper • 2501.03218 • Published 14 days ago • 34

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Paper • 2501.02976 • Published 15 days ago • 51

An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published 11 days ago • 36

OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints

Paper • 2501.03841 • Published 14 days ago • 49

VideoRAG: Retrieval-Augmented Generation over Video Corpus

Paper • 2501.05874 • Published 11 days ago • 65