- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 25
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 12
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 38
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 19
Collections including paper arxiv:2410.01912
- LinFusion: 1 GPU, 1 Minute, 16K Image
  Paper • 2409.02097 • Published • 31
- Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
  Paper • 2409.11406 • Published • 25
- Diffusion Models Are Real-Time Game Engines
  Paper • 2408.14837 • Published • 121
- Segment Anything with Multiple Modalities
  Paper • 2408.09085 • Published • 21

- MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
  Paper • 2406.18790 • Published • 33
- OmniGen: Unified Image Generation
  Paper • 2409.11340 • Published • 106
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Paper • 2408.12528 • Published • 50
- MonoFormer/MonoFormer_ImageNet_256
  Updated • 29 • 4

- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
  Paper • 2401.14405 • Published • 11
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
  Paper • 2406.18521 • Published • 25
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
  Paper • 2408.12590 • Published • 33
- Law of Vision Representation in MLLMs
  Paper • 2408.16357 • Published • 92

- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
  Paper • 2406.06525 • Published • 64
- Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
  Paper • 2405.21048 • Published • 12
- Scalable Autoregressive Image Generation with Mamba
  Paper • 2408.12245 • Published • 23
- DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
  Paper • 2410.08159 • Published • 23