Models
Datasets
Spaces
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2508.02324

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 189
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1, 2024 • 17
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 51
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 43

about 19 hours ago

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 29
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 13
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 45
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 24

HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels

Paper • 2507.21809 • Published Jul 29 • 124
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion

Paper • 2507.06165 • Published Jul 8 • 55
DINOv3

Paper • 2508.10104 • Published 23 days ago • 236
Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241

Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241

about 1 month ago

Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241

Image Generation

OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation

Paper • 2506.07977 • Published Jun 9 • 41
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Paper • 2506.07986 • Published Jun 9 • 19
STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Paper • 2506.06276 • Published Jun 6 • 22
Aligning Latent Spaces with Flow Priors

Paper • 2506.05240 • Published Jun 5 • 27

Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

Paper • 2504.12626 • Published Apr 17 • 52
Qwen3 Technical Report

Paper • 2505.09388 • Published May 14 • 285
Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241
DINOv3

Paper • 2508.10104 • Published 23 days ago • 236

Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241

Foundation model

Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

Paper • 2508.02193 • Published Aug 4 • 129

Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241
Qwen/Qwen-Image

Text-to-Image • Updated 19 days ago • 172k • • 1.97k

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 189
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1, 2024 • 17
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 51
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 43

Image Generation

OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation

Paper • 2506.07977 • Published Jun 9 • 41
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Paper • 2506.07986 • Published Jun 9 • 19
STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Paper • 2506.06276 • Published Jun 6 • 22
Aligning Latent Spaces with Flow Priors

Paper • 2506.05240 • Published Jun 5 • 27

about 19 hours ago

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 29
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 13
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 45
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 24

Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

Paper • 2504.12626 • Published Apr 17 • 52
Qwen3 Technical Report

Paper • 2505.09388 • Published May 14 • 285
Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241
DINOv3

Paper • 2508.10104 • Published 23 days ago • 236

HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels

Paper • 2507.21809 • Published Jul 29 • 124
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion

Paper • 2507.06165 • Published Jul 8 • 55
DINOv3

Paper • 2508.10104 • Published 23 days ago • 236
Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241

Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241

Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241

Foundation model

Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

Paper • 2508.02193 • Published Aug 4 • 129

about 1 month ago

Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241

Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 241
Qwen/Qwen-Image

Text-to-Image • Updated 19 days ago • 172k • • 1.97k

Previous
1
2
Next

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs