- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
  Paper • 2404.15653 • Published • 26
- MoDE: CLIP Data Experts via Clustering
  Paper • 2404.16030 • Published • 12
- MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
  Paper • 2405.12130 • Published • 46
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
  Paper • 2405.12981 • Published • 28
Collections including paper arxiv:2412.09871
- GenEx: Generating an Explorable World
  Paper • 2412.09624 • Published • 77
- IamCreateAI/Ruyi-Mini-7B
  Image-to-Video • Updated • 1.11k • 193
- Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation
  Paper • 2412.06016 • Published • 20
- Byte Latent Transformer: Patches Scale Better Than Tokens
  Paper • 2412.09871 • Published • 50

- Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
  Paper • 2411.06558 • Published • 34
- SlimLM: An Efficient Small Language Model for On-Device Document Assistance
  Paper • 2411.09944 • Published • 12
- Look Every Frame All at Once: Video-Ma^2mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing
  Paper • 2411.19460 • Published • 10
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
  Paper • 2412.05237 • Published • 44

- Differential Transformer
  Paper • 2410.05258 • Published • 167
- PaliGemma 2: A Family of Versatile VLMs for Transfer
  Paper • 2412.03555 • Published • 116
- VisionZip: Longer is Better but Not Necessary in Vision Language Models
  Paper • 2412.04467 • Published • 103
- o1-Coder: an o1 Replication for Coding
  Paper • 2412.00154 • Published • 39