- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23
Collections including paper arXiv:2510.25889

- π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models
  Paper • 2510.25889 • Published • 58
- Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
  Paper • 2510.27607 • Published • 8
- A Survey on Efficient Vision-Language-Action Models
  Paper • 2510.24795 • Published • 5

- ARE: Scaling Up Agent Environments and Evaluations
  Paper • 2509.17158 • Published • 34
- ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation
  Paper • 2510.08551 • Published • 31
- Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
  Paper • 2510.04212 • Published • 22
- ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning
  Paper • 2510.12693 • Published • 26

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  Paper • 2507.01925 • Published • 38
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
  Paper • 2507.16746 • Published • 34
- MolmoAct: Action Reasoning Models that can Reason in Space
  Paper • 2508.07917 • Published • 44
- Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
  Paper • 2508.20072 • Published • 31

- NousResearch/Hermes-4-70B-FP8
  Text Generation • 71B • Updated • 821 • 24
- NousResearch/Hermes-4-405B-FP8
  Text Generation • 406B • Updated • 1.1k • 19
- Think in Games: Learning to Reason in Games via Reinforcement Learning with Large Language Models
  Paper • 2508.21365 • Published • 29
- BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
  Paper • 2509.15566 • Published • 14

- Unified Vision-Language-Action Model
  Paper • 2506.19850 • Published • 27
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
  Paper • 2506.01844 • Published • 141
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
  Paper • 2403.09631 • Published • 11
- QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
  Paper • 2312.14457 • Published • 1