X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
Abstract
A novel Soft Prompt approach enhances Vision-Language-Action models by using learnable embeddings for diverse robotic data, enabling superior performance across simulations and real-world robots.
Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity of rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimal added parameters: we infuse prompt-learning concepts into cross-embodiment robot learning and introduce a separate set of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which together empower VLA models to effectively exploit varying cross-embodiment features. Our new X-VLA, a clean flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulation benchmarks and 3 real-world robots, our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves SOTA performance across a sweep of benchmarks, demonstrating superior results along a wide range of capability axes, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/
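To make the core idea concrete, below is a minimal PyTorch sketch of embodiment-specific soft prompts prepended to the token sequence of a standard Transformer encoder, as the abstract describes. The class name, prompt length, model dimensions, and forward signature are illustrative assumptions, not the released X-VLA implementation.

```python
# Minimal sketch (assumed names/sizes): one learnable prompt table per data
# source, prepended to the fused token sequence before a shared encoder.
import torch
import torch.nn as nn


class SoftPromptedEncoder(nn.Module):
    def __init__(self, num_embodiments: int, prompt_len: int = 16,
                 d_model: int = 768, n_heads: int = 12, n_layers: int = 6):
        super().__init__()
        # Separate set of learnable embeddings for each embodiment/data source.
        self.prompts = nn.Parameter(
            torch.randn(num_embodiments, prompt_len, d_model) * 0.02
        )
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, tokens: torch.Tensor, embodiment_id: torch.Tensor):
        # tokens: (B, T, d_model) fused vision/language/state tokens
        # embodiment_id: (B,) integer index of the source robot/dataset
        prompt = self.prompts[embodiment_id]       # (B, prompt_len, d_model)
        x = torch.cat([prompt, tokens], dim=1)     # prepend soft prompt
        return self.encoder(x)


# Usage: two batch elements from different embodiments share the same
# backbone but select their own prompt sets.
model = SoftPromptedEncoder(num_embodiments=4)
tokens = torch.randn(2, 32, 768)
out = model(tokens, embodiment_id=torch.tensor([0, 3]))
print(out.shape)  # torch.Size([2, 48, 768])
```

The design choice illustrated here is that only the small prompt tables are embodiment-specific; the Transformer backbone and all remaining parameters are shared across data sources, which is what keeps the added parameter count minimal.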
Community

The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning (2025)
- VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation (2025)
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions (2025)
- Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance (2025)
- Contrastive Representation Regularization for Vision-Language-Action Models (2025)
- Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model (2025)
- RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation (2025)