🏁 Best viewed with sound on

F1: A Vision Language Action Model Bridging
Understanding and Generation to Actions

Paper Code Website

🚀 Key Innovations

  • 🧠 Predictive Inverse Dynamics: Visual foresight generation for planning-based control
  • 🏗️ Mixture-of-Transformer: Three specialized experts (Understanding, Generation, Action)
  • 📈 Three-Stage Training: Progressive alignment, pretraining, and adaptation
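
As a rough illustration of the predictive inverse dynamics idea listed above, the toy sketch below first generates a "foresight" (a predicted future observation) and then infers the action that moves the current observation toward it. Everything in it is hypothetical: `predict_foresight`, `inverse_dynamics`, `rollout`, the list-of-floats state, and the additive dynamics are stand-ins chosen for illustration, not F1's models or API.

```python
from typing import List

State = List[float]  # toy stand-in for an image observation (assumption)


def predict_foresight(obs: State, goal: State, step: float = 0.5) -> State:
    """Hypothetical 'generation' stage: predict an observation partway to the goal."""
    return [o + step * (g - o) for o, g in zip(obs, goal)]


def inverse_dynamics(obs: State, future: State) -> State:
    """Hypothetical 'action' stage for toy dynamics next = obs + action."""
    return [f - o for o, f in zip(obs, future)]


def rollout(obs: State, goal: State, steps: int = 5) -> State:
    """Alternate foresight prediction and action inference for a few steps."""
    for _ in range(steps):
        future = predict_foresight(obs, goal)      # plan: where should we be next?
        action = inverse_dynamics(obs, future)     # act: what gets us there?
        obs = [o + a for o, a in zip(obs, action)]  # apply the toy dynamics
    return obs


if __name__ == "__main__":
    print(rollout([0.0, 0.0], [1.0, 2.0]))
```

In this toy setting the action is simply the difference between the predicted and current state; in a real system both stages would be learned models operating on images and language instructions.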

🤖 Real-World Robot Experiments

Diverse manipulation tasks across multiple robot platforms.

📊 Performance Summary

Task          Platform      F1      π0      Improvement
Multi-task    Genie-1       82.2%   65.2%   +17.0%
Adaptation    Franka        66.7%   53.3%   +13.4%
Long-horizon  ARX LIFT II   40.0%   0.0%    +40.0%
Dynamic Env   ARX LIFT II   66.7%   33.3%   +33.4%
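
The Improvement column is the absolute difference between the F1 and π0 scores, in percentage points. The short check below recomputes it from the numbers in the table; the `results` list and the `improvement_points` helper are illustrative names, not part of the F1-VLA codebase.

```python
# Scores (%) copied from the table above: (task, platform, F1, π0 baseline).
results = [
    ("Multi-task", "Genie-1", 82.2, 65.2),
    ("Adaptation", "Franka", 66.7, 53.3),
    ("Long-horizon", "ARX LIFT II", 40.0, 0.0),
    ("Dynamic Env", "ARX LIFT II", 66.7, 33.3),
]


def improvement_points(f1: float, baseline: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(f1 - baseline, 1)


if __name__ == "__main__":
    for task, platform, f1, baseline in results:
        gain = improvement_points(f1, baseline)
        print(f"{task:<14}{platform:<13}{gain:+.1f}")
```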

Usage

For installation and usage instructions, please refer to our official repository, F1-VLA.

📚 Citation

If you find our work helpful, please cite:

@article{f1_vla_2025,
  title={F1: A Vision Language Action Model Bridging Understanding and Generation to Actions},
  author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Jiangmiao Pang},
  journal={arXiv preprint arXiv:2509.06951},
  year={2025},
  url={https://arxiv.org/abs/2509.06951}
}

License

This work is released under the CC BY-NC-SA 4.0 license.

Acknowledgements

This repository is based on Lerobot, Any4lerobot, and VAR.

Safetensors
Model size: 4.19B params
Tensor types: I64 · F32 · BF16