---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- youliangtan/bridge_dataset
language:
- en
license: mit
metrics:
- accuracy
- bleu
tags:
- Robot Control
- Generalist robot policies
- VLA
- Embodied AI
- Unified Model
- multimodal
- large embodied model
pipeline_tag: robotics
library_name: transformers
---



<p align="left">
  <a href="https://eo-robotics.ai/eo-1">
    <img
      src="https://img.shields.io/badge/EO--Robotics-Website-5865F2?logo=googleplay&logoColor=white"
      alt="EO-Robotics Website"
    />
  </a>
  <a href="https://arxiv.org/abs/2508.21112">
    <img
      src="https://img.shields.io/badge/EO--1-Paper-red?logo=arxiv&logoColor=red"
      alt="EO-Robotics Paper on arXiv"
    />
  </a>
  <a href="https://github.com/EO-Robotics/EO1">
    <img 
        src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" 
        alt="GitHub Code"
    />
  </a>
  <a href="https://huggingface.co/collections/IPEC-COMMUNITY/eo-robotics-68ac4ff30e1f746cac28ca14">
    <img 
        src="https://img.shields.io/badge/EO--1--3B-Model-FFCC11?logo=huggingface&logoColor=brightyellow" 
        alt="EO-1 Model"
    />
  </a>
  <a href="https://huggingface.co/spaces/IPEC-COMMUNITY/EO-Robotics">
    <img 
        src="https://img.shields.io/badge/EO--Robotics-Space-orange?logo=huggingface&logoColor=brightyellow" 
        alt="EO-Robotics Model"
    />
  </a>
  <a href="https://discord.gg/JqfDs6va">
    <img
      src="https://img.shields.io/badge/EO--Robotics-Discord-155dfc?logo=discord&logoColor=lightblue"
      alt="EO-Robotics Discord"
    />
  </a>
  <a href="mailto:wangdong@pjlab.org.cn">
    <img
      src="https://img.shields.io/badge/EO--Robotics-Email-D14836?logo=gmail&logoColor=red"
      alt="EO-Robotics Email"
    />
  </a>
  <a href="https://huggingface.co/datasets/IPEC-COMMUNITY/EO-Data1.5M">
    <img
      src="https://img.shields.io/badge/Dataset-EO--Data1.5M-brightgreen?logo=huggingface&logoColor=brightyellow"
      alt="EO-1.5M"
    />
  </a>
</p>

## Interleaved Vision-Text-Action Pretraining for General Robot Control

We introduce **EO-1**, an open-source unified embodied foundation model with 3B parameters, trained on the carefully curated interleaved embodied dataset EO-Data1.5M together with web multimodal data and robot control data (AgiBotWorld, Open X-Embodiment, RoboMIND, SO100-Community, etc.). The **EO-1** model adopts a single unified decoder-only transformer that integrates discrete autoregressive decoding with continuous flow-matching denoising for multimodal embodied reasoning and robot control, enabling seamless perception, planning, reasoning, and acting in a single model. This work highlights the following features:

- ⚡ **Unified Architecture**: A single decoder-only transformer integrating text, image, video, and actions.
- 📚 **EO-Data1.5M Dataset**: 1.5M high-quality interleaved samples (Physical, Reasoning, Spatial, Control).
- 🌀 **Interleaved Pretraining**: Seamless synergy between language and action via autoregressive decoding + flow matching.
- 🤖 **Reasoning-Enhanced Generalization**: Superior generalization through multimodal embodied reasoning and real-robot control.


## 0. Model Architecture



The **EO-1** model is a Vision-Language-Action (VLA) model built on a single unified decoder-only transformer, equipped with a discrete language-modeling head for multimodal embodied reasoning and a continuous flow-matching head for robot action generation. The language instruction, image observations, robot state, and noisy actions are encoded into an interleaved token sequence that is processed by the shared transformer backbone, whose weights are initialized from Qwen2.5-VL. The model is trained on interleaved vision-text-action data with a combination of a flow-matching objective and a next-token-prediction objective, and is capable of seamless embodied reasoning and acting.
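
For intuition, the sketch below shows how the two objectives can be combined over one interleaved sequence: a next-token cross-entropy on text positions plus a rectified-flow-style MSE on action positions. The tensor names, shapes, and the simple sum of the two losses are illustrative assumptions, not the actual EO-1 training code.

```python
import torch
import torch.nn.functional as F

def joint_objective(text_logits, text_labels, pred_velocity, clean_actions, noise):
    """Illustrative joint loss over an interleaved vision-text-action batch.

    text_logits:   [B, T, V] language-modeling head outputs
    text_labels:   [B, T]    next-token targets (-100 on non-text positions)
    pred_velocity: [B, H, A] flow-matching head outputs at the action positions
    clean_actions: [B, H, A] ground-truth action chunk
    noise:         [B, H, A] Gaussian noise used to corrupt the action input
    """
    # Discrete branch: next-token prediction on text positions.
    ntp_loss = F.cross_entropy(
        text_logits.flatten(0, 1), text_labels.flatten(), ignore_index=-100
    )
    # Continuous branch: flow matching with a straight-line (rectified-flow)
    # target velocity pointing from the noise toward the clean action chunk.
    fm_loss = F.mse_loss(pred_velocity, clean_actions - noise)
    return ntp_loss + fm_loss

# Tiny shape check with random tensors (dimensions are arbitrary).
if __name__ == "__main__":
    B, T, V, H, A = 2, 8, 100, 16, 7
    loss = joint_objective(
        torch.randn(B, T, V),
        torch.randint(0, V, (B, T)),
        torch.randn(B, H, A),
        torch.randn(B, H, A),
        torch.randn(B, H, A),
    )
    print(loss.item())
```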

### Input:

Input Type(s):

- Vision: Image Frames, Video
- State: Robot Proprioception
- Language Instruction: Text, Pointing, Bounding Box, etc.

Input Format:

- Vision: Variable number of uint8 image frames or a long video sequence
- State: Floating-point values
- Language Instruction: String

### Output:

Output Type(s): Actions, Language

Output Format: Continuous-value vectors, Discrete Text
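
For concreteness, a hypothetical single-step example of these formats is shown below; the image size and the state/action dimensions are assumptions that depend on the embodiment the checkpoint was trained on, not a specification.

```python
import numpy as np
from PIL import Image

# Hypothetical single-step inputs (dimensions are assumptions, not a spec):
image = Image.new("RGB", (224, 224))          # one RGB observation frame
state = np.zeros(7, dtype=np.float32)         # e.g. a 7-DoF proprioceptive state
instruction = "put the carrot on the plate"   # free-form language instruction

# The model then returns either a text string (embodied reasoning / QA) or a
# continuous action chunk, e.g. a [horizon, action_dim] float array.
```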


## 1. Inference with the Pre-trained Model
**EO-1** is built entirely on 🤗 Hugging Face Transformers and LeRobot, making deployment straightforward and accessible. If your environment has `transformers` and `lerobot` installed, you can load the model and run inference directly with just a few lines of code (about 6.5 GB of GPU memory is required). **EO-1** unifies high-level embodied reasoning with low-level robot control, producing either natural-language outputs or actionable robot commands.

```python
import torch
from transformers import AutoModel, AutoProcessor

# load the model and processor
processor = AutoProcessor.from_pretrained("IPEC-COMMUNITY/eo1-qwen25_vl-bridge", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "IPEC-COMMUNITY/eo1-qwen25_vl-bridge",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

# prepare the model input
batch = {
    "observation.images.image_0": [img],    # PIL.Image observation frame
    "observation.state": [state],           # robot proprioceptive state
    "task": [
        "You are a helpful physical agent equipped with both reasoning and robotic control. "
        "You see the Tic-Tac-Toe board, think strategically, act logically, and block threats."
    ],
}

# run inference: returns either a language response or an action chunk
output = processor.select_action(model, batch)
print(output)
```
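
In a real deployment, `select_action` is typically called inside a closed control loop, re-querying the model with fresh observations and executing the returned actions. The sketch below illustrates this pattern; `get_observation` and `send_action` are hypothetical placeholders for your own camera and robot interfaces, not part of this repository.

```python
import torch

# Minimal closed-loop rollout sketch. `get_observation` and `send_action` are
# hypothetical stand-ins for your own camera/robot interfaces.
def rollout(model, processor, get_observation, send_action, task, steps=200):
    for _ in range(steps):
        img, state = get_observation()          # fresh camera frame + proprioception
        batch = {
            "observation.images.image_0": [img],
            "observation.state": [state],
            "task": [task],
        }
        with torch.no_grad():
            action = processor.select_action(model, batch)
        send_action(action)                     # execute the predicted action (chunk)
```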

## 2. Benchmark


**WidowX Benchmark** (success rate)

| Model        | Put Spoon on Towel | Put Carrot on Plate | Stack Blocks | Put Eggplant in Basket | Overall   |
| ------------ | ------------------ | ------------------- | ------------ | ---------------------- | --------- |
| $\pi_0$      | 0.838              | 0.525               | 0.525        | 0.879                  | 0.692     |
| $\pi_0$-fast | 0.291              | 0.219               | 0.108        | 0.666                  | 0.321     |
| **EO-1**     | **0.636**          | **0.545**           | **0.818**    | **0.909**              | **0.727** |

## 📚 3. Citation

If you find this project useful, please consider citing:

```bibtex
@article{eo1,
  title={EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control},
  author={Delin Qu and Haoming Song and Qizhi Chen and Zhaoqing Chen and Xianqiang Gao and Xinyi Ye and Qi Lv and Modi Shi and Guanghui Ren and Cheng Ruan and Maoqing Yao and Haoran Yang and Jiacheng Bao and Bin Zhao and Dong Wang},
  journal={arXiv preprint},
  year={2025},
  url={https://arxiv.org/abs/2508.21112}
}
```