QiushiSun committed on
Commit 5d7b271 · verified · 1 Parent(s): 4bde4fb

Update README.md

Files changed (1)
  1. README.md +24 -24
README.md CHANGED
@@ -5,40 +5,42 @@ base_model: OpenGVLab/InternVL2-4B
 pipeline_tag: image-text-to-text
 ---
 
-# OS-Atlas: A Foundation Action Model For Generalist GUI Agents
+# OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
 
 <div align="center">
 
-[\[🏠Homepage\]](https://osatlas.github.io) [\[💻Code\]](https://github.com/OS-Copilot/OS-Atlas) [\[🚀Quick Start\]](#quick-start) [\[📝Paper\]](https://arxiv.org/abs/2410.23218) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-atlas-67246e44003a1dfcc5d0d045) [\[🤗Data\]](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data) [\[🤗ScreenSpot-v2\]](https://huggingface.co/datasets/OS-Copilot/ScreenSpot-v2)
+[\[🏠Homepage\]](https://qiushisun.github.io/OS-Genesis-Home/) [\[💻Code\]](https://github.com/OS-Copilot/OS-Genesis) [\[📝Paper\]](https://arxiv.org/abs/2412.19723) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-genesis-6768d4b6fffc431dbf624c2d) [\[🤗Data\]](https://huggingface.co/collections/OS-Copilot/os-genesis-6768d4b6fffc431dbf624c2d)
 
 </div>
 
 ## Overview
-![os-atlas](https://github.com/user-attachments/assets/cf2ee020-5e15-4087-9a7e-75cc43662494)
+![os-genesis](https://cdn-uploads.huggingface.co/production/uploads/6064a0eeb1703ddba0d458b9/XvcAh92uvJQglmIu_L_nK.png)
 
-OS-Atlas provides a series of models specifically designed for GUI agents.
-
-For GUI grounding tasks, you can use:
-- [OS-Atlas-Base-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-7B)
-- [OS-Atlas-Base-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-4B)
-
-For generating single-step actions in GUI agent tasks, you can use:
-- [OS-Atlas-Pro-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Pro-7B)
-- [OS-Atlas-Pro-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Pro-4B)
-
-## Quick Start
-OS-Atlas-Base-4B is a GUI grounding model finetuned from [InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B).
-
-**Notes:** Our models accept images of any size as input. The model outputs are normalized to relative coordinates within a 0-1000 range (either a center point or a bounding box defined by top-left and bottom-right coordinates). For visualization, please remember to convert these relative coordinates back to the original image dimensions.
+We introduce OS-Genesis, an interaction-driven pipeline that synthesizes high-quality and diverse GUI agent trajectory data without human supervision. By leveraging reverse task synthesis, OS-Genesis enables effective training of GUI agents to achieve superior performance on dynamic benchmarks such as AndroidWorld and WebArena.
+
+## Quick Start
+OS-Genesis-4B-AC is a mobile action model finetuned from [InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B).
+
+### OS-Genesis AC Family Models
+The following table provides an overview of the OS-Genesis AC family models used for evaluating the AndroidControl benchmark.
+
+| Model Name | Base Model | Training Data | HF Link |
+| :---: | :---: | :---: | :---: |
+| OS-Genesis-4B-AC | [InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B) | [OS-Genesis-mobile-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-mobile-data) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-4B-AC) |
+| OS-Genesis-7B-AC | [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | [OS-Genesis-mobile-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-mobile-data) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-7B-AC) |
+| OS-Genesis-8B-AC | [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) | [OS-Genesis-mobile-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-mobile-data) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-8B-AC) |
 
 ### Inference Example
 First, install the `transformers` library:
+
 ```
 pip install transformers
 ```
-For additional dependencies, please refer to the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html)
 
-Then download the [example image](https://github.com/OS-Copilot/OS-Atlas/blob/main/examples/images/web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png) and save it to the current directory.
+For additional dependencies, please refer to the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html).
+
+For evaluating the AndroidControl Benchmark, please refer to the [**evaluation code**](https://github.com/OS-Copilot/OS-Genesis/tree/main/evaluation/android_control).
 
 Inference code example:
  ```python
@@ -135,21 +137,19 @@ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast
 pixel_values = load_image('./web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png', max_num=6).to(torch.bfloat16).cuda()
 generation_config = dict(max_new_tokens=1024, do_sample=True)
 
-question = "In the screenshot of this web page, please give me the coordinates of the element I want to click on according to my instructions(with point).\n\"'Champions League' link\""
+question = "<image> You are a GUI task expert, I will provide you with a high-level instruction, an action history, a screenshot with its corresponding accessibility tree.\n High-level instruction: {high_level_instruction}\n Action history: {action_history}\n Accessibility tree: {a11y_tree}\n Please generate the low-level thought and action for the next step."
 response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 ```
 
-
-
 ## Citation
 If you find this repository helpful, feel free to cite our paper:
 ```bibtex
-@article{wu2024atlas,
-  title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
-  author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},
-  journal={arXiv preprint arXiv:2410.23218},
-  year={2024}
-}
+@article{sun2024osgenesis,
+  title={OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis},
+  author={Qiushi Sun and Kanzhi Cheng and Zichen Ding and Chuanyang Jin and Yian Wang and Fangzhi Xu and Zhenyu Wu and Chengyou Jia and Liheng Chen and Zhoumianze Liu and Ben Kao and Guohao Li and Junxian He and Yu Qiao and Zhiyong Wu},
+  journal={arXiv preprint arXiv:2412.19723},
+  year={2024}
+}
  ```
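Note for readers of this diff: the hunks above show only the changed lines of the README's inference example, so the `model`, `tokenizer`, and `load_image` setup lives in unchanged lines that the diff elides. Below is a minimal sketch of that setup, assuming the standard InternVL2 loading pattern this README follows; the `AutoModel` keyword arguments and `use_fast=False` come from the InternVL2 documentation linked in the README, not from this commit, and the repo id is taken from the model table above.

```python
# Minimal sketch, not part of this commit: typical InternVL2-style setup for
# the elided portion of the README example. Kwargs follow the InternVL2 docs.
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OS-Copilot/OS-Genesis-4B-AC'  # repo id from the model table in the diff
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,   # the example casts pixel_values to bfloat16
    low_cpu_mem_usage=True,
    trust_remote_code=True,       # InternVL2-family checkpoints ship custom modeling code
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```

The `load_image` helper used by the example is likewise defined in the elided part of the README, following the InternVL2 preprocessing recipe.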
 
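The `question` string added in the second hunk is a prompt template with `{high_level_instruction}`, `{action_history}`, and `{a11y_tree}` placeholders. Here is a short sketch of filling it before the `model.chat` call shown in the diff; all three example values below are invented for illustration.

```python
# Sketch: filling the placeholders of the prompt template introduced by this
# commit. The instruction, history, and accessibility tree are made-up values.
template = (
    "<image> You are a GUI task expert, I will provide you with a high-level "
    "instruction, an action history, a screenshot with its corresponding "
    "accessibility tree.\n High-level instruction: {high_level_instruction}\n "
    "Action history: {action_history}\n Accessibility tree: {a11y_tree}\n "
    "Please generate the low-level thought and action for the next step."
)
question = template.format(
    high_level_instruction="Turn on dark mode in the Settings app.",
    action_history="Step 1: opened the Settings app.",
    a11y_tree="[0] Settings > [3] Display (clickable) ...",
)
# Then run the call exactly as in the README example:
# response, history = model.chat(tokenizer, pixel_values, question,
#                                generation_config, history=None, return_history=True)
```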