Commit fa04907 by QiushiSun · verified · 1 parent: 627a853

Update README.md

  1. README.md +106 -3
README.md CHANGED
@@ -1,3 +1,106 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ base_model: Qwen/Qwen2-VL-7B-Instruct
+ pipeline_tag: image-text-to-text
+ ---
+
+ # OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
+
+ <div align="center">
+
+ [\[🏠Homepage\]](https://qiushisun.github.io/OS-Genesis-Home/) [\[💻Code\]](https://github.com/OS-Copilot/OS-Genesis) [\[📝Paper\]](https://arxiv.org/abs/2412.19723) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-genesis-6768d4b6fffc431dbf624c2d) [\[🤗Data\]](https://huggingface.co/collections/OS-Copilot/os-genesis-6768d4b6fffc431dbf624c2d)
+
+ </div>
+
+ ## Overview
+ ![os-genesis](https://cdn-uploads.huggingface.co/production/uploads/6064a0eeb1703ddba0d458b9/XvcAh92uvJQglmIu_L_nK.png)
+
+ We introduce OS-Genesis, an interaction-driven pipeline that synthesizes high-quality and diverse GUI agent trajectory data without human supervision. By leveraging reverse task synthesis, OS-Genesis enables effective training of GUI agents to achieve superior performance on dynamic benchmarks such as AndroidWorld and WebArena.
+
+ ## Quick Start
+ OS-Genesis-7B-WA is a web action model fine-tuned from [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).
+
+ ### OS-Genesis WA Family Models
+ The following table gives an overview of the OS-Genesis WA family models, which are evaluated on the WebArena benchmark.
+
+ | Model Name | Base Model | Training Data | HF Link |
+ | :---: | :---: | :---: | :---: |
+ | OS-Genesis-4B-WA | [InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B) | [OS-Genesis-web-training-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-web-data/blob/main/os_genesis_web_training.jsonl) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-4B-WA) |
+ | OS-Genesis-7B-WA | [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | [OS-Genesis-web-training-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-web-data/blob/main/os_genesis_web_training.jsonl) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-7B-WA) |
+ | OS-Genesis-8B-WA | [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) | [OS-Genesis-web-training-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-web-data/blob/main/os_genesis_web_training.jsonl) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-8B-WA) |
+
+ ### Inference Example
+ First, ensure that the necessary dependencies are installed (`accelerate` is required by `device_map="auto"` in the example):
+ ```
+ pip install transformers accelerate
+ pip install qwen-vl-utils
+ ```
+ To evaluate on the WebArena benchmark, refer to the [**evaluation code**](https://github.com/OS-Copilot/OS-Genesis/tree/main/evaluation).
+
+ Inference code example:
+ ```python
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Default: load the model on the available device(s)
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     "OS-Copilot/OS-Genesis-7B-WA", torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained("OS-Copilot/OS-Atlas-Base-7B")
+
+ # Replace {high_level_instruction}, {action_history}, and {a11y_tree}
+ # in the prompt text with your task's values before running.
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "image": "./web_6f93090a-81f6-489e-bb35-1a2838b18c01.png",
+             },
+             {"type": "text", "text": "You are a GUI task expert, I will provide you with a high-level instruction, an action history, a screenshot with its corresponding accessibility tree.\n High-level instruction: {high_level_instruction}\n Action history: {action_history}\n Accessibility tree: {a11y_tree}\n Please generate the low-level thought and action for the next step."},
+         ],
+     }
+ ]
+
+ # Preparation for inference: render the chat template and load the image
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ # Move inputs to the device the model was placed on by device_map="auto"
+ inputs = inputs.to(model.device)
+
+ # Inference: generate the output
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+
+ # Drop the prompt tokens so only the newly generated tokens are decoded
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ # <|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>
+ ```
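
The prompt in the inference example contains `{high_level_instruction}`, `{action_history}`, and `{a11y_tree}` inside a plain string, so they reach the model literally unless they are filled in first. A minimal sketch of one way to fill them follows. The helper name `build_messages` and the sample values are hypothetical and not part of the OS-Genesis code; the exact format the model expects for the action history and accessibility tree is defined by the evaluation code, not by this sketch.

```python
# Prompt wording copied from the inference example; placeholders are filled below.
PROMPT_TEMPLATE = (
    "You are a GUI task expert, I will provide you with a high-level instruction, "
    "an action history, a screenshot with its corresponding accessibility tree.\n"
    " High-level instruction: {high_level_instruction}\n"
    " Action history: {action_history}\n"
    " Accessibility tree: {a11y_tree}\n"
    " Please generate the low-level thought and action for the next step."
)


def build_messages(image_path, high_level_instruction, action_history, a11y_tree):
    """Hypothetical helper: return a single-turn message list in the example's layout."""
    prompt = PROMPT_TEMPLATE.format(
        high_level_instruction=high_level_instruction,
        action_history=action_history,
        a11y_tree=a11y_tree,
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Illustrative values only; they are assumptions, not taken from the OS-Genesis data.
messages = build_messages(
    image_path="./screenshot.png",
    high_level_instruction="Open the settings page",
    action_history="None",
    a11y_tree="[1] RootWebArea 'Home'",
)
```

The resulting `messages` list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the example.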
+
+ ## Citation
+ If you find this repository helpful, feel free to cite our paper:
+ ```bibtex
+ @article{sun2024osgenesis,
+   title={OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis},
+   author={Qiushi Sun and Kanzhi Cheng and Zichen Ding and Chuanyang Jin and Yian Wang and Fangzhi Xu and Zhenyu Wu and Chengyou Jia and Liheng Chen and Zhoumianze Liu and Ben Kao and Guohao Li and Junxian He and Yu Qiao and Zhiyong Wu},
+   journal={arXiv preprint arXiv:2412.19723},
+   year={2024}
+ }
+ ```