QiushiSun committed (verified) · Commit 4a092aa · Parent(s): 503f7c0

Update README.md

Files changed (1): README.md (+155, −3)
README.md CHANGED
---
license: apache-2.0
library_name: transformers
base_model: OpenGVLab/InternVL2-4B
pipeline_tag: image-text-to-text
---

# OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

<div align="center">

[\[🏠Homepage\]](https://qiushisun.github.io/OS-Genesis-Home/) [\[💻Code\]](https://github.com/OS-Copilot/OS-Genesis) [\[📝Paper\]](https://arxiv.org/abs/2412.19723) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-genesis-6768d4b6fffc431dbf624c2d) [\[🤗Data\]](https://huggingface.co/collections/OS-Copilot/os-genesis-6768d4b6fffc431dbf624c2d)

</div>

## Overview
![os-genesis](https://cdn-uploads.huggingface.co/production/uploads/6064a0eeb1703ddba0d458b9/XvcAh92uvJQglmIu_L_nK.png)

We introduce OS-Genesis, an interaction-driven pipeline that synthesizes high-quality and diverse GUI agent trajectory data without human supervision. By leveraging reverse task synthesis, OS-Genesis enables effective training of GUI agents to achieve superior performance on dynamic benchmarks such as AndroidWorld and WebArena.
## Quick Start
OS-Genesis-4B-AW is a mobile action model finetuned from [InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B).

### OS-Genesis AW Family Models
In the following table, we provide an overview of the OS-Genesis AW Family Models used for evaluating the AndroidWorld Benchmark.

| Model Name | Base Model | Training Data | HF Link |
| :-------------: | :-------------------: | :-------------------: | :-------------------: |
| OS-Genesis-4B-AW | [InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B) | [OS-Genesis-aw-training-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-mobile-data/blob/main/os_genesis_aw_training_data.jsonl) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-4B-AW) |
| OS-Genesis-7B-AW | [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | [OS-Genesis-aw-training-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-mobile-data/blob/main/os_genesis_aw_training_data.jsonl) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-7B-AW) |
| OS-Genesis-8B-AW | [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) | [OS-Genesis-aw-training-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-mobile-data/blob/main/os_genesis_aw_training_data.jsonl) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-8B-AW) |

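To fetch one of these checkpoints locally before running inference, here is a minimal sketch using `huggingface_hub` (installing `huggingface_hub` is assumed; the repo IDs come from the table above):

```python
# Download an AW-family checkpoint from the Hub; pick any repo ID from the table.
# Assumes `pip install huggingface_hub` has been run.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="OS-Copilot/OS-Genesis-4B-AW")
print(f"Checkpoint files are in: {local_dir}")
```
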
### Inference Example
First, install the `transformers` library:

```bash
pip install transformers
```

For additional dependencies, please refer to the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html).

For evaluating the AndroidWorld Benchmark, please refer to the [**evaluation code**](https://github.com/OS-Copilot/OS-Genesis/tree/main/evaluation/android_world).

Inference code example:
```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    # Convert to RGB, resize to a square tile, and normalize with ImageNet statistics.
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    # Pick the tiling grid whose aspect ratio best matches the input image.
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # On a tie, prefer the grid with more tiles when the image is large enough.
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    # InternVL2's dynamic-resolution preprocessing: split the image into up to
    # `max_num` square tiles arranged in the grid closest to its aspect ratio.
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate the candidate tiling grids
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        # append a full-image thumbnail so the model also sees a global view
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    # Load a screenshot, tile it, and stack the tiles into a single tensor.
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# To load the model across multiple GPUs, see the `Multiple GPUs` section of the
# InternVL2 model card.
path = 'OS-Copilot/OS-Genesis-4B-AW'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# `{high_level_instruction}`, `{action_history}`, and `{a11y_tree}` are template
# placeholders; fill them in before calling `model.chat` (see the sketch below).
question = "<image>\nYou are a GUI task expert, I will provide you with a high-level instruction, an action history, a screenshot with its corresponding accessibility tree.\n High-level instruction: {high_level_instruction}\n Action history: {action_history}\n Accessibility tree: {a11y_tree}\n Please generate the low-level thought and action for the next step."
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```
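
The `question` string above is a template. Before calling `model.chat`, substitute the three placeholders with the actual task context; here is a minimal sketch with purely illustrative values (the instruction, history, and accessibility tree below are made up, not taken from the benchmark):

```python
# Hypothetical task context; in practice these come from the AndroidWorld
# environment: the task instruction, the actions executed so far, and the
# current accessibility tree of the screen.
high_level_instruction = "Open the Settings app and enable Wi-Fi."
action_history = "Step 1: opened the app drawer."
a11y_tree = "[0] Button 'Settings'; [1] Switch 'Wi-Fi' (off); ..."

filled_question = question.format(
    high_level_instruction=high_level_instruction,
    action_history=action_history,
    a11y_tree=a11y_tree,
)
response, history = model.chat(tokenizer, pixel_values, filled_question,
                               generation_config, history=None, return_history=True)
print(f'Assistant: {response}')
```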

## Citation
If you find this repository helpful, feel free to cite our paper:
```bibtex
@article{sun2024osgenesis,
  title={OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis},
  author={Qiushi Sun and Kanzhi Cheng and Zichen Ding and Chuanyang Jin and Yian Wang and Fangzhi Xu and Zhenyu Wu and Chengyou Jia and Liheng Chen and Zhoumianze Liu and Ben Kao and Guohao Li and Junxian He and Yu Qiao and Zhiyong Wu},
  journal={arXiv preprint arXiv:2412.19723},
  year={2024}
}
```