	---
	license: apache-2.0
	library_name: transformers
	base_model: Qwen/Qwen2-VL-7B-Instruct
	pipeline_tag: image-text-to-text
	---

	# OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

	<div align="center">

	[\[🏠Homepage\]](https://qiushisun.github.io/OS-Genesis-Home/) [\[💻Code\]](https://github.com/OS-Copilot/OS-Genesis) [\[📝Paper\]](https://arxiv.org/abs/2412.19723) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-genesis-6768d4b6fffc431dbf624c2d) [\[🤗Data\]](https://huggingface.co/collections/OS-Copilot/os-genesis-6768d4b6fffc431dbf624c2d)

	</div>

	## Overview
	![os-genesis](https://cdn-uploads.huggingface.co/production/uploads/6064a0eeb1703ddba0d458b9/XvcAh92uvJQglmIu_L_nK.png)

	We introduce OS-Genesis, an interaction-driven pipeline that synthesizes high-quality and diverse GUI agent trajectory data without human supervision. By leveraging reverse task synthesis, OS-Genesis enables effective training of GUI agents to achieve superior performance on dynamic benchmarks such as AndroidWorld and WebArena.

	## Quick Start
	OS-Genesis-7B-AC is a mobile action model fine-tuned from [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

	### OS-Genesis AC Family Models
	The following table lists the OS-Genesis AC family models, which are the ones evaluated on the AndroidControl benchmark.

	| Model Name | Base Model | Training Data | HF Link |
	| :---: | :---: | :---: | :---: |
	| OS-Genesis-4B-AC | [InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B) | [OS-Genesis-ac-training-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-mobile-data/blob/main/os_genesis_ac_training_data.jsonl) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-4B-AC) |
	| OS-Genesis-7B-AC | [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | [OS-Genesis-ac-training-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-mobile-data/blob/main/os_genesis_ac_training_data.jsonl) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-7B-AC) |
	| OS-Genesis-8B-AC | [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) | [OS-Genesis-ac-training-data](https://huggingface.co/datasets/OS-Copilot/OS-Genesis-mobile-data/blob/main/os_genesis_ac_training_data.jsonl) | [🤗 link](https://huggingface.co/OS-Copilot/OS-Genesis-8B-AC) |
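The training data linked in the table is a single JSON Lines file. As a minimal sketch, assuming the file has already been downloaded locally, the helper below parses it one record at a time. The name `iter_jsonl` is hypothetical, and the record schema is not described in this README, so each line is simply returned as a dictionary.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator

# Optional download step (needs `pip install huggingface_hub` and network access).
# The repo id and filename are taken from the link in the table above:
#
#   from huggingface_hub import hf_hub_download
#   path = hf_hub_download(
#       repo_id="OS-Copilot/OS-Genesis-mobile-data",
#       filename="os_genesis_ac_training_data.jsonl",
#       repo_type="dataset",
#   )


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Streaming the file line by line avoids holding the full dataset in memory, for example `first = next(iter_jsonl(path))` to inspect the keys of a single record.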


	### Inference Example
	First, install the required dependencies (`accelerate` is needed for `device_map="auto"`):
	```bash
	pip install transformers accelerate
	pip install qwen-vl-utils
	```
	To evaluate on the AndroidControl benchmark, please refer to the [evaluation code](https://github.com/OS-Copilot/OS-Genesis/tree/main/evaluation/android_control).

	Inference code example:
	```python
	from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
	from qwen_vl_utils import process_vision_info

	# Default: load the model on the available device(s)
	model = Qwen2VLForConditionalGeneration.from_pretrained(
	    "OS-Copilot/OS-Genesis-7B-AC", torch_dtype="auto", device_map="auto"
	)
	processor = AutoProcessor.from_pretrained("OS-Copilot/OS-Atlas-Base-7B")

	# The {high_level_instruction}, {action_history}, and {a11y_tree} placeholders
	# are literal text here; substitute your own values before running.
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "image",
	                "image": "./web_6f93090a-81f6-489e-bb35-1a2838b18c01.png",
	            },
	            {"type": "text", "text": "You are a GUI task expert, I will provide you with a high-level instruction, an action history, a screenshot with its corresponding accessibility tree.\n High-level instruction: {high_level_instruction}\n Action history: {action_history}\n Accessibility tree: {a11y_tree}\n Please generate the low-level thought and action for the next step."},
	        ],
	    }
	]

	# Preparation for inference
	text = processor.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	)
	image_inputs, video_inputs = process_vision_info(messages)
	inputs = processor(
	    text=[text],
	    images=image_inputs,
	    videos=video_inputs,
	    padding=True,
	    return_tensors="pt",
	)
	# Move inputs to the device that device_map="auto" selected for the model
	inputs = inputs.to(model.device)

	# Inference: generate the output
	generated_ids = model.generate(**inputs, max_new_tokens=128)

	# Drop the prompt tokens so that only the newly generated tokens are decoded
	generated_ids_trimmed = [
	    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]

	output_text = processor.batch_decode(
	    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
	)
	print(output_text)
	# <|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>
	```
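The prompt in the example above contains `{high_level_instruction}`, `{action_history}`, and `{a11y_tree}` as literal placeholders. A minimal sketch of one way to fill them in before building `messages` follows. The names `PROMPT_TEMPLATE` and `build_messages` are hypothetical, and how the action history and accessibility tree should be serialized into strings is an assumption here; the linked evaluation code is the authoritative reference for the exact format.

```python
from typing import Any, Dict, List

# Prompt text copied from the inference example; only the placeholders are filled in.
PROMPT_TEMPLATE = (
    "You are a GUI task expert, I will provide you with a high-level instruction, "
    "an action history, a screenshot with its corresponding accessibility tree.\n"
    " High-level instruction: {high_level_instruction}\n"
    " Action history: {action_history}\n"
    " Accessibility tree: {a11y_tree}\n"
    " Please generate the low-level thought and action for the next step."
)


def build_messages(
    image_path: str,
    high_level_instruction: str,
    action_history: str,
    a11y_tree: str,
) -> List[Dict[str, Any]]:
    """Return a single-turn chat message list with the placeholders substituted."""
    prompt = PROMPT_TEMPLATE.format(
        high_level_instruction=high_level_instruction,
        action_history=action_history,
        a11y_tree=a11y_tree,
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

The returned list has the same shape as `messages` in the example, so it can be passed to `processor.apply_chat_template` and `process_vision_info` unchanged.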



	## Citation
	If you find this repository helpful, feel free to cite our paper:
	```bibtex
	@article{sun2024genesis,
	title={OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis},
	author={Sun, Qiushi and Cheng, Kanzhi and Ding, Zichen and Jin, Chuanyang and Wang, Yian and Xu, Fangzhi and Wu, Zhenyu and Jia, Chengyou and Chen, Liheng and Liu, Zhoumianze and others},
	journal={arXiv preprint arXiv:2412.19723},
	year={2024}
	}
	```