
	---
	library_name: transformers
	pipeline_tag: image-text-to-text
	tags:
	- vision
	license: mit
	language:
	- en
	base_model:
	- microsoft/Florence-2-base
	---

	# PTA-1: Controlling Computers with Small Models

	PTA (Prompt-to-Action) is a vision language model for computer use applications based on Florence-2.
	With fewer than 300M parameters, it outperforms larger models at GUI text and element localization.
	This enables low-latency computer automation with local execution.

	Model Input: Screenshot + description_of_target_element

	Model Output: BoundingBox for Target Element

	![image](assets/examples.png)


	## How to Get Started with the Model

	Use the code below to get started with the model.


	```python
	import torch
	from PIL import Image
	from transformers import AutoProcessor, AutoModelForCausalLM

	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

	model = AutoModelForCausalLM.from_pretrained("AskUI/PTA-1", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
	processor = AutoProcessor.from_pretrained("AskUI/PTA-1", trust_remote_code=True)

	task_prompt = "<OPEN_VOCABULARY_DETECTION>"
	prompt = task_prompt + "description of the target element"

	image = Image.open("path to screenshot")

	inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)

	generated_ids = model.generate(
	    input_ids=inputs["input_ids"],
	    pixel_values=inputs["pixel_values"],
	    max_new_tokens=1024,
	    do_sample=False,
	    num_beams=3,
	)
	generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

	parsed_answer = processor.post_process_generation(generated_text, task="<OPEN_VOCABULARY_DETECTION>", image_size=(image.width, image.height))

	print(parsed_answer)
	```
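
To turn the prediction into a click target, the box can be reduced to its center point. The snippet below is a minimal sketch, assuming `parsed_answer` follows the Florence-2 convention of a dictionary keyed by the task prompt with a `bboxes` list of `[x1, y1, x2, y2]` pixel coordinates. That layout is an assumption, not something this card states, so inspect the printed `parsed_answer` first. The helper `first_bbox_center` and the example values are hypothetical.

```python
from typing import Optional, Tuple

TASK = "<OPEN_VOCABULARY_DETECTION>"


def first_bbox_center(parsed_answer: dict, task: str = TASK) -> Optional[Tuple[float, float]]:
    """Return the center (x, y) of the first predicted box, or None if no box was predicted."""
    # Assumed layout: {task: {"bboxes": [[x1, y1, x2, y2], ...], ...}}
    bboxes = parsed_answer.get(task, {}).get("bboxes", [])
    if not bboxes:
        return None
    x1, y1, x2, y2 = bboxes[0]
    return ((x1 + x2) / 2, (y1 + y2) / 2)


# Made-up output for illustration only; real values come from the model.
example_answer = {
    TASK: {"bboxes": [[100.0, 40.0, 180.0, 80.0]], "bboxes_labels": ["login button"]}
}
click_point = first_bbox_center(example_answer)
print(click_point)
```

The returned point can then be passed to whatever automation layer performs the click. Returning `None` for an empty prediction lets the caller decide whether to retry or skip.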


	## Evaluation

	Note: This is a first version of our evaluation, run on 999 samples (333 from each dataset).
	We are still running all models on the full test sets. For the subset of models we have already evaluated, we are seeing deviations of ±5%.

	| Model                                         | Parameters | Mean  | agentsea/wave-ui | AskUI/pta-text | ivelin/rico_refexp_combined |
	|-----------------------------------------------|------------|-------|------------------|----------------|-----------------------------|
	| AskUI/PTA-1                                   | 0.27B      | 79.98 | 90.69*           | 76.28          | 72.97*                      |
	| anthropic.claude-3-5-sonnet-20241022-v2:0     | -          | 70.37 | 82.28            | 83.18          | 45.65                       |
	| agentsea/paligemma-3b-ft-waveui-896           | 3.29B      | 57.76 | 70.57*           | 67.87          | 34.83                       |
	| Qwen/Qwen2-VL-7B-Instruct                     | 8.29B      | 57.26 | 47.45            | 60.66          | 63.66                       |
	| agentsea/paligemma-3b-ft-widgetcap-waveui-448 | 3.29B      | 53.15 | 74.17*           | 53.45          | 31.83                       |
	| microsoft/Florence-2-base                     | 0.27B      | 39.44 | 22.22            | 81.38          | 14.71                       |
	| microsoft/Florence-2-large                    | 0.82B      | 36.64 | 14.11            | 81.98          | 13.81                       |
	| EasyOCR                                       | -          | 29.43 | 3.9              | 75.08          | 9.31                        |
	| adept/fuyu-8b                                 | 9.41B      | 26.83 | 5.71             | 71.47          | 3.3                         |
	| Qwen/Qwen2-VL-2B-Instruct                     | 2.21B      | 23.32 | 17.12            | 26.13          | 26.73                       |
	| Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4           | 0.90B      | 18.92 | 10.81            | 22.82          | 23.12                       |


	\* Model is known to have been trained on the train split of that dataset.

	The high benchmark scores for our model are partially due to data bias.
	We therefore expect users to fine-tune the model on data that matches the distribution of their use case.


	#### Metrics

	Click success rate is calculated as the percentage of predicted clicks that land inside the target bounding box.
	If a model predicts a target bounding box instead of a click coordinate, its center is used as its click prediction.
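
The metric described above can be sketched as follows. This is an illustrative implementation, not the evaluation code used for the table. The function names are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels, and treating points on the box edge as hits is an assumption.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates
Point = Tuple[float, float]              # (x, y)


def box_center(box: Box) -> Point:
    """Convert a predicted bounding box into a click point at its center."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def click_in_box(click: Point, target: Box) -> bool:
    """Check whether a click lies inside the target box (edges count as inside here)."""
    x, y = click
    x1, y1, x2, y2 = target
    return x1 <= x <= x2 and y1 <= y <= y2


def click_success_rate(clicks: Sequence[Point], targets: Sequence[Box]) -> float:
    """Percentage of clicks that fall inside their corresponding target box."""
    if len(clicks) != len(targets):
        raise ValueError("clicks and targets must have the same length")
    if not clicks:
        return 0.0
    hits = sum(click_in_box(c, t) for c, t in zip(clicks, targets))
    return 100.0 * hits / len(clicks)


# Made-up data: the first prediction is a box (reduced to its center), the second a raw click.
predicted_clicks = [box_center((0, 0, 10, 10)), (50.0, 50.0)]
target_boxes = [(2, 2, 8, 8), (0, 0, 10, 10)]
rate = click_success_rate(predicted_clicks, target_boxes)
print(rate)
```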