---
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- computer use
license: mit
language:
- en
base_model:
- microsoft/Florence-2-base
---

# PTA-1: Controlling Computers with Small Models

PTA (Prompt-to-Automation) is a vision-language model for computer and phone automation, based on Florence-2.
With only 270M parameters, it outperforms much larger models at localizing GUI text and elements.
This enables low-latency computer automation with local execution.

▶️ Try the demo at: [AskUI/PTA-1](https://huggingface.co/spaces/AskUI/PTA-1)

**Model Input:** Screenshot + description_of_target_element

**Model Output:** BoundingBox for Target Element

![image](assets/examples.png)


## How to Get Started with the Model

Use the code below to get started with the model.

*Requirements:* torch, timm, einops, Pillow, transformers


```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Use the GPU with half precision when available, otherwise CPU with full precision.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained("AskUI/PTA-1", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("AskUI/PTA-1", trust_remote_code=True)

# The prompt is the task token followed by a description of the target element.
task_prompt = "<OPEN_VOCABULARY_DETECTION>"
prompt = task_prompt + "description of the target element"

image = Image.open("path to screenshot").convert("RGB")

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)

# Deterministic beam search (no sampling).
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)
# Keep special tokens: the post-processing step parses them.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Convert the generated text into bounding boxes in pixel coordinates of the input image.
parsed_answer = processor.post_process_generation(generated_text, task="<OPEN_VOCABULARY_DETECTION>", image_size=(image.width, image.height))

print(parsed_answer)
```
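To act on the result, the predicted bounding box usually has to be turned into a click coordinate. The following is a minimal sketch of that step, not part of the model's API. It assumes that `parsed_answer` follows the Florence-2 convention of a dictionary keyed by the task token with a `bboxes` list of `[x1, y1, x2, y2]` pixel coordinates; verify this against the actual output before relying on it. The helper names `bbox_center` and `first_click_point` and the sample values are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple


def bbox_center(bbox: Sequence[float]) -> Tuple[float, float]:
    """Return the center (x, y) of a box given as [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def first_click_point(
    parsed_answer: Dict,
    task: str = "<OPEN_VOCABULARY_DETECTION>",
) -> Optional[Tuple[float, float]]:
    """Return the center of the first predicted box, or None if there is none.

    Assumption: parsed_answer looks like {task: {"bboxes": [[x1, y1, x2, y2], ...]}}.
    """
    bboxes = parsed_answer.get(task, {}).get("bboxes", [])
    if not bboxes:
        return None
    return bbox_center(bboxes[0])


# Hand-written sample in the assumed output format (not real model output).
example = {
    "<OPEN_VOCABULARY_DETECTION>": {
        "bboxes": [[100.0, 40.0, 180.0, 80.0]],
        "bboxes_labels": ["login button"],
    }
}
print(first_click_point(example))  # (140.0, 60.0)
```

The returned point can then be passed to whatever automation layer performs the click. Returning `None` for an empty prediction keeps the "no element found" case explicit for the caller.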


## Evaluation

**Note:** This is a first version of our evaluation, based on 999 samples (333 samples from each dataset).
Evaluation on the full test sets is still in progress. For the subset of models already evaluated on them, we are seeing deviations of ±5% from the scores below.

| Model                                      | Parameters | Mean   | agentsea/wave-ui | AskUI/pta-text | ivelin/rico_refexp_combined |
|--------------------------------------------|------------|--------|------------------|----------------|-----------------------------|
| AskUI/PTA-1                                | 0.27B      | 79.98  | 90.69*           | 76.28          | 72.97*                      |
| anthropic.claude-3-5-sonnet-20241022-v2:0  | -          | 70.37  | 82.28            | 83.18          | 45.65                       |
| agentsea/paligemma-3b-ft-waveui-896        | 3.29B      | 57.76  | 70.57*           | 67.87          | 34.83                       |
| Qwen/Qwen2-VL-7B-Instruct                  | 8.29B      | 57.26  | 47.45            | 60.66          | 63.66                       |
| agentsea/paligemma-3b-ft-widgetcap-waveui-448 | 3.29B   | 53.15  | 74.17*           | 53.45          | 31.83                       |
| microsoft/Florence-2-base                  | 0.27B      | 39.44  | 22.22            | 81.38          | 14.71                       |
| microsoft/Florence-2-large                 | 0.82B      | 36.64  | 14.11            | 81.98          | 13.81                       |
| EasyOCR                                    | -          | 29.43  | 3.9              | 75.08          | 9.31                        |
| adept/fuyu-8b                              | 9.41B      | 26.83  | 5.71             | 71.47          | 3.3                         |
| Qwen/Qwen2-VL-2B-Instruct                  | 2.21B      | 23.32  | 17.12            | 26.13          | 26.73                       |
| Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4        | 0.90B      | 18.92  | 10.81            | 22.82          | 23.12                       |


\* The model is known to have been trained on the train split of that dataset.

The high benchmark scores for our model are partly due to data bias.
We therefore expect that users will need to fine-tune the model on data that matches the distribution of their own use case.


#### Metrics

All scores are click success rates in percent: the number of clicks that land inside the target bounding box divided by the total number of clicks.
If a model predicts a bounding box instead of a click coordinate, the center of the box is used as its click prediction.
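The metric can be sketched in a few lines of Python. This is an illustration of the definition above, not the evaluation code used for the table. The function names are hypothetical, and treating points that lie exactly on the box border as hits is an assumption.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def to_click(prediction: Sequence[float]) -> Tuple[float, float]:
    """Map a prediction to a click point: points pass through, boxes use their center."""
    if len(prediction) == 2:
        return (float(prediction[0]), float(prediction[1]))
    if len(prediction) == 4:
        x1, y1, x2, y2 = prediction
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    raise ValueError("prediction must be (x, y) or (x1, y1, x2, y2)")


def click_success_rate(predictions: Sequence[Sequence[float]], targets: Sequence[Box]) -> float:
    """Percentage of clicks that fall inside their target box (border inclusive, by assumption)."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and of equal length")
    hits = 0
    for prediction, (x1, y1, x2, y2) in zip(predictions, targets):
        x, y = to_click(prediction)
        if x1 <= x <= x2 and y1 <= y <= y2:
            hits += 1
    return 100.0 * hits / len(predictions)


# Made-up toy data: a point hit, a box whose center hits, and a box whose center misses.
targets = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(5, 5), (2, 2, 6, 6), (0, 0, 10, 10)]
print(round(click_success_rate(predictions, targets), 2))  # 66.67
```

Only the predicted center has to fall inside the target for a bounding box prediction to count, so a box that overlaps the target poorly can still register as a successful click.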