maxiw committed on
Commit
d3fc6a8
1 Parent(s): 4b44d78

Update README.md

Files changed (1)
  1. README.md +31 -148
README.md CHANGED
@@ -1,9 +1,14 @@
1
- ---
2
- library_name: transformers
3
- pipeline_tag: image-text-to-text
4
- tags:
5
- - vision
6
- ---
7
 
8
  # PTA-1: Controlling Computers with Small Models
9
 
@@ -15,63 +20,6 @@ This allows low latency computer automations with local execution.
15
 
16
  **Model Output:** BoundingBox for Target Element
17
 
18
- ## Model Details
19
-
20
- ### Model Description
21
-
22
- <!-- Provide a longer summary of what this model is. -->
23
-
24
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
25
-
26
- - **Developed by:** [More Information Needed]
27
- - **Funded by [optional]:** [More Information Needed]
28
- - **Shared by [optional]:** [More Information Needed]
29
- - **Model type:** [More Information Needed]
30
- - **Language(s) (NLP):** [More Information Needed]
31
- - **License:** [More Information Needed]
32
- - **Finetuned from model [optional]:** [More Information Needed]
33
-
34
- ### Model Sources [optional]
35
-
36
- <!-- Provide the basic links for the model. -->
37
-
38
- - **Repository:** [More Information Needed]
39
- - **Paper [optional]:** [More Information Needed]
40
- - **Demo [optional]:** [More Information Needed]
41
-
42
- ## Uses
43
-
44
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
45
-
46
- ### Direct Use
47
-
48
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
49
-
50
- [More Information Needed]
51
-
52
- ### Downstream Use [optional]
53
-
54
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
55
-
56
- [More Information Needed]
57
-
58
- ### Out-of-Scope Use
59
-
60
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
61
-
62
- [More Information Needed]
63
-
64
- ## Bias, Risks, and Limitations
65
-
66
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
67
-
68
- [More Information Needed]
69
-
70
- ### Recommendations
71
-
72
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
73
-
74
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
75
 
76
  ## How to Get Started with the Model
77
 
@@ -110,98 +58,33 @@ print(parsed_answer)
110
  ```
111
 
112
 
113
-
114
-
115
- ## Training Details
116
-
117
- ### Training Data
118
-
119
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
120
-
121
- [More Information Needed]
122
-
123
- ### Training Procedure
124
-
125
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
126
-
127
- #### Preprocessing [optional]
128
-
129
- [More Information Needed]
130
-
131
-
132
- #### Training Hyperparameters
133
-
134
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
135
-
136
- #### Speeds, Sizes, Times [optional]
137
-
138
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
139
-
140
- [More Information Needed]
141
-
142
  ## Evaluation
143
 
144
- <!-- This section describes the evaluation protocols and provides the results. -->
145
-
146
- ### Testing Data, Factors & Metrics
147
 
148
- #### Testing Data
149
 
150
- <!-- This should link to a Dataset Card if possible. -->
151
 
152
- [More Information Needed]
153
 
154
- #### Factors
 
155
 
156
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
157
-
158
- [More Information Needed]
159
 
160
  #### Metrics
161
 
162
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
163
-
164
- [More Information Needed]
165
-
166
- ### Results
167
-
168
- [More Information Needed]
169
-
170
- #### Summary
171
-
172
-
173
- ## Technical Specifications [optional]
174
-
175
- ### Model Architecture and Objective
176
-
177
- [More Information Needed]
178
-
179
- ### Compute Infrastructure
180
-
181
- [More Information Needed]
182
-
183
- #### Hardware
184
-
185
- [More Information Needed]
186
-
187
- #### Software
188
-
189
- [More Information Needed]
190
-
191
- ## Citation [optional]
192
-
193
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
194
-
195
- **BibTeX:**
196
-
197
- [More Information Needed]
198
-
199
- **APA:**
200
-
201
- [More Information Needed]
202
-
203
- ## Glossary [optional]
204
-
205
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
206
-
207
- [More Information Needed]
 
1
+ ---
2
+ library_name: transformers
3
+ pipeline_tag: image-text-to-text
4
+ tags:
5
+ - vision
6
+ license: mit
7
+ language:
8
+ - en
9
+ base_model:
10
+ - microsoft/Florence-2-base
11
+ ---
12
 
13
  # PTA-1: Controlling Computers with Small Models
14
 
 
20
 
21
  **Model Output:** BoundingBox for Target Element
22
 
23
 
24
  ## How to Get Started with the Model
25
 
 
58
  ```
59
 
60
 
61
  ## Evaluation
62
 
63
+ **Note:** This is a first version of our evaluation with 999 samples (333 samples from each dataset).
64
+ We are still running all models on the full test sets. So far we see deviations of ±5% for the subset of models we have already evaluated.
 
65
 
66
+ | Model | Parameters | Mean | agentsea/wave-ui | AskUI/pta-text | ivelin/rico_refexp_combined |
67
+ |--------------------------------------------|------------|--------|------------------|----------------|-----------------------------|
68
+ | AskUI/PTA-1 | 0.27B | 79.98 | 90.69* | 76.28 | 72.97* |
69
+ | anthropic.claude-3-5-sonnet-20241022-v2:0 | - | 70.37 | 82.28 | 83.18 | 45.65 |
70
+ | agentsea/paligemma-3b-ft-waveui-896 | 3.29B | 57.76 | 70.57* | 67.87 | 34.83 |
71
+ | Qwen/Qwen2-VL-7B-Instruct | 8.29B | 57.26 | 47.45 | 60.66 | 63.66 |
72
+ | agentsea/paligemma-3b-ft-widgetcap-waveui-448 | 3.29B | 53.15 | 74.17* | 53.45 | 31.83 |
73
+ | microsoft/Florence-2-base | 0.27B | 39.44 | 22.22 | 81.38 | 14.71 |
74
+ | microsoft/Florence-2-large | 0.82B | 36.64 | 14.11 | 81.98 | 13.81 |
75
+ | EasyOCR | - | 29.43 | 3.9 | 75.08 | 9.31 |
76
+ | adept/fuyu-8b | 9.41B | 26.83 | 5.71 | 71.47 | 3.3 |
77
+ | Qwen/Qwen2-VL-2B-Instruct | 2.21B | 23.32 | 17.12 | 26.13 | 26.73 |
78
+ | Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4 | 0.90B | 18.92 | 10.81 | 22.82 | 23.12 |
79
 
 
80
 
81
+ \* Model is known to have been trained on the train split of that dataset.
82
 
83
+ The high benchmark scores for our model are partially due to data bias.
84
+ Therefore, we expect users to fine-tune the model on the data distribution of their use case.
85
 
 
 
 
86
 
87
  #### Metrics
88
 
89
+ Click success rate is the percentage of predicted clicks that land inside the target bounding box.
90
+ If a model predicts a target bounding box instead of a click coordinate, the center of that box is used as its click prediction.
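
The metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for the table: the function names, the `(x_min, y_min, x_max, y_max)` box layout, and the choice to count clicks exactly on the box border as hits are all assumptions.

```python
from typing import Sequence, Tuple, Union

# Assumed layouts (not specified in the README):
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]              # (x, y)


def to_click(prediction: Union[Box, Point]) -> Point:
    """Return a click point; a predicted box is reduced to its center."""
    if len(prediction) == 4:
        x_min, y_min, x_max, y_max = prediction
        return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)
    return (prediction[0], prediction[1])


def click_success_rate(
    predictions: Sequence[Union[Box, Point]],
    targets: Sequence[Box],
) -> float:
    """Percentage of predicted clicks that fall inside the target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")

    hits = 0
    for prediction, (x_min, y_min, x_max, y_max) in zip(predictions, targets):
        x, y = to_click(prediction)
        # Border clicks count as hits here (an assumption).
        if x_min <= x <= x_max and y_min <= y <= y_max:
            hits += 1
    return 100.0 * hits / len(targets)


if __name__ == "__main__":
    example_predictions = [(10, 10, 30, 30), (5, 5), (100, 100)]
    example_targets = [(0, 0, 50, 50), (0, 0, 4, 4), (90, 90, 110, 110)]
    print(round(click_success_rate(example_predictions, example_targets), 2))
```

In this example, the first prediction is a box whose center `(20, 20)` lies inside its target, the second click misses, and the third hits, so two of the three samples count as successes.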