5

juliet shen

juliets

AI & ML interests

None yet

Recent Activity

liked a Space 10 days ago

UNESCO/nllb

Reacted to singhsidhukuldeep's post with 🔥 26 days ago

liked a model about 2 months ago

meta-llama/Llama-Guard-3-1B

juliets's activity

liked a Space 10 days ago

Running on Zero

NLLB

Reacted to singhsidhukuldeep's post with 🔥 26 days ago

Post

2570

Good folks from @Microsoft have released an exciting breakthrough in GUI automation!

OmniParser – a game-changing approach for pure vision-based GUI agents that works across multiple platforms and applications.

Key technical innovations:
- Custom-trained interactable icon detection model using 67k screenshots from popular websites
- Specialized BLIP-v2 model fine-tuned on 7k icon-description pairs for extracting functional semantics
- Novel combination of icon detection, OCR, and semantic understanding to create structured UI representations

The results are impressive:
- Outperforms GPT-4V baseline by significant margins on the ScreenSpot benchmark
- Achieves 73% accuracy on Mind2Web without requiring HTML data
- Demonstrates a 57.7% success rate on AITW mobile tasks

What makes OmniParser special is its ability to work across platforms (mobile, desktop, web) using only screenshot data – no HTML or view hierarchy needed. This opens up exciting possibilities for building truly universal GUI automation tools.

The team has open-sourced both the interactable region detection dataset and icon description dataset to accelerate research in this space.

Kudos to the Microsoft Research team for pushing the boundaries of what's possible with pure vision-based GUI understanding!

What are your thoughts on vision-based GUI automation?

liked 2 models about 2 months ago

meta-llama/Llama-Guard-3-1B

Text Generation • Updated Sep 26 • 14k • 54

openai/whisper-large-v3-turbo

Automatic Speech Recognition • Updated Oct 4 • 1.83M • 1.4k

liked a model 3 months ago

google/vit-large-patch16-224

Image Classification • Updated Jun 23, 2022 • 29.1k • 25

liked a Space 4 months ago

Samidh Cope Dev

updated a Space 4 months ago

README