Moondream3 and Salesforce GTA-1 for UI grounding in computer-use agents

Community Article Published October 9, 2025

By popular request from the Discord community, we tested Moondream3 and Salesforce GTA-1 for UI grounding in computer-use agents.

The numbers on the ScreenSpot-v2 benchmark:

GTA-1 leads in accuracy (96% vs 84%), but Moondream3 is nearly 2x faster on average (1.04s vs 1.97s per prediction, about 1.9x).

The median time gap is even bigger: 0.78s vs 1.96s - that's a 2.5x speedup.
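For anyone checking the arithmetic, here is a minimal sketch that recomputes both speedups from the latencies reported above. The `speedup` helper is a name made up for this illustration, not part of Cua or either model's API.

```python
def speedup(slower_s: float, faster_s: float) -> float:
    """Return how many times faster `faster_s` is than `slower_s`."""
    return slower_s / faster_s


# Latencies in seconds, taken from the figures reported in this post.
mean_speedup = speedup(1.97, 1.04)    # GTA-1 mean vs Moondream3 mean
median_speedup = speedup(1.96, 0.78)  # GTA-1 median vs Moondream3 median

print(f"mean: {mean_speedup:.2f}x, median: {median_speedup:.2f}x")
# prints "mean: 1.89x, median: 2.51x"
```

The mean is pulled up by slower outliers on the Moondream3 side, which is why the median gap is wider.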

Both models are open-weight, self-hostable, and work out of the box with Cua: https://github.com/trycua/cua

Run the benchmark yourself: https://docs.trycua.com/docs/agent-sdk/benchmarks/screenspot-v2

FAQ: What's UI grounding?

It's the task of enabling a Vision-Language Model (VLM) to accurately identify and locate specific visual elements on a GUI based on a natural language query.

Example: "Click the Settings button" → VLM outputs coordinates (x, y)
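To make the scoring concrete, here is a rough sketch of how a grounding prediction might be judged. It assumes the common convention that a predicted click counts as correct when it lands inside the target element's bounding box. `BBox` and `grounding_accuracy` are hypothetical names for illustration, and the actual benchmark harness linked above may differ in its details.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A click is a hit if it falls inside the box, edges included.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Sequence[Tuple[float, float]],
    targets: Sequence[BBox],
) -> float:
    """Fraction of predicted (x, y) points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


# Made-up example: the "Settings" button spans (100, 40) to (180, 70).
settings_button = BBox(left=100, top=40, right=180, bottom=70)
print(grounding_accuracy([(140, 55), (300, 55)], [settings_button, settings_button]))
# prints 0.5: the first click hits the button, the second misses
```

Under this convention, any point inside the element counts equally, so a model does not need to predict the exact center to be scored as correct.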
