Moondream3 and Salesforce GTA-1 for UI grounding in computer-use agents
By popular request from the Discord community, we tested Moondream3 and Salesforce GTA-1 for UI grounding in computer-use agents.
The numbers on the ScreenSpot-v2 benchmark:
GTA-1 leads in accuracy (96% vs 84%), but Moondream3 is nearly 2x faster on average (1.04s vs 1.97s, about 1.9x).
The median time gap is even bigger: 0.78s vs 1.96s - that's a 2.5x speedup.
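To sanity-check the speedup figures, here is a minimal sketch that derives the ratios from the latencies quoted above. The `speedup` helper is our own illustrative name, not part of Cua or either model's tooling.

```python
def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times faster the fast model is than the slow one."""
    return slow_seconds / fast_seconds


# Latencies quoted in this post (GTA-1 vs Moondream3), in seconds.
mean_speedup = speedup(1.97, 1.04)    # average latency
median_speedup = speedup(1.96, 0.78)  # median latency

print(f"mean: {mean_speedup:.2f}x, median: {median_speedup:.2f}x")
# prints "mean: 1.89x, median: 2.51x"
```

So the average gap is a bit under 2x, while the median gap is right at 2.5x.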
Both models are open-weight, self-hostable, and work out of the box with Cua: https://github.com/trycua/cua
Run the benchmark yourself: https://docs.trycua.com/docs/agent-sdk/benchmarks/screenspot-v2
FAQ: What's UI grounding?
It's the task of enabling a Vision-Language Model (VLM) to accurately identify and locate specific visual elements on a GUI based on a natural language query.
Example: "Click the Settings button" → VLM outputs coordinates (x, y)
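To make that concrete, here is a rough sketch of how a predicted click could be scored. It assumes a prediction counts as correct when the (x, y) point lands inside the target element's bounding box, which is a common convention for click-based grounding but is our assumption here, not something this post states about ScreenSpot-v2. `BoundingBox`, `grounding_accuracy`, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        """True if the point lies inside the box, edges included."""
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(clicks, targets):
    """Fraction of predicted (x, y) clicks that land inside their target box."""
    if len(clicks) != len(targets):
        raise ValueError("clicks and targets must have the same length")
    if not clicks:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(clicks, targets))
    return hits / len(clicks)


# Made-up example: the Settings button occupies this region of the screenshot.
settings_button = BoundingBox(left=900, top=20, right=980, bottom=60)

# Two hypothetical model outputs for "Click the Settings button".
predicted_clicks = [(940, 40), (500, 300)]  # first hits, second misses

accuracy = grounding_accuracy(predicted_clicks, [settings_button, settings_button])
print(accuracy)  # prints 0.5
```

Under this scoring rule, a figure like 96% would mean that 96 out of every 100 predicted clicks landed inside the intended element.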