CU-1 for Autonomous UI Agent Systems: An Open Alternative to Proprietary Solutions

By Racine.AI & TW3 Partners

Published October 1, 2025

The proliferation of digital interfaces has created an urgent need for autonomous systems capable of understanding and interacting with graphical user interfaces. Even with the rise of Agentic AI, current approaches to UI automation still rely on extensive rule-based programming and can fail when interfaces evolve. To address this limitation, we developed a specialized detection model based on the RF-DETR-M architecture, optimized for real-time UI element detection in autonomous agent systems.

Our research question was direct: could we develop a detection transformer achieving robust, real-time performance on UI understanding while maintaining permissive open-source licensing for unrestricted commercial deployment? This work addresses a gap where existing high-performance solutions operate under restrictive licenses that prohibit practical commercial adoption.

Methodology Revision Notice

Important: This paper presents revised benchmark results following a methodological correction. Our initial evaluation used default YOLO detection parameters and baseline prompts, which do not reflect optimal performance conditions for either model. We subsequently re-evaluated both CU-1 and OmniParser V2 using their respective optimized detection thresholds (0.35 for CU-1, 0.05 for OmniParser V2 from official sources) and refined prompts for improved task instruction clarity. Both sets of results are presented for transparency, with the optimized evaluation better representing real-world deployment scenarios where parameters and prompts are tuned for specific use cases.

The Licensing Challenge in UI Detection

While technical performance remains paramount, licensing terms equally determine whether AI models can be deployed in commercial products. This consideration proved central to our development approach.

OmniParser V2, currently among the leading models for UI understanding, operates under the GNU Affero General Public License version 3 (AGPL-3.0). This copyleft license imposes substantial restrictions: any organization deploying OmniParser V2 in a network-accessible service (including SaaS platforms, internal tools, or cloud APIs) must release their complete application source code to users. For enterprises developing commercial automation products, agent platforms, or proprietary internal tooling, this requirement exposes business logic and competitive advantages, effectively prohibiting adoption despite strong technical capabilities.

We release our work under MIT License, which requires only attribution while permitting commercial use, modification, and distribution in closed-source products. Organizations can integrate our model into proprietary systems, adapt it for specialized domains, and deploy commercially without source code disclosure obligations. This removes the primary barrier preventing enterprise adoption of UI detection technology.

Organizations building autonomous agent systems no longer face a forced trade-off between technical excellence and commercial viability. Our work demonstrates that competitive performance can be achieved under commercially-friendly licensing, providing practitioners with an actionable path forward for production deployment.

Open-Source Philosophy and Community Impact

Our MIT License choice enables the global research community to validate, improve, and build upon our work without restrictions. Unlike proprietary models, CU-1's complete openness—including model weights, training code, and datasets—accelerates collective progress by removing friction from both research and commercial adoption. Organizations can confidently invest in CU-1-based systems knowing they retain full control over derivatives, encouraging substantial real-world deployment that benefits the entire ecosystem.

Training Methodology: Class-Agnostic Detection Approach

Our training approach prioritizes robust UI element localization over fine-grained classification, reflecting the insight that autonomous agents primarily need to know where interactive elements are located rather than distinguishing between dozens of UI element subtypes.

Class-Agnostic Philosophy

Traditional object detection models trained on COCO or similar datasets learn to classify objects into specific categories. For UI detection, this classification granularity often proves unnecessary and potentially counterproductive. Whether an interactive element is technically a "button," "link," or "icon" matters less to an agent than whether it can be reliably detected and interacted with based on semantic instructions from the language model layer.

We adopted a class-agnostic training regime where all UI elements are treated as a single "object" class. This design decision offers several advantages: the model focuses computational capacity on precise localization rather than distributing it across classification heads for numerous element types; training data requirements are simplified since annotators need only identify interactive regions without fine-grained categorization; and the approach generalizes better to novel UI element types not represented in training data, as the model learns general patterns of "interactivity" rather than specific element signatures.

Dataset Construction and Merging Strategy

Training data came from six distinct UI datasets sourced from Roboflow Universe that we merged into a unified corpus, creating a comprehensive representation of UI diversity. These datasets cover diverse UI paradigms: web applications with modern responsive designs, desktop software interfaces, mobile-optimized layouts, calendar and scheduling interfaces, website navigation elements, and interactive form components.

The constituent datasets include specialized collections for different UI contexts: general website elements capturing common web patterns, calendar interfaces with dense grid layouts, comprehensive UI component libraries, and web application screenshots representing real-world usage scenarios. Each dataset contributes unique visual patterns and interaction paradigms, ensuring the merged corpus represents the breadth of interfaces agents encounter in production.

The merging process required reconciling annotation formats to ensure consistent bounding box conventions across heterogeneous sources. All annotations were converted to single-class format during the merge, with validation to confirm that no annotations were lost when the original multi-class labels were collapsed. We deliberately oversampled challenging cases—small elements, dense layouts, low-contrast designs—to prevent the model from learning shortcuts that work on easy examples but fail on production interfaces.
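
For readers who want to reproduce this step, the following is a minimal sketch of such a merge, assuming each source dataset ships standard COCO JSON; the file names and helper name are illustrative rather than excerpts from our training code.

```python
# Minimal sketch of the single-class merge, assuming standard COCO JSON inputs.
# File names and the helper name are illustrative.
import json
from pathlib import Path

def merge_to_single_class(coco_paths, out_path):
    merged = {"images": [], "annotations": [],
              "categories": [{"id": 1, "name": "object"}]}
    next_img_id, next_ann_id = 1, 1
    for path in coco_paths:
        data = json.loads(Path(path).read_text())
        id_map = {}                      # old image id -> new global id
        for img in data["images"]:
            id_map[img["id"]] = next_img_id
            merged["images"].append({**img, "id": next_img_id})
            next_img_id += 1
        for ann in data["annotations"]:
            merged["annotations"].append({
                **ann,
                "id": next_ann_id,
                "image_id": id_map[ann["image_id"]],
                "category_id": 1,        # collapse every source class to "object"
            })
            next_ann_id += 1
    Path(out_path).write_text(json.dumps(merged))

# Hypothetical usage:
# merge_to_single_class(["web.json", "calendars.json"], "merged_single_class.json")
```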

The merged dataset totaled 2,656 images with over 150,000 bounding box annotations across training, validation, and test splits. This scale provides sufficient diversity for the model to learn generalizable UI patterns while avoiding overfitting to specific applications or design systems.

Benchmark Methodology: Rigorous Multi-Dataset Evaluation

Our evaluation leveraged multiple open-source datasets from Roboflow Universe to ensure comprehensive testing across diverse UI paradigms. This multi-dataset approach prevents overfitting to any single visual style or application domain, providing genuine assessment of generalization capability.

Benchmark Dataset Sources and Diversity

We constructed our evaluation suite from carefully selected Roboflow datasets representing different UI complexity profiles:

  • Web Navigation Interfaces: Modern responsive designs with dynamic layouts, dropdown menus, and interactive forms
  • Calendar and Scheduling Systems: Dense grid layouts with numerous small, visually similar interactive cells
  • E-commerce Platforms: Product catalogs, shopping carts, and checkout flows with mixed content types
  • Dashboard and Analytics: Data-heavy interfaces with charts, filters, and control panels
  • Mobile-Responsive Designs: Touch-optimized layouts adapted for desktop viewing

This diversity ensures our benchmark captures the full spectrum of interfaces that autonomous agents encounter in production deployment, from simple login forms to complex enterprise dashboards.

Model Configuration and Hyperparameter Settings

To ensure fair comparison, we optimized detection thresholds for both models using their respective best practices:

OmniParser V2 Configuration:

  • Confidence threshold: 0.05 (optimal value identified from Hugging Face Space)
  • IoU threshold: 0.1 (non-maximum suppression)
  • Settings source: Default parameters from OmniParser V2's official deployment, ensuring we used the model as intended by its creators

CU-1 Configuration:

  • Confidence threshold: 0.35 (empirically optimized on validation set)
  • IoU threshold: Not applicable (DETR-based architecture uses learned attention rather than traditional NMS)

These threshold differences reflect architectural distinctions between the models. OmniParser V2's lower confidence threshold (0.05) compensates for its tendency toward conservative detection, while CU-1's higher threshold (0.35) reflects the model's stronger confidence calibration and reduces false positives.
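
As an illustration of how these operating points translate into post-processing, here is a hedged sketch; the detection dictionaries and helper names are assumptions for exposition, not the actual inference code of either model.

```python
CU1_CONF_THRESHOLD = 0.35          # DETR-style output: confidence filtering only
OMNIPARSER_CONF_THRESHOLD = 0.05   # YOLO-style output: NMS with IoU 0.1

def iou(a: dict, b: dict) -> float:
    ax1, ay1, ax2, ay2 = a["box"]
    bx1, by1, bx2, by2 = b["box"]
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def filter_detections(dets: list[dict], conf_thr: float,
                      nms_iou: float | None = None) -> list[dict]:
    kept = [d for d in dets if d["score"] >= conf_thr]
    if nms_iou is None:                 # CU-1 path: no traditional NMS
        return kept
    kept.sort(key=lambda d: d["score"], reverse=True)
    final: list[dict] = []
    for d in kept:                      # greedy NMS for the YOLO-style model
        if all(iou(d, k) < nms_iou for k in final):
            final.append(d)
    return final
```

With these defaults, `filter_detections(dets, CU1_CONF_THRESHOLD)` mirrors the CU-1 setting, while `filter_detections(dets, OMNIPARSER_CONF_THRESHOLD, nms_iou=0.1)` mirrors the OmniParser V2 setting.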

Detection Output Processing and Annotation Logic

Our evaluation pipeline processes detection outputs through a systematic annotation scheme that enables precise agent decision-making:

Bounding Box Processing: Each model's raw detection outputs (bounding boxes + confidence scores) are processed into standardized format with pixel-precise coordinates normalized to image dimensions.

Unique ID Assignment: Every detected element receives a unique integer identifier (1, 2, 3, ..., N) that serves as the interaction target for the language model. This ID-based system mirrors real agent deployment where the LLM must select specific elements by reference rather than description.

Annotation Overlay Generation: Detection results are rendered into visual overlays (a minimal sketch follows this list) showing:

  • Colored bounding boxes around each detected element
  • Numeric ID labels clearly visible within each box
  • Confidence scores displayed alongside IDs for transparency
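
The sketch below shows one way this overlay step can be implemented, assuming detections arrive as dictionaries with a pixel-space box and a confidence score; the colors, font, and label placement are illustrative rather than the exact rendering used in our pipeline.

```python
# Minimal overlay sketch using Pillow; styling choices are illustrative.
from PIL import Image, ImageDraw, ImageFont

def annotate(screenshot_path: str, detections: list, out_path: str) -> list:
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    annotated = []
    for idx, det in enumerate(detections, start=1):   # sequential IDs 1..N
        x1, y1, x2, y2 = det["box"]
        draw.rectangle((x1, y1, x2, y2), outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), f"{idx} ({det['score']:.2f})",
                  fill="red", font=font)
        annotated.append({"id": idx, **det})
    img.save(out_path)
    return annotated   # ID-to-box mapping reused later for success checking
```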

Figure: Example of our annotation system showing detected UI elements with unique IDs for agent interaction

Language Model Integration: The LLM receives three inputs: (1) original screenshot, (2) annotated overlay with visible IDs, and (3) natural language task instruction. It must output a specific ID number corresponding to the element that satisfies the instruction. This setup tests end-to-end agent capability rather than isolated detection performance.
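
The sketch below shows one way this packaging and ID parsing could look; `call_llm` is a hypothetical stand-in for whatever multimodal client is used in deployment, and the prompt wording is illustrative.

```python
# Hedged sketch of input packaging and ID extraction; `call_llm` is hypothetical.
import re

PROMPT_TEMPLATE = (
    "You are given a screenshot and the same screenshot annotated with "
    "numbered UI elements.\nTask: {instruction}\n"
    "Answer with the single ID of the element to click."
)

def select_element_id(instruction: str, screenshot: bytes,
                      annotated_overlay: bytes, call_llm) -> int | None:
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    reply = call_llm(prompt=prompt, images=[screenshot, annotated_overlay])
    match = re.search(r"\d+", reply)      # extract the first integer ID
    return int(match.group()) if match else None
```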

Evaluation Metrics and Success Criteria

Our primary metric—task success rate—measures whether the complete perception-action pipeline achieves the intended interaction:

Success Definition: A task succeeds if and only if the clicked point (center of the selected element's bounding box) falls within the ground truth target area. This binary metric directly correlates with real agent utility.
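
In code, this criterion reduces to a center-point containment test; the sketch below assumes boxes are (x1, y1, x2, y2) tuples in pixel coordinates.

```python
# Minimal sketch of the binary success criterion.
def click_point(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)   # center of the selected element

def task_success(selected_box, ground_truth_box) -> bool:
    cx, cy = click_point(selected_box)
    gx1, gy1, gx2, gy2 = ground_truth_box
    return gx1 <= cx <= gx2 and gy1 <= cy <= gy2
```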

Failure Mode Analysis: We categorize failures into:

  • Detection failures: Target element not detected by the vision model
  • Selection failures: Target detected but LLM selects wrong element ID
  • Localization failures: Correct element selected but bounding box insufficiently precise

This granular failure analysis reveals where improvements would most impact overall system performance.

Dataset Construction and Contamination Prevention

Our benchmark dataset (1,639 images) was collected between January and March 2025, six months after training data collection ended in June 2024. This temporal gap ensures evolving interfaces and design trends in the benchmark were unavailable during training.

The training set (2,600 images) and the benchmark draw from entirely disjoint applications and websites: any application represented in the training data is excluded from the benchmark.

We employed perceptual hashing to compute visual fingerprints for every image. Any pair with similarity exceeding 95% was flagged for manual review, with exact pixel-level deduplication applied.
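
A sketch of this screening step is given below using the open-source imagehash library; mapping the 95% similarity criterion to a Hamming-distance budget of 5% of the hash bits is our assumption for illustration, not the exact threshold logic of the original pipeline.

```python
# Near-duplicate screening sketch; folder layout and PNG extension are illustrative.
from itertools import combinations
from pathlib import Path
from PIL import Image
import imagehash

def flag_near_duplicates(train_dir: str, bench_dir: str, hash_size: int = 8):
    max_distance = int(hash_size * hash_size * 0.05)   # e.g. 3 bits for an 8x8 hash
    hashes = {}
    for split, folder in (("train", train_dir), ("bench", bench_dir)):
        for p in Path(folder).glob("*.png"):
            hashes[(split, p.name)] = imagehash.phash(Image.open(p), hash_size)
    flagged = []
    for (a, ha), (b, hb) in combinations(hashes.items(), 2):
        if a[0] != b[0] and ha - hb <= max_distance:   # only cross-split pairs
            flagged.append((a[1], b[1]))
    return flagged   # pairs to send for manual review
```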

A random sample of 500 images from each dataset underwent manual review by three independent annotators, who verified that no visually similar images appeared across the two datasets.

| Characteristic | Training Dataset | Benchmark Dataset |
|---|---|---|
| Number of images | 2,600 | 1,639 |
| Average resolution | 1920×1080 | 1920×1080 |
| UI element classes | 1 (object) | 1 (object) |
| Data sources | 6 merged Roboflow datasets | Disjoint applications |
| Collection period | January 2024 - June 2024 | January 2025 - March 2025 |
| Visual themes | Light (60%), Dark (40%) | Light (55%), Dark (45%) |
| Annotation format | COCO JSON | COCO JSON |
| License | MIT | MIT |

End-to-End Agent Evaluation Logic

Unlike traditional object detection benchmarks measuring raw detection accuracy, our evaluation measures complete agent task success. This distinction is critical: an agent must not only detect UI elements accurately but also make correct interaction decisions based on natural language instructions.

Evaluation Pipeline:

The benchmark implements a complete perception-decision-action loop mirroring real-world deployment. Both CU-1 and OmniParser V2 process identical screenshots, generating detected UI elements with bounding boxes and confidence scores. Detection outputs are rendered into annotated overlays with unique element IDs. A large language model receives the original screenshot, annotated overlay, detected elements list, and task instruction, then outputs a click action with the chosen element ID. We evaluate success by checking whether the clicked point falls within the ground truth bounding box.
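
The loop below sketches this pipeline at a high level; `detect`, `annotate`, and `select_element_id` are stand-ins for the components described in this post, and the sample fields are illustrative.

```python
# High-level sketch of the perception-decision-action loop; field names are assumed.
def evaluate(samples, detect, annotate, select_element_id):
    """Each sample is a dict with (illustrative) fields:
    "id", "screenshot", "instruction", "ground_truth_box", "category"."""
    results = []
    for sample in samples:
        detections = detect(sample["screenshot"])               # perception
        elements = annotate(sample["screenshot"], detections)   # ID overlay
        chosen_id = select_element_id(sample["instruction"], elements)  # decision
        chosen = next((e for e in elements if e["id"] == chosen_id), None)
        if chosen is None:
            success = False                                     # invalid or missing ID
        else:
            x1, y1, x2, y2 = chosen["box"]
            cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0           # simulated click
            gx1, gy1, gx2, gy2 = sample["ground_truth_box"]
            success = gx1 <= cx <= gx2 and gy1 <= cy <= gy2     # action check
        results.append({"sample_id": sample["id"],
                        "category": sample["category"],
                        "success": success})
    return results
```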

Why This Methodology Matters:

Traditional benchmarks measure whether models find objects in isolation, but UI automation requires complete systems to make semantically correct decisions. A model might achieve high mAP yet still fail if the LLM cannot distinguish which button matches the instruction based on provided context. Our evaluation captures this end-to-end reliability.

Comparative Fairness:

To ensure fair comparison, we enforce strict parity. Both models process identical screenshots at the same resolution. The same LLM evaluates both models' outputs with identical prompts. Identical success criteria and hardware apply to both models.

Incremental Checkpointing:

Given the computational expense of evaluating 1,639 images through complete agent pipelines, we implemented incremental checkpointing every 2 samples, saving intermediate results. This enables recovery from interruptions and provides real-time progress monitoring.
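
A minimal sketch of this checkpointing scheme follows, assuming per-sample results are JSON-serializable dictionaries; the checkpoint path and resume logic are illustrative.

```python
# Checkpointing sketch; `evaluate_one` is assumed to return a dict with "sample_id".
import json
from pathlib import Path

CHECKPOINT_EVERY = 2

def run_with_checkpoints(samples, evaluate_one, checkpoint_path="results.json"):
    path = Path(checkpoint_path)
    results = json.loads(path.read_text()) if path.exists() else []
    done = {r["sample_id"] for r in results}          # resume after interruption
    for i, sample in enumerate(samples, start=1):
        if sample["id"] in done:
            continue
        results.append(evaluate_one(sample))
        if i % CHECKPOINT_EVERY == 0:
            path.write_text(json.dumps(results))      # intermediate save
    path.write_text(json.dumps(results))              # final save
    return results
```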

Category Stratification:

The WebClick benchmark includes category labels (agent browse, calendars, human browse) representing different UI complexity profiles. We track accuracy per category to identify where models excel or struggle.
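
Tracking this stratification is straightforward; the sketch below assumes each result record carries a category label and a boolean success flag.

```python
# Per-category accuracy sketch over result records with "category" and "success".
from collections import defaultdict

def accuracy_by_category(results):
    totals, correct = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        correct[r["category"]] += int(r["success"])
    return {cat: correct[cat] / totals[cat] for cat in totals}

# e.g. {"agent_browse": 0.66, "calendars": 0.64, "human_browse": 0.83}
```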

Detection Annotation Methodology

Our annotation system bridges the gap between raw detection outputs and actionable agent decisions through a systematic ID-based labeling scheme that enables precise element targeting.

ID-Based Element Identification

Each detected UI element receives a unique sequential identifier that serves as the interaction target for downstream language model reasoning. This approach offers several advantages over alternative schemes:

Unambiguous Reference: Rather than describing elements by visual characteristics ("the blue button in the top right"), the agent can precisely specify targets by ID ("click element 23"), eliminating ambiguity in crowded interfaces.

Scale Independence: The system works equally well for simple interfaces with 5 elements or complex dashboards with 100+ elements, as each retains its distinct identifier regardless of visual similarity to neighbors.

LLM Integration: Language models can reason about element relationships and make selection decisions based on the annotated overlay, combining spatial understanding with semantic interpretation of the task instruction.

Annotation Pipeline Process

Step 1: Raw Detection Processing

  • Model outputs filtered by confidence threshold (0.35 for CU-1, 0.05 for OmniParser V2)
  • Bounding boxes normalized to image coordinates and validated for geometric consistency (a minimal sketch follows this list)
  • Duplicate detections merged using spatial overlap analysis
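
The sketch below illustrates the geometric validation in Step 1, assuming boxes arrive as (x1, y1, x2, y2) pixel tuples; duplicate merging can reuse an IoU-based filter like the one sketched in the configuration section above.

```python
# Geometric validation sketch; the minimum size and field names are assumptions.
def normalize_and_validate(detections, img_w: int, img_h: int, min_size: int = 2):
    cleaned = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        x1, x2 = sorted((min(max(0, x1), img_w), min(max(0, x2), img_w)))  # clamp
        y1, y2 = sorted((min(max(0, y1), img_h), min(max(0, y2), img_h)))
        if x2 - x1 < min_size or y2 - y1 < min_size:    # drop degenerate boxes
            continue
        cleaned.append({**det,
                        "box": (x1, y1, x2, y2),
                        "box_norm": (x1 / img_w, y1 / img_h,
                                     x2 / img_w, y2 / img_h)})
    return cleaned
```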

Step 2: ID Assignment and Overlay Generation

  • Sequential numbering applied to all detected elements (1, 2, 3, ..., N)
  • Visual overlay created with colored bounding boxes and prominent ID labels
  • Annotation rendered at sufficient resolution for LLM visual understanding

Step 3: Contextual Information Packaging

  • Original screenshot, annotated overlay, and element metadata combined into standardized input format
  • Task instruction paired with visual inputs for LLM processing
  • Ground truth target mapping maintained for success evaluation

This systematic approach ensures that both detection models receive identical treatment during evaluation, with differences in performance attributable to detection capability rather than post-processing variations.

Quality Assurance and Validation

Annotation Consistency: All annotations undergo automated validation to ensure ID numbers are sequential, visible, and properly positioned within detected bounding boxes.

Visual Clarity: Overlay generation optimizes for LLM readability while preserving original interface visibility, using contrasting colors and appropriate font sizing.

Ground Truth Alignment: Manual verification confirms that ground truth target areas correspond accurately to intended interactive elements, preventing evaluation artifacts from annotation errors.

Comparative Performance Analysis

We evaluated our model against OmniParser V2 on 1,639 benchmark samples under identical conditions, conducting two evaluation rounds: first with default YOLO parameters, then with optimized detection thresholds.

Initial Results (Default YOLO Parameters)

| Metric | CU-1 | OmniParser V2 | Improvement (points) |
|---|---|---|---|
| Overall Accuracy | 67.5% | 40.7% | +26.8 |
| Agent Browse | 67.0% | 37.0% | +30.0 |
| Calendars | 60.0% | 30.0% | +30.0 |
| Human Browse | 75.0% | 55.0% | +20.0 |

Results using default YOLO detection parameters on 1,639 samples.

Optimized Results (Tuned Detection Thresholds)

Performance Impact of Threshold and Prompt Optimization:

Initial evaluation with default YOLO parameters and baseline prompts yielded baseline results, which we subsequently improved through systematic optimization. The optimized thresholds (0.35 for CU-1, 0.05 for OmniParser V2) represent each model's optimal operating point, maximizing detection accuracy while minimizing false positives. Additionally, we refined the language model prompts to provide clearer task instructions and improve element selection consistency. This dual optimization process—combining detection parameter tuning with prompt engineering—reflects real-world deployment scenarios where both detection and language components are optimized for specific use cases rather than using arbitrary defaults.

| Metric | CU-1 | OmniParser V2 | Relative Improvement |
|---|---|---|---|
| Overall Accuracy | 70.8% | 58.8% | +20% |
| Agent Browse | 66% | 58% | +14% |
| Calendars | 64% | 46% | +39% |
| Human Browse | 83% | 73% | +14% |

Results using optimized detection thresholds on 1,639 samples.

Model Agreement

| Agreement Type | Count | Percentage |
|---|---|---|
| Both Correct | 331 | 33.1% |
| CU-1 Only | 344 | 34.4% |
| OmniParser V2 Only | 76 | 7.6% |
| Both Wrong | 249 | 24.9% |

Visual Quality Comparison

Below we present side-by-side comparisons demonstrating detection quality differences between our CU-1 model and OmniParser V2 on challenging real-world examples from the benchmark.

Legend: In each comparison image, the red bounding box indicates the ground truth target element. The red "+" marker shows where CU-1 clicked, while the blue "+" marker shows where OmniParser V2 clicked. A successful detection occurs when the click marker falls within the red ground truth box.

Analysis: Calendar interface with overlapping events across multiple time slots. The green-highlighted areas indicate correct click zones. CU-1 (red +) successfully clicks within the target event, while OmniParser V2 (blue +) misses the target entirely. Calendar grids are particularly challenging due to numerous visually similar rectangular elements where precise spatial localization is critical for task success.

Example 1: Complex Web Interface

Analysis: GitHub's homepage showing navigation menu and footer links. CU-1 detects 56 UI elements (red boxes) including all navigation items, buttons, and footer links with precise boundaries. OmniParser V2 detects only 33 elements (blue boxes), missing critical interactive components in the navigation bar and footer sections. The dense text-based navigation demonstrates CU-1's superior capability in detecting low-contrast, text-heavy UI elements where visual boundaries are subtle.


Example 2: Dense Layout Challenge

Analysis: Deezer music streaming interface featuring artist profiles, track listings, and media controls in a dark theme layout. CU-1 identifies 98 UI elements including individual track controls, navigation tabs, play buttons, and user interface elements. OmniParser V2 detects 60 elements, missing numerous smaller interactive components such as individual track action buttons and detailed media controls. This entertainment platform exemplifies complex content-rich interfaces where missing any interaction element could prevent users from accessing specific tracks or playlist functions.


Example 3: Edge Case Scenario

Analysis: Spanish train booking platform (Renfe) showing a date picker calendar interface. CU-1 detects 73 UI elements including individual calendar dates, navigation controls, form fields, and action buttons with precise boundaries. OmniParser V2 detects only 18 elements, missing the majority of interactive calendar cells and form components. This transportation booking interface demonstrates CU-1's superior granular detection capability—each calendar date must be individually selectable for successful trip planning, making comprehensive element detection critical for functional automation.


Resources and Availability

Model: racineai/CU-1

Datasets:

All under MIT License.

About Racine.AI

Racine.AI, the GenAI subsidiary of TW3 Partners, stands at the forefront of enterprise AI innovation for sovereign sectors. Our research and development division pursues a singular vision: building AI solutions that empower organizations in defense, energy, nuclear, and other critical infrastructure sectors to harness cutting-edge capabilities while maintaining absolute control over their data, systems, and strategic autonomy.

Our R&D philosophy centers on technological sovereignty—the belief that organizations operating in sensitive domains should never compromise their independence or security to access world-class AI. We invest deeply in open-source, sovereignty-compliant technologies precisely because we understand that true innovation in regulated sectors requires both technical excellence and legal certainty. This dual commitment has positioned Racine.AI as a trusted partner for enterprises where data governance isn't merely a compliance checkbox but a fundamental operational requirement.

Acknowledgments

We thank the Roboflow community for providing open-source UI datasets that made this work possible, and the RF-DETR authors for their foundational detection transformer architecture.

Authors

Léo Appourchaux - Lead Developer at TW3 Partners
Noé Brandolini - R&D at TW3 Partners - Student at École Centrale d'Électronique
David Soeiro-Vuong - R&D at Racine.ai - Student at École Centrale d'Électronique
Matis Despujols - R&D at TW3 Partners
Paul Lemaistre - GD at Racine.ai – Adjunct Professor at École Centrale d'Électronique

About École Centrale d'Électronique:

ECE, a multi-program, multi-campus, and multi-sector engineering school specializing in digital engineering, trains engineers and technology experts for the 21st century who are capable of meeting the challenges of the dual digital and sustainable development revolutions.


Model Citation

@misc{cu1-computer-use-agent-2025,
  author = {CU-1 Team},
  title = {CU-1: RF-DETR-M for Computer Use Agent},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/racineai/CU-1/}}
}

Citations

@misc{months_dataset,
  title = {months Dataset},
  type = {Open Source Dataset},
  author = {YOLO},
  howpublished = {\url{https://universe.roboflow.com/yolo-ujkjn/months}},
  journal = {Roboflow Universe},
  publisher = {Roboflow},
  year = {2025},
  month = {jul}
}

@misc{all-item-merged_dataset,
  title = {all-item-merged Dataset},
  type = {Open Source Dataset},
  author = {pc},
  howpublished = {\url{https://universe.roboflow.com/pc-fjqbc/all-item-merged}},
  journal = {Roboflow Universe},
  publisher = {Roboflow},
  year = {2022},
  month = {sep}
}

@misc{web-l67bi_dataset,
  title = {Web Dataset},
  type = {Open Source Dataset},
  author = {Vitaliy Roshko},
  howpublished = {\url{https://universe.roboflow.com/vitaliy-roshko-fu9tw/web-l67bi}},
  journal = {Roboflow Universe},
  publisher = {Roboflow},
  year = {2025},
  month = {aug}
}

@misc{website-elements-aneyv_dataset,
  title = {Website elements Dataset},
  type = {Open Source Dataset},
  author = {Dibyajyoti Mohanty},
  howpublished = {\url{https://universe.roboflow.com/dibyajyoti-mohanty-eqerk/website-elements-aneyv}},
  journal = {Roboflow Universe},
  publisher = {Roboflow},
  year = {2024},
  month = {jun}
}

@misc{website-elements-064fn_dataset,
  title = {Website Elements Dataset},
  type = {Open Source Dataset},
  author = {workspace},
  howpublished = {\url{https://universe.roboflow.com/workspace-8hc0w/website-elements-064fn}},
  journal = {Roboflow Universe},
  publisher = {Roboflow},
  year = {2025},
  month = {aug}
}

@misc{website-vsoao_dataset,
  title = {website Dataset},
  type = {Open Source Dataset},
  author = {ai research},
  howpublished = {\url{https://universe.roboflow.com/ai-research-zk9sn/website-vsoao}},
  journal = {Roboflow Universe},
  publisher = {Roboflow},
  year = {2025},
  month = {aug}
}

@dataset{hcompany2025uinavigate,
  author = {H Company Research Team},
  title = {WebClick: A Multimodal Localization Benchmark for Web-Navigation Models},
  year = {2025},
  publisher = {H Company},
}

@misc{andreux2025surferhmeetsholo1costefficient,
      title={Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights}, 
      author={Mathieu Andreux and Breno Baldas Skuk and Hamza Benchekroun and Emilien Biré and Antoine Bonnet and Riaz Bordie and Matthias Brunel and Pierre-Louis Cedoz and Antoine Chassang and Mickaël Chen and Alexandra D. Constantinou and Antoine d'Andigné and Hubert de La Jonquière and Aurélien Delfosse and Ludovic Denoyer and Alexis Deprez and Augustin Derupti and Michael Eickenberg and Mathïs Federico and Charles Kantor and Xavier Koegler and Yann Labbé and Matthew C. H. Lee and Erwan Le Jumeau de Kergaradec and Amir Mahla and Avshalom Manevich and Adrien Maret and Charles Masson and Rafaël Maurin and Arturo Mena and Philippe Modard and Axel Moyal and Axel Nguyen Kerbel and Julien Revelle and Mats L. Richter and María Santos and Laurent Sifre and Maxime Theillard and Marc Thibault and Louis Thiry and Léo Tronchon and Nicolas Usunier and Tony Wu},
      year={2025},
      eprint={2506.02865},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.02865}, 
}
