---
license: llama2
pipeline_tag: image-text-to-text
---

# UGround

UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details.

![radar](https://osu-nlp-group.github.io/UGround/static/images/radar.png)

- **Homepage:** https://osu-nlp-group.github.io/UGround/
- **Repository:** https://github.com/OSU-NLP-Group/UGround
- **Paper:** https://arxiv.org/abs/2410.05243
- **Demo:** https://huggingface.co/spaces/orby-osu/UGround
- **Point of Contact:** [Boyu Gou](mailto:gou.43@osu.edu)

## Release Plan

- [x] Model Weights
- [ ] Code
  - [ ] Inference Code of UGround
  - [x] Offline Experiments
    - [x] ScreenSpot (along with referring expressions generated by GPT-4/4o)
    - [x] Multimodal-Mind2Web
    - [x] OmniACT
  - [ ] Online Experiments
    - [ ] Mind2Web-Live
    - [ ] AndroidWorld
- [ ] Data
  - [ ] Data Examples
  - [ ] Data Construction Scripts
  - [ ] Guidance of Open-source Data
- [x] Online Demo (HF Spaces)

## Citation Information

If you find this work useful, please consider citing our papers:

```
@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}
```
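
## Trying the Model (Unofficial Sketch)

The official inference code is still listed as upcoming in the release plan above. Until it is released, one lightweight way to try UGround is to query the hosted demo Space programmatically with `gradio_client`. The snippet below is only an illustrative sketch: the endpoint name and argument order are assumptions about the Space's interface, not documented values, so run `client.view_api()` first to see the real signature.

```python
# Unofficial sketch: query the hosted UGround demo Space programmatically.
# The endpoint/argument names below are assumptions; run client.view_api()
# to inspect the Space's actual interface before relying on this.
from gradio_client import Client, handle_file

client = Client("orby-osu/UGround")  # HF Space linked under "Demo" above
client.view_api()                    # prints the real endpoint signatures

# Hypothetical call: a GUI screenshot plus a natural-language description of
# the target element; the model is expected to ground the expression to a
# location in the screenshot.
result = client.predict(
    handle_file("screenshot.png"),        # local path or URL to a screenshot
    "the search button in the top bar",   # referring expression for the target
    api_name="/predict",                  # assumed default endpoint name
)
print(result)
```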