
---
license: llama2
pipeline_tag: image-text-to-text
---

UGround

UGround is a strong GUI visual grounding model trained with a simple recipe. See our homepage and paper for more details.
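A visual grounding model maps a referring expression (e.g. "the search button") on a screenshot to a location on that screenshot. The minimal Python sketch below shows how such an answer might be post-processed. It assumes the model replies with a single point written as `(x, y)` and that you know the width and height of the coordinate space the model answered in. Both assumptions should be checked against the inference code in the repository, and `parse_point` and `rescale_point` are hypothetical helpers, not part of the UGround codebase.

```python
import re
from typing import Optional, Tuple

Point = Tuple[float, float]

# Assumption: the answer contains a point written as "(x, y)", e.g. "(512, 300)".
_POINT_RE = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_point(answer: str) -> Optional[Point]:
    """Return the first "(x, y)" pair found in a text answer, or None if absent."""
    match = _POINT_RE.search(answer)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def rescale_point(point: Point, src_size: Tuple[int, int], dst_size: Tuple[int, int]) -> Point:
    """Map a point from the model's coordinate space (src_size) to the original
    screenshot (dst_size). Both sizes are (width, height); the source space is an
    assumption you must confirm for the checkpoint you use."""
    x, y = point
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    return x * dst_w / src_w, y * dst_h / src_h


point = parse_point("(512, 300)")
assert point is not None
print(point)  # (512.0, 300.0)
# Hypothetical: answer given on a 1000x1000 grid, screenshot is 1920x1080.
print(rescale_point(point, (1000, 1000), (1920, 1080)))  # (983.04, 324.0)
```

Keeping parsing and rescaling separate makes it easy to swap in whatever coordinate convention the actual inference code uses.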

- Homepage: https://osu-nlp-group.github.io/UGround/
- Repository: https://github.com/OSU-NLP-Group/UGround
- Paper: https://arxiv.org/abs/2410.05243
- Demo: https://huggingface.co/spaces/orby-osu/UGround
- Point of Contact: Boyu Gou
- Model Weights
- Code
  - Inference Code of UGround
  - Offline Experiments
    - ScreenSpot (along with referring expressions generated by GPT-4/4o)
    - Multimodal-Mind2Web
    - OmniAct
  - Online Experiments
    - Mind2Web-Live
    - AndroidWorld
- Data
  - Data Examples
  - Data Construction Scripts
  - Guidance on Open-source Data
- Online Demo (HF Spaces)
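Offline grounding benchmarks such as ScreenSpot are commonly scored by click accuracy: a prediction counts as correct when the predicted point falls inside the target element's bounding box. The Python sketch below illustrates that metric as we understand it; it is not the official evaluation script. The `(left, top, right, bottom)` pixel box format and the helper names are assumptions for illustration, and a missing prediction (`None`) is counted as a miss.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # assumed (left, top, right, bottom) in pixels


def point_in_bbox(point: Point, bbox: BBox) -> bool:
    """True if the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


def click_accuracy(preds: Sequence[Optional[Point]], boxes: Sequence[BBox]) -> float:
    """Fraction of predictions that land inside their ground-truth box."""
    if len(preds) != len(boxes):
        raise ValueError("preds and boxes must have the same length")
    if not boxes:
        return 0.0
    hits = sum(1 for p, b in zip(preds, boxes) if p is not None and point_in_bbox(p, b))
    return hits / len(boxes)


preds = [(50, 50), (300, 10), None, (5, 5)]
boxes = [(0, 0, 100, 100), (0, 0, 100, 100), (0, 0, 10, 10), (0, 0, 10, 10)]
print(click_accuracy(preds, boxes))  # 0.5
```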

Citation Information

If you find this work useful, please consider citing our papers:

@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}