osunlp/UGround

	---
	license: llama2
	pipeline_tag: image-text-to-text
	---

	# UGround

	UGround is a strong GUI visual grounding model trained with a simple recipe. See our homepage and paper for more details.
	![radar](https://osu-nlp-group.github.io/UGround/static/images/radar.png)
	- Homepage: https://osu-nlp-group.github.io/UGround/
	- Repository: https://github.com/OSU-NLP-Group/UGround
	- Paper: https://arxiv.org/abs/2410.05243
	- Demo: https://huggingface.co/spaces/orby-osu/UGround
	- Point of Contact: [Boyu Gou](mailto:gou.43@osu.edu)


	- [x] Model Weights
	- [ ] Code
	  - [ ] Inference Code of UGround
	  - [x] Offline Experiments
	    - [x] ScreenSpot (along with referring expressions generated by GPT-4/4o)
	    - [x] Multimodal-Mind2Web
	    - [x] OmniAct
	  - [ ] Online Experiments
	    - [ ] Mind2Web-Live
	    - [ ] AndroidWorld
	- [ ] Data
	  - [ ] Data Examples
	  - [ ] Data Construction Scripts
	  - [ ] Guidance of Open-source Data
	- [x] Online Demo (HF Spaces)

	## Citation Information

	If you find this work useful, please consider citing our papers:

	```
	@article{gou2024uground,
	title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
	author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
	journal={arXiv preprint arXiv:2410.05243},
	year={2024},
	url={https://arxiv.org/abs/2410.05243},
	}

	@article{zheng2023seeact,
	title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
	author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
	journal={arXiv preprint arXiv:2401.01614},
	year={2024},
	}
	```