---
license: creativeml-openrail-m
---
# Stable Control Representations: Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
[Paper Link](https://arxiv.org/abs/2405.05852)
This model repo provides the Stable Diffusion model fine-tuned on images from the Something-Something-v2, EPIC-KITCHENS, and Bridge V2 datasets. It accompanies the [stable-control-representations](https://github.com/ykarmesh/stable-control-representations/) GitHub repo; a hedged usage sketch follows below.
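This card does not pin down an inference entry point, so the following is only a minimal sketch of how a fine-tuned Stable Diffusion checkpoint like this one could be loaded with `diffusers` to pull text-conditioned UNet features. The repo id, prompt, timestep, and hook location are illustrative assumptions, not the official SCR pipeline (see the GitHub repo for that).

```python
# Sketch: extract an intermediate UNet feature map as a candidate
# control representation. Not the official SCR code.
import torch
from diffusers import StableDiffusionPipeline

repo_id = "path/to/this-repo"  # hypothetical placeholder; substitute this repo's actual id
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(repo_id).to(device)

# Capture the UNet mid-block activations with a forward hook.
features = {}
def hook(module, inputs, output):
    features["mid"] = output

handle = pipe.unet.mid_block.register_forward_hook(hook)

# Stand-in RGB batch in [-1, 1]; replace with a real observation.
image = torch.randn(1, 3, 512, 512, device=device)
ids = pipe.tokenizer(
    ["pick up the cup"], padding="max_length", truncation=True,
    max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
).input_ids.to(device)
prompt_embeds = pipe.text_encoder(ids)[0]

with torch.no_grad():
    # Encode to latents, forward-diffuse at a small timestep, run the UNet once.
    latents = pipe.vae.encode(image).latent_dist.sample() * pipe.vae.config.scaling_factor
    t = torch.tensor([100], device=device)  # assumed small noise level
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)
    pipe.unet(noisy, t, encoder_hidden_states=prompt_embeds)

handle.remove()
rep = features["mid"]  # spatial feature map usable as a representation for control
```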
## Abstract
Vision- and language-guided embodied AI requires a fine-grained understanding of the physical world through language and visual inputs. Such capabilities are difficult to learn solely from task-specific data, which has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as those in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding—a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
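To make the "downstream control policies" part concrete, here is an illustrative sketch of how a feature map like `rep` above might feed a small policy head. The pooling choice, hidden size, and action dimension are assumptions for illustration, not the paper's exact architecture (1280 is the Stable Diffusion v1.5 mid-block channel count).

```python
# Illustrative policy head on top of a diffusion feature map; not the paper's architecture.
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    def __init__(self, in_channels: int, action_dim: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # collapse the spatial map
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, 256), nn.ReLU(),
            nn.Linear(256, action_dim),            # continuous action output
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.pool(feat).flatten(1)             # (B, C, H, W) -> (B, C)
        return self.mlp(x)

policy = PolicyHead(in_channels=1280, action_dim=7)  # assumed 7-DoF action space
action = policy(torch.randn(2, 1280, 8, 8))          # dummy batch of feature maps
```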
## Citing SCR
If you use SCR in your research, please cite [the following paper](https://arxiv.org/abs/2405.05852):
```bibtex
@inproceedings{gupta2024scr,
  title={Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control},
  author={Gunshi Gupta and Karmesh Yadav and Yarin Gal and Dhruv Batra and Zsolt Kira and Cong Lu and Tim G. J. Rudner},
  year={2024},
  eprint={2405.05852},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
## Acknowledgements
We are thankful to the creators of [Stable Diffusion](https://huggingface.co/runwayml/stable-diffusion-v1-5) for releasing the model, which has significantly contributed to progress in the field. We also thank the authors of [Visual Cortex](https://github.com/facebookresearch/eai-vc) for releasing the code for the CortexBench evaluations.