|
# GLIGEN: Open-Set Grounded Text-to-Image Generation |
|
|
|
These scripts contain the code to prepare the grounding data and train the GLIGEN model on the COCO dataset.
|
|
|
### Install the requirements |
|
|
|
```bash |
|
conda create -n diffusers python=3.10
|
conda activate diffusers |
|
pip install -r requirements.txt |
|
``` |
|
|
|
And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with: |
|
|
|
```bash |
|
accelerate config |
|
``` |
|
|
|
Or, for a default accelerate configuration without answering questions about your environment:
|
|
|
```bash |
|
accelerate config default |
|
``` |
|
|
|
Or, if your environment doesn't support an interactive shell (e.g., a notebook):
|
|
|
```python |
|
from accelerate.utils import write_basic_config |
|
|
|
write_basic_config() |
|
``` |
|
|
|
### Prepare the training data |
|
|
|
If you want to make your own grounding data, you need to install a few additional dependencies.
|
|
|
I used [RAM](https://github.com/xinyu1205/recognize-anything) to tag |
|
images, [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO) to detect objects,
|
and [BLIP2](https://huggingface.co/docs/transformers/en/model_doc/blip-2) to caption instances. |
|
|
|
Only RAM needs to be installed manually: |
|
|
|
```bash |
|
pip install git+https://github.com/xinyu1205/recognize-anything.git --no-deps |
|
``` |
|
|
|
Download the pre-trained models:
|
|
|
```bash |
|
huggingface-cli download --resume-download xinyu1205/recognize_anything_model ram_swin_large_14m.pth |
|
huggingface-cli download --resume-download IDEA-Research/grounding-dino-base |
|
huggingface-cli download --resume-download Salesforce/blip2-flan-t5-xxl |
|
huggingface-cli download --resume-download openai/clip-vit-large-patch14
|
huggingface-cli download --resume-download masterful/gligen-1-4-generation-text-box |
|
``` |
|
|
|
Make the training data on 8 GPUs: |
|
|
|
```bash |
|
torchrun --master_port 17673 --nproc_per_node=8 make_datasets.py \ |
|
--data_root /mnt/workspace/workgroup/zhizhonghuang/dataset/COCO/train2017 \ |
|
--save_root /root/gligen_data \ |
|
--ram_checkpoint /root/.cache/huggingface/hub/models--xinyu1205--recognize_anything_model/snapshots/ebc52dc741e86466202a5ab8ab22eae6e7d48bf1/ram_swin_large_14m.pth |
|
``` |
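For reference, the per-image logic inside `make_datasets.py` chains the three models roughly as follows. This is a simplified single-image sketch, not the actual script: the Grounding DINO and BLIP-2 calls use the `transformers` API, while the RAM calls are assumed from the recognize-anything repository.

```python
import torch
from PIL import Image
from transformers import (
    AutoModelForZeroShotObjectDetection,
    AutoProcessor,
    Blip2ForConditionalGeneration,
    Blip2Processor,
)
from ram import get_transform, inference_ram  # API assumed from the RAM repo
from ram.models import ram

device = "cuda" if torch.cuda.is_available() else "cpu"
image = Image.open("example.jpg").convert("RGB")

# 1) Tag the image with RAM.
ram_model = ram(pretrained="ram_swin_large_14m.pth", image_size=384, vit="swin_l").eval().to(device)
tags = inference_ram(get_transform(image_size=384)(image).unsqueeze(0).to(device), ram_model)[0]

# 2) Ground the tags as boxes with Grounding DINO
# (it expects lowercase phrases separated by periods).
dino_id = "IDEA-Research/grounding-dino-base"
dino_processor = AutoProcessor.from_pretrained(dino_id)
dino_model = AutoModelForZeroShotObjectDetection.from_pretrained(dino_id).to(device)
text = ". ".join(t.strip() for t in tags.lower().split("|")) + "."
inputs = dino_processor(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = dino_model(**inputs)
results = dino_processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, box_threshold=0.4, text_threshold=0.3,
    target_sizes=[image.size[::-1]],
)[0]

# 3) Caption each detected instance crop with BLIP-2.
blip_id = "Salesforce/blip2-flan-t5-xxl"
blip_processor = Blip2Processor.from_pretrained(blip_id)
blip_model = Blip2ForConditionalGeneration.from_pretrained(blip_id, torch_dtype=torch.float16).to(device)
for box in results["boxes"]:
    crop = image.crop(tuple(box.tolist()))
    blip_inputs = blip_processor(images=crop, return_tensors="pt").to(device, torch.float16)
    generated = blip_model.generate(**blip_inputs, max_new_tokens=20)
    print(box.tolist(), blip_processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```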
|
|
|
Alternatively, you can download the pre-computed COCO training data with:
|
|
|
```bash |
|
huggingface-cli download --resume-download Hzzone/GLIGEN_COCO coco_train2017.pth |
|
``` |
|
|
|
It has the following structure:
|
|
|
```json
[
    ...
    {
        'file_path': Path,
        'annos': [
            {
                'caption': Instance Caption,
                'bbox': bbox in xyxy,
                'text_embeddings_before_projection': CLIP text embedding before linear projection
            }
        ]
    }
    ...
]
```
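After downloading, you can load and inspect the annotations with plain `torch.load` (a minimal sketch, assuming the layout described above):

```python
import torch

# Each entry describes one image and the grounded instances inside it.
data = torch.load("coco_train2017.pth", map_location="cpu")
sample = data[0]
print(sample["file_path"])
for anno in sample["annos"]:
    # 'bbox' is in xyxy format; the CLIP embedding is taken before the
    # final linear projection of the text encoder.
    print(anno["caption"], anno["bbox"], anno["text_embeddings_before_projection"].shape)
```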
|
|
|
### Training commands |
|
|
|
The training script is heavily based on [train_controlnet.py](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet.py).
|
|
|
```bash |
|
accelerate launch train_gligen_text.py \ |
|
--data_path /root/data/zhizhonghuang/coco_train2017.pth \ |
|
--image_path /mnt/workspace/workgroup/zhizhonghuang/dataset/COCO/train2017 \ |
|
--train_batch_size 8 \ |
|
--max_train_steps 100000 \ |
|
--checkpointing_steps 1000 \ |
|
--checkpoints_total_limit 10 \ |
|
--learning_rate 5e-5 \ |
|
--dataloader_num_workers 16 \ |
|
--mixed_precision fp16 \ |
|
--report_to wandb \ |
|
--tracker_project_name gligen \ |
|
--output_dir /root/data/zhizhonghuang/ckpt/GLIGEN_Text_Retrain_COCO |
|
``` |
|
|
|
I trained the model on 8 A100 GPUs for about 11 hours (each GPU needs at least 24GB of memory). The generated images typically start to follow the layout at around 50k iterations.
|
|
|
Note that although the pre-trained GLIGEN model is loaded, the parameters of `fuser` and `position_net` are reset (see line 420 in `train_gligen_text.py`), so they are trained from scratch.
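Conceptually, the reset looks something like the following (a hypothetical sketch of the idea; the actual code is at line 420 of `train_gligen_text.py`):

```python
import torch.nn as nn

def reset_gligen_modules(unet):
    # Hypothetical helper: re-initialize only the GLIGEN-specific modules
    # ('fuser' gated self-attention blocks and the 'position_net' box/phrase
    # encoder), keeping the rest of the pre-trained UNet weights intact.
    for name, module in unet.named_modules():
        if "fuser" in name or "position_net" in name:
            for layer in module.modules():
                if isinstance(layer, (nn.Linear, nn.LayerNorm)):
                    layer.reset_parameters()
```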
|
|
|
The trained model can be downloaded with:
|
|
|
```bash |
|
huggingface-cli download --resume-download Hzzone/GLIGEN_COCO config.json diffusion_pytorch_model.safetensors |
|
``` |
|
|
|
You can run `demo.ipynb` to visualize the generated images. |
|
|
|
Example prompts: |
|
|
|
```python |
|
prompt = 'A realistic image of landscape scene depicting a green car parking on the left of a blue truck, with a red air balloon and a bird in the sky' |
|
boxes = [[0.041015625, 0.548828125, 0.453125, 0.859375], |
|
[0.525390625, 0.552734375, 0.93359375, 0.865234375], |
|
[0.12890625, 0.015625, 0.412109375, 0.279296875], |
|
[0.578125, 0.08203125, 0.857421875, 0.27734375]] |
|
gligen_phrases = ['a green car', 'a blue truck', 'a red air balloon', 'a bird'] |
|
``` |
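For a quick test outside the notebook, the prompt, boxes, and phrases can be fed to diffusers' `StableDiffusionGLIGENPipeline` (a minimal sketch; `demo.ipynb` may differ, and you can swap in your retrained UNet to test your own weights):

```python
import torch
from diffusers import StableDiffusionGLIGENPipeline

# Load the text-box GLIGEN pipeline; replace the UNet with your retrained
# checkpoint if you want to evaluate your own model.
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt=prompt,
    gligen_phrases=gligen_phrases,
    gligen_boxes=boxes,  # normalized xyxy coordinates in [0, 1]
    gligen_scheduled_sampling_beta=1.0,
    num_inference_steps=50,
).images[0]
image.save("example.png")
```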
|
|
|
Example images: |
|
![alt text](generated-images-100000-00.png) |
|
|
|
### Citation |
|
|
|
```bibtex
|
@article{li2023gligen, |
|
title={GLIGEN: Open-Set Grounded Text-to-Image Generation}, |
|
author={Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae}, |
|
journal={CVPR}, |
|
year={2023} |
|
} |
|
``` |