# GLIGEN: Open-Set Grounded Text-to-Image Generation

These scripts contain the code to prepare the grounding data and train the GLIGEN model on COCO dataset.

### Install the requirements

```bash
conda create -n diffusers python==3.10
conda activate diffusers
pip install -r requirements.txt
```

And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:

```bash
accelerate config
```

Or, for a default accelerate configuration without answering questions about your environment:

```bash
accelerate config default
```

Or, if your environment doesn't support an interactive shell (e.g., a notebook):

```python
from accelerate.utils import write_basic_config

write_basic_config()
```

### Prepare the training data

If you want to make your own grounding data, you need to install the dependencies below.

I used [RAM](https://github.com/xinyu1205/recognize-anything) to tag images,
[Grounding DINO](https://github.com/IDEA-Research/GroundingDINO) to detect objects,
and [BLIP2](https://huggingface.co/docs/transformers/en/model_doc/blip-2) to caption the detected instances.

Only RAM needs to be installed manually:

```bash
pip install git+https://github.com/xinyu1205/recognize-anything.git --no-deps
```

Download the pre-trained models:

```bash
huggingface-cli download --resume-download xinyu1205/recognize_anything_model ram_swin_large_14m.pth
huggingface-cli download --resume-download IDEA-Research/grounding-dino-base
huggingface-cli download --resume-download Salesforce/blip2-flan-t5-xxl
huggingface-cli download --resume-download openai/clip-vit-large-patch14
huggingface-cli download --resume-download masterful/gligen-1-4-generation-text-box
```
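
As a rough sketch of how the three models fit together for a single image (hypothetical and simplified; the authoritative logic lives in `make_datasets.py`, the image path is a placeholder, and the RAM calls are assumptions based on the `recognize-anything` package's inference utilities):

```python
import torch
from PIL import Image
from ram import get_transform, inference_ram
from ram.models import ram
from transformers import (
    AutoModelForZeroShotObjectDetection,
    AutoProcessor,
    Blip2ForConditionalGeneration,
    Blip2Processor,
)

device = "cuda"
image = Image.open("example.jpg").convert("RGB")

# 1) Tag the whole image with RAM (assumed usage of the recognize-anything package).
ram_model = ram(pretrained="ram_swin_large_14m.pth", image_size=384, vit="swin_l").eval().to(device)
ram_input = get_transform(image_size=384)(image).unsqueeze(0).to(device)
tags = inference_ram(ram_input, ram_model)[0]      # e.g. "car | truck | sky"
text_query = tags.replace(" | ", ". ") + "."       # Grounding DINO expects dot-separated phrases

# 2) Ground the tags with Grounding DINO to get instance boxes and phrases.
gd_processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
gd_model = AutoModelForZeroShotObjectDetection.from_pretrained("IDEA-Research/grounding-dino-base").to(device)
gd_inputs = gd_processor(images=image, text=text_query, return_tensors="pt").to(device)
with torch.no_grad():
    gd_outputs = gd_model(**gd_inputs)
detections = gd_processor.post_process_grounded_object_detection(
    gd_outputs,
    gd_inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]],
)[0]

# 3) Caption each detected instance crop with BLIP2.
blip_processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
blip_model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16
).to(device)
for box in detections["boxes"].tolist():           # boxes are in pixel xyxy coordinates
    crop = image.crop(box)
    blip_inputs = blip_processor(images=crop, return_tensors="pt").to(device, torch.float16)
    caption = blip_processor.decode(blip_model.generate(**blip_inputs)[0], skip_special_tokens=True)
    print(box, caption)
```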

Make the training data on 8 GPUs:

```bash
torchrun --master_port 17673 --nproc_per_node=8 make_datasets.py \
    --data_root /mnt/workspace/workgroup/zhizhonghuang/dataset/COCO/train2017 \
    --save_root /root/gligen_data \
    --ram_checkpoint /root/.cache/huggingface/hub/models--xinyu1205--recognize_anything_model/snapshots/ebc52dc741e86466202a5ab8ab22eae6e7d48bf1/ram_swin_large_14m.pth
```

You can download the prepared COCO training data with:

```bash
huggingface-cli download --resume-download Hzzone/GLIGEN_COCO coco_train2017.pth
```

The file has the following format:

```python
[
  ...
  {
    'file_path': <image path>,
    'annos': [
      {
        'caption': <instance caption>,
        'bbox': <bbox in xyxy format>,
        'text_embeddings_before_projection': <CLIP text embedding before the linear projection>
      },
      ...
    ]
  },
  ...
]
```
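
For reference, here is a minimal sketch of loading the file and inspecting one entry, assuming the structure above; the short CLIP snippet only illustrates what "before the linear projection" refers to (the text encoder without its projection head) and is not the training code:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

data = torch.load("coco_train2017.pth")
sample = data[0]
print(sample["file_path"])
for anno in sample["annos"]:
    # assumption: bbox is a normalized xyxy list and the embedding is a tensor
    print(anno["caption"], anno["bbox"], anno["text_embeddings_before_projection"].shape)

# "Before projection" means the output of CLIPTextModel, which has no text
# projection head (as opposed to CLIPTextModelWithProjection).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
tokens = tokenizer(["a green car"], padding=True, return_tensors="pt")
with torch.no_grad():
    pooled = text_encoder(**tokens).pooler_output  # assumption: the pooled per-phrase embedding is stored
print(pooled.shape)  # torch.Size([1, 768]) for CLIP ViT-L/14
```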

### Training commands

The training script is heavily based
on [`train_controlnet.py`](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet.py).

```bash
accelerate launch train_gligen_text.py \
    --data_path /root/data/zhizhonghuang/coco_train2017.pth \
    --image_path /mnt/workspace/workgroup/zhizhonghuang/dataset/COCO/train2017 \
    --train_batch_size 8 \
    --max_train_steps 100000 \
    --checkpointing_steps 1000 \
    --checkpoints_total_limit 10 \
    --learning_rate 5e-5 \
    --dataloader_num_workers 16 \
    --mixed_precision fp16 \
    --report_to wandb \
    --tracker_project_name gligen \
    --output_dir /root/data/zhizhonghuang/ckpt/GLIGEN_Text_Retrain_COCO
```

I trained the model on 8 A100 GPUs for about 11 hours (each GPU needs at least 24 GB of memory). The generated images
start to follow the layout at roughly 50k iterations.

Note that although the pre-trained GLIGEN model is loaded, the parameters of `fuser` and `position_net` are reset (see line 420 in `train_gligen_text.py`).
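
A hypothetical illustration of what this reset amounts to; the authoritative code is in `train_gligen_text.py`, and the simple Gaussian re-initialization below is an assumption rather than the script's exact scheme:

```python
import torch
from diffusers import UNet2DConditionModel

# Load the pre-trained GLIGEN UNet, then re-initialize the grounding-specific
# modules (the gated self-attention "fuser" layers and position_net) from scratch.
unet = UNet2DConditionModel.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", subfolder="unet"
)
for name, param in unet.named_parameters():
    if "fuser" in name or "position_net" in name:
        torch.nn.init.normal_(param, std=0.02)  # assumed re-init, not the script's exact one
```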

The trained model can be downloaded with:

```bash
huggingface-cli download --resume-download Hzzone/GLIGEN_COCO config.json diffusion_pytorch_model.safetensors
```

You can run `demo.ipynb` to visualize the generated images.

Example prompts:

```python
prompt = 'A realistic image of landscape scene depicting a green car parking on the left of a blue truck, with a red air balloon and a bird in the sky'
boxes = [[0.041015625, 0.548828125, 0.453125, 0.859375],
         [0.525390625, 0.552734375, 0.93359375, 0.865234375],
         [0.12890625, 0.015625, 0.412109375, 0.279296875],
         [0.578125, 0.08203125, 0.857421875, 0.27734375]]
gligen_phrases = ['a green car', 'a blue truck', 'a red air balloon', 'a bird']
```
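
As a minimal sketch, the prompt and boxes above can be fed through the diffusers GLIGEN pipeline; this assumes the retrained checkpoint is a UNet that can be swapped into the pre-trained `masterful/gligen-1-4-generation-text-box` pipeline:

```python
import torch
from diffusers import StableDiffusionGLIGENPipeline, UNet2DConditionModel

# Assumption: the downloaded Hzzone/GLIGEN_COCO files (config.json +
# diffusion_pytorch_model.safetensors) form a UNet checkpoint.
unet = UNet2DConditionModel.from_pretrained("Hzzone/GLIGEN_COCO", torch_dtype=torch.float16)
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", unet=unet, torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt=prompt,                      # prompt/boxes/gligen_phrases from the block above
    gligen_phrases=gligen_phrases,
    gligen_boxes=boxes,                 # normalized xyxy coordinates
    gligen_scheduled_sampling_beta=1.0,
    num_inference_steps=50,
).images[0]
image.save("gligen_example.png")
```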

Example images:
![alt text](generated-images-100000-00.png)

### Citation

```
@article{li2023gligen,
  title={GLIGEN: Open-Set Grounded Text-to-Image Generation},
  author={Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae},
  journal={CVPR},
  year={2023}
}
```