Jonathan Chang committed fdb04eb (parent: 4101a24): Update model card

README.md:
---
license: openrail++
tags:
- stable-diffusion
- text-to-image
pinned: true
---

# Model Card for flex-diffusion-2-1

<!-- Provide a quick summary of what the model is/does. [Optional] -->
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned with different aspect ratios.

## TLDR:

### There are 2 models in this repo:
- One based on stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1), finetuned for 6k steps.
- One based on stable-diffusion-2-base (stabilityai/stable-diffusion-2-base), finetuned for 6k steps on the same dataset.

For usage, see [How to Get Started with the Model](#how-to-get-started-with-the-model).

### It aims to solve the following issues:
1. Generated images look like they are cropped from a larger image.
   Examples:

2. Generating non-square images creates weird results, due to the model being trained on square images.
   Examples:


### Limitations:
1. It's trained on a small dataset, so its improvements may be limited.
2. For each aspect ratio, it's trained on only a fixed resolution, so it may not be able to generate images at other resolutions.
   For the 1:1 aspect ratio, it's finetuned at 512x512, even though flex-diffusion-2-1's parent (stable-diffusion-2-1) was last finetuned at 768x768.

### Potential improvements:
1. Train on a larger dataset.
2. Train on different resolutions even for the same aspect ratio.
3. Train on specific aspect ratios, instead of a range of aspect ratios.


# Table of Contents

- [Model Card for flex-diffusion-2-1](#model-card-for-flex-diffusion-2-1)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
  - [Results](#results)
- [Model Card Authors](#model-card-authors)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)


# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for dynamic aspect ratios.

finetuned resolutions:

|    |   width |   height | aspect ratio   |
|---:|--------:|---------:|:---------------|
|  0 |     512 |     1024 | 1:2            |
|  1 |     576 |     1024 | 9:16           |
|  2 |     576 |      960 | 3:5            |
|  3 |     640 |     1024 | 5:8            |
|  4 |     512 |      768 | 2:3            |
|  5 |     640 |      896 | 5:7            |
|  6 |     576 |      768 | 3:4            |
|  7 |     512 |      640 | 4:5            |
|  8 |     640 |      768 | 5:6            |
|  9 |     640 |      704 | 10:11          |
| 10 |     512 |      512 | 1:1            |
| 11 |     704 |      640 | 11:10          |
| 12 |     768 |      640 | 6:5            |
| 13 |     640 |      512 | 5:4            |
| 14 |     768 |      576 | 4:3            |
| 15 |     896 |      640 | 7:5            |
| 16 |     768 |      512 | 3:2            |
| 17 |    1024 |      640 | 8:5            |
| 18 |     960 |      576 | 5:3            |
| 19 |    1024 |      576 | 16:9           |
| 20 |    1024 |      512 | 2:1            |

- **Developed by:** Jonathan Chang
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** openrail++ (CreativeML Open RAIL++-M, same as the parent model)
- **Parent Model:** https://huggingface.co/stabilityai/stable-diffusion-2-1
- **Resources for more information:** More information needed

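If you want to request one of the finetuned resolutions at inference time, a small helper like the one below can map a desired aspect ratio to the closest entry in the table above. This is a minimal sketch for illustration only; the helper name and the hard-coded list are mine (copied from the table), not code shipped with the model.

```python
# Hypothetical helper: pick the finetuned resolution closest to a target aspect ratio.
# The (width, height) pairs are copied from the "finetuned resolutions" table above.
FINETUNED_RESOLUTIONS = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
]

def closest_resolution(target_aspect_ratio):
    """Return the finetuned (width, height) whose aspect ratio is nearest to the target."""
    return min(FINETUNED_RESOLUTIONS,
               key=lambda wh: abs(wh[0] / wh[1] - target_aspect_ratio))

print(closest_resolution(16 / 9))  # (1024, 576)
```
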
# Uses

- see https://huggingface.co/stabilityai/stable-diffusion-2-1


# Training Details

## Training Data

- LAION aesthetics dataset, the subset with an aesthetics rating of 6+
  - https://laion.ai/blog/laion-aesthetics/
  - https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
  - I only used a small portion of that, see [Preprocessing](#preprocessing)

- most common aspect ratios in the dataset (before preprocessing); a note on the ratio format follows these tables

|    | aspect_ratio   |   counts |
|---:|:---------------|---------:|
|  0 | 1:1            |   154727 |
|  1 | 3:2            |   119615 |
|  2 | 2:3            |    61197 |
|  3 | 4:3            |    52276 |
|  4 | 16:9           |    38862 |
|  5 | 400:267        |    21893 |
|  6 | 3:4            |    16893 |
|  7 | 8:5            |    16258 |
|  8 | 4:5            |    15684 |
|  9 | 6:5            |    12228 |
| 10 | 1000:667       |    12097 |
| 11 | 2:1            |    11006 |
| 12 | 800:533        |    10259 |
| 13 | 5:4            |     9753 |
| 14 | 500:333        |     9700 |
| 15 | 250:167        |     9114 |
| 16 | 5:3            |     8460 |
| 17 | 200:133        |     7832 |
| 18 | 1024:683       |     7176 |
| 19 | 11:10          |     6470 |

- predefined aspect ratios

|    |   width |   height | aspect ratio   |
|---:|--------:|---------:|:---------------|
|  0 |     512 |     1024 | 1:2            |
|  1 |     576 |     1024 | 9:16           |
|  2 |     576 |      960 | 3:5            |
|  3 |     640 |     1024 | 5:8            |
|  4 |     512 |      768 | 2:3            |
|  5 |     640 |      896 | 5:7            |
|  6 |     576 |      768 | 3:4            |
|  7 |     512 |      640 | 4:5            |
|  8 |     640 |      768 | 5:6            |
|  9 |     640 |      704 | 10:11          |
| 10 |     512 |      512 | 1:1            |
| 11 |     704 |      640 | 11:10          |
| 12 |     768 |      640 | 6:5            |
| 13 |     640 |      512 | 5:4            |
| 14 |     768 |      576 | 4:3            |
| 15 |     896 |      640 | 7:5            |
| 16 |     768 |      512 | 3:2            |
| 17 |    1024 |      640 | 8:5            |
| 18 |     960 |      576 | 5:3            |
| 19 |    1024 |      576 | 16:9           |
| 20 |    1024 |      512 | 2:1            |

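The aspect-ratio strings in the counts table above (e.g. `400:267`, `1000:667`) look like width:height reduced to lowest terms; that is an assumption about how the statistics were computed, not code from the training setup. A minimal sketch:

```python
from math import gcd

def aspect_ratio_string(width, height):
    """Reduce width:height by their greatest common divisor, e.g. 800x534 -> '400:267'."""
    d = gcd(width, height)
    return f"{width // d}:{height // d}"

print(aspect_ratio_string(1200, 800))  # 3:2
print(aspect_ratio_string(800, 534))   # 400:267
```
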
## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

### Preprocessing

1. Download the files with URL & caption from https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
   - I only used the first file, `train-00000-of-00007-29aec9150af50f9f.parquet`
2. Use img2dataset to convert to webdataset
   - https://github.com/rom1504/img2dataset
   - I put `train-00000-of-00007-29aec9150af50f9f.parquet` in a folder called `first-file`
   - the output folder is `/mnt/aesthetics6plus`; change this to your own folder

```bash
# set the input/output locations (change OUTPUT_FOLDER to your own path)
INPUT_FOLDER=first-file
OUTPUT_FOLDER=/mnt/aesthetics6plus

img2dataset --url_list $INPUT_FOLDER --input_format "parquet" \
  --url_col "URL" --caption_col "TEXT" --output_format webdataset \
  --output_folder $OUTPUT_FOLDER --processes_count 3 --thread_count 6 \
  --image_size 1024 --resize_only_if_bigger --resize_mode=keep_ratio_largest \
  --save_additional_columns '["WIDTH","HEIGHT","punsafe","similarity"]' --enable_wandb True
```

Note: the original snippet used `echo INPUT_FOLDER=...`, which only prints the text and never sets the variables, so `$INPUT_FOLDER` and `$OUTPUT_FOLDER` would be empty; plain shell assignments fix that.

3. The data-loading code does the preprocessing on the fly, so there is nothing else to prepare. It is not optimized for speed (GPU utilization fluctuates between 80% and 100%), and it is not written for multi-GPU training, so use it with caution. The code does the following (a sketch of this logic is given after the list):
   - use webdataset to load the data
   - calculate the aspect ratio of each image
   - find the closest aspect ratio and its associated resolution from the predefined resolutions: `argmin(abs(aspect_ratio - predefined_aspect_ratios))`. E.g. if the aspect ratio is 1:3, the closest predefined aspect ratio is 1:2, and its associated resolution is 512x1024.
   - keeping the aspect ratio, resize the image so that each side is larger than or equal to the associated resolution. E.g. resize to 512x(512*3) = 512x1536
   - random-crop the image to the associated resolution. E.g. crop to 512x1024
   - if more than 10% of the image is lost in the cropping, discard this example
   - batch examples by aspect ratio, so all examples in a batch have the same aspect ratio

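The sketch below illustrates that bucketing logic for a single image. It is written for this card only (it is not the actual data-loading code), uses plain PIL, and hard-codes the predefined resolutions from the table above; the webdataset loading and batching-by-aspect-ratio steps are omitted.

```python
import math
import random

from PIL import Image

# mirrors the "predefined aspect ratios" table above
PREDEFINED_RESOLUTIONS = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
]

def bucket_and_crop(img: Image.Image, max_loss: float = 0.10):
    """Resize and random-crop `img` to the closest predefined resolution.

    Returns (cropped_image, (width, height)), or None when more than
    `max_loss` of the resized image would be thrown away by the crop.
    """
    w, h = img.size
    aspect = w / h
    # closest predefined aspect ratio: argmin(abs(aspect_ratio - predefined_aspect_ratios))
    tw, th = min(PREDEFINED_RESOLUTIONS, key=lambda r: abs(r[0] / r[1] - aspect))
    # resize, keeping the aspect ratio, so that both sides are >= the target resolution
    scale = max(tw / w, th / h)
    rw, rh = math.ceil(w * scale), math.ceil(h * scale)
    # discard examples that would lose more than `max_loss` of their area in the crop
    if 1 - (tw * th) / (rw * rh) > max_loss:
        return None
    resized = img.resize((rw, rh), Image.BICUBIC)
    # random crop to the associated resolution
    left = random.randint(0, rw - tw)
    top = random.randint(0, rh - th)
    return resized.crop((left, top, left + tw, top + th)), (tw, th)
```
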
### Speeds, Sizes, Times

- Dataset size: 100k image-caption pairs, before filtering.
  - I didn't wait for the whole dataset to be downloaded; I copied the first 10 tar files and their index files to a new folder called `aesthetics6plus-small`, with 100k image-caption pairs in total. The full dataset is a lot bigger.

- Hardware: 1 RTX 3090 GPU

- Optimizer: 8bit Adam

- Batch size: 32
  - actual batch size: 2
  - gradient_accumulation_steps: 16
  - effective batch size: 32

- Learning rate: warmup to 2e-6 over 500 steps, then kept constant at 2e-6

- Training steps: 6k
- Epoch size (approximate): 32 * 6k / 100k = 1.92 (not accounting for the filtering)
  - Each example is seen about 1.92 times on average.

- Training time: approximately 1 day

## Results

More information needed

# Model Card Authors

Jonathan Chang


# How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler, UNet2DConditionModel

def use_DPM_solver(pipe):
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-1/unet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
)
# for v2-base, use the following lines instead:
# pipe = StableDiffusionPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-2-base",
#     unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-base/unet", torch_dtype=torch.float16),
#     torch_dtype=torch.float16,
# )
pipe = use_DPM_solver(pipe).to("cuda")

prompt = "a professional photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")
```
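Because the UNet is finetuned for the aspect ratios listed above, you will usually want to pass an explicit `width`/`height` (the standard diffusers pipeline arguments) matching one of the finetuned resolutions, rather than relying on the square default. For example, continuing from the code above:

```python
# generate a 3:2 landscape image at one of the finetuned resolutions (768x512)
image = pipe(prompt, width=768, height=512).images[0]
image.save("astronaut_rides_horse_landscape.png")
```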