Jonathan Chang committed fdb04eb (parent: 4101a24): Update model card

README.md:
---
license: openrail++
tags:
- stable-diffusion
- text-to-image
pinned: true
---

# Model Card for flex-diffusion-2-1

<!-- Provide a quick summary of what the model is/does. [Optional] -->
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned with different aspect ratios.

## TLDR:

### There are 2 models in this repo:
- One based on stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1), finetuned for 6k steps.
- One based on stable-diffusion-2-base (stabilityai/stable-diffusion-2-base), finetuned for 6k steps on the same dataset.

For usage, see [How to Get Started with the Model](#how-to-get-started-with-the-model).

### It aims to solve the following issues:
1. Generated images look like they are cropped from a larger image.
   Examples:

2. Generating non-square images creates weird results, due to the model being trained on square images.
   Examples:


### Limitations:
1. It's trained on a small dataset, so its improvements may be limited.
2. For each aspect ratio, it's trained on only a fixed resolution, so it may not be able to generate images at other resolutions.
   For the 1:1 aspect ratio, it's finetuned at 512x512, even though flex-diffusion-2-1's parent (stable-diffusion-2-1) was last finetuned at 768x768.

### Potential improvements:
1. Train on a larger dataset.
2. Train on different resolutions even for the same aspect ratio.
3. Train on specific aspect ratios, instead of a range of aspect ratios.


# Table of Contents

- [Model Card for flex-diffusion-2-1](#model-card-for-flex-diffusion-2-1)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
  - [Results](#results)
- [Model Card Authors](#model-card-authors)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)


# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for dynamic aspect ratios.

finetuned resolutions:

|    |   width |   height | aspect ratio   |
|---:|--------:|---------:|:---------------|
|  0 |     512 |     1024 | 1:2            |
|  1 |     576 |     1024 | 9:16           |
|  2 |     576 |      960 | 3:5            |
|  3 |     640 |     1024 | 5:8            |
|  4 |     512 |      768 | 2:3            |
|  5 |     640 |      896 | 5:7            |
|  6 |     576 |      768 | 3:4            |
|  7 |     512 |      640 | 4:5            |
|  8 |     640 |      768 | 5:6            |
|  9 |     640 |      704 | 10:11          |
| 10 |     512 |      512 | 1:1            |
| 11 |     704 |      640 | 11:10          |
| 12 |     768 |      640 | 6:5            |
| 13 |     640 |      512 | 5:4            |
| 14 |     768 |      576 | 4:3            |
| 15 |     896 |      640 | 7:5            |
| 16 |     768 |      512 | 3:2            |
| 17 |    1024 |      640 | 8:5            |
| 18 |     960 |      576 | 5:3            |
| 19 |    1024 |      576 | 16:9           |
| 20 |    1024 |      512 | 2:1            |

- **Developed by:** Jonathan Chang
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** openrail++ (CreativeML Open RAIL++-M, same as the parent model)
- **Parent Model:** https://huggingface.co/stabilityai/stable-diffusion-2-1
- **Resources for more information:** More information needed

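If you want to request one of the finetuned resolutions at inference time, a small helper like the one below can map a desired aspect ratio to the closest entry in the table above. This is a minimal sketch for illustration only; the helper name and the hard-coded list are mine (copied from the table), not code shipped with the model.

```python
# Hypothetical helper: pick the finetuned resolution closest to a target aspect ratio.
# The (width, height) pairs are copied from the "finetuned resolutions" table above.
FINETUNED_RESOLUTIONS = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
]

def closest_resolution(target_aspect_ratio):
    """Return the finetuned (width, height) whose aspect ratio is nearest to the target."""
    return min(FINETUNED_RESOLUTIONS,
               key=lambda wh: abs(wh[0] / wh[1] - target_aspect_ratio))

print(closest_resolution(16 / 9))  # (1024, 576)
```
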
# Uses

- see https://huggingface.co/stabilityai/stable-diffusion-2-1


# Training Details

## Training Data

- LAION aesthetics dataset, the subset with an aesthetics rating of 6+
  - https://laion.ai/blog/laion-aesthetics/
  - https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
  - I only used a small portion of that, see [Preprocessing](#preprocessing)

- most common aspect ratios in the dataset (before preprocessing); a note on the ratio format follows these tables

|    | aspect_ratio   |   counts |
|---:|:---------------|---------:|
|  0 | 1:1            |   154727 |
|  1 | 3:2            |   119615 |
|  2 | 2:3            |    61197 |
|  3 | 4:3            |    52276 |
|  4 | 16:9           |    38862 |
|  5 | 400:267        |    21893 |
|  6 | 3:4            |    16893 |
|  7 | 8:5            |    16258 |
|  8 | 4:5            |    15684 |
|  9 | 6:5            |    12228 |
| 10 | 1000:667       |    12097 |
| 11 | 2:1            |    11006 |
| 12 | 800:533        |    10259 |
| 13 | 5:4            |     9753 |
| 14 | 500:333        |     9700 |
| 15 | 250:167        |     9114 |
| 16 | 5:3            |     8460 |
| 17 | 200:133        |     7832 |
| 18 | 1024:683       |     7176 |
| 19 | 11:10          |     6470 |

- predefined aspect ratios

|    |   width |   height | aspect ratio   |
|---:|--------:|---------:|:---------------|
|  0 |     512 |     1024 | 1:2            |
|  1 |     576 |     1024 | 9:16           |
|  2 |     576 |      960 | 3:5            |
|  3 |     640 |     1024 | 5:8            |
|  4 |     512 |      768 | 2:3            |
|  5 |     640 |      896 | 5:7            |
|  6 |     576 |      768 | 3:4            |
|  7 |     512 |      640 | 4:5            |
|  8 |     640 |      768 | 5:6            |
|  9 |     640 |      704 | 10:11          |
| 10 |     512 |      512 | 1:1            |
| 11 |     704 |      640 | 11:10          |
| 12 |     768 |      640 | 6:5            |
| 13 |     640 |      512 | 5:4            |
| 14 |     768 |      576 | 4:3            |
| 15 |     896 |      640 | 7:5            |
| 16 |     768 |      512 | 3:2            |
| 17 |    1024 |      640 | 8:5            |
| 18 |     960 |      576 | 5:3            |
| 19 |    1024 |      576 | 16:9           |
| 20 |    1024 |      512 | 2:1            |

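The aspect-ratio strings in the counts table above (e.g. `400:267`, `1000:667`) look like width:height reduced to lowest terms; that is an assumption about how the statistics were computed, not code from the training setup. A minimal sketch:

```python
from math import gcd

def aspect_ratio_string(width, height):
    """Reduce width:height by their greatest common divisor, e.g. 800x534 -> '400:267'."""
    d = gcd(width, height)
    return f"{width // d}:{height // d}"

print(aspect_ratio_string(1200, 800))  # 3:2
print(aspect_ratio_string(800, 534))   # 400:267
```
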
## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

### Preprocessing

1. Download the files with URL & caption from https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
   - I only used the first file, `train-00000-of-00007-29aec9150af50f9f.parquet`
2. Use img2dataset to convert to webdataset
   - https://github.com/rom1504/img2dataset
   - I put `train-00000-of-00007-29aec9150af50f9f.parquet` in a folder called `first-file`
   - the output folder is `/mnt/aesthetics6plus`; change this to your own folder

```bash
# set the input/output locations (change OUTPUT_FOLDER to your own path)
INPUT_FOLDER=first-file
OUTPUT_FOLDER=/mnt/aesthetics6plus

img2dataset --url_list $INPUT_FOLDER --input_format "parquet" \
  --url_col "URL" --caption_col "TEXT" --output_format webdataset \
  --output_folder $OUTPUT_FOLDER --processes_count 3 --thread_count 6 \
  --image_size 1024 --resize_only_if_bigger --resize_mode=keep_ratio_largest \
  --save_additional_columns '["WIDTH","HEIGHT","punsafe","similarity"]' --enable_wandb True
```

Note: the original snippet used `echo INPUT_FOLDER=...`, which only prints the text and never sets the variables, so `$INPUT_FOLDER` and `$OUTPUT_FOLDER` would be empty; plain shell assignments fix that.

3. The data-loading code does the preprocessing on the fly, so there is nothing else to prepare. It is not optimized for speed (GPU utilization fluctuates between 80% and 100%), and it is not written for multi-GPU training, so use it with caution. The code does the following (a sketch of this logic is given after the list):
   - use webdataset to load the data
   - calculate the aspect ratio of each image
   - find the closest aspect ratio and its associated resolution from the predefined resolutions: `argmin(abs(aspect_ratio - predefined_aspect_ratios))`. E.g. if the aspect ratio is 1:3, the closest predefined aspect ratio is 1:2, and its associated resolution is 512x1024.
   - keeping the aspect ratio, resize the image so that each side is larger than or equal to the associated resolution. E.g. resize to 512x(512*3) = 512x1536
   - random-crop the image to the associated resolution. E.g. crop to 512x1024
   - if more than 10% of the image is lost in the cropping, discard this example
   - batch examples by aspect ratio, so all examples in a batch have the same aspect ratio

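The sketch below illustrates that bucketing logic for a single image. It is written for this card only (it is not the actual data-loading code), uses plain PIL, and hard-codes the predefined resolutions from the table above; the webdataset loading and batching-by-aspect-ratio steps are omitted.

```python
import math
import random

from PIL import Image

# mirrors the "predefined aspect ratios" table above
PREDEFINED_RESOLUTIONS = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
]

def bucket_and_crop(img: Image.Image, max_loss: float = 0.10):
    """Resize and random-crop `img` to the closest predefined resolution.

    Returns (cropped_image, (width, height)), or None when more than
    `max_loss` of the resized image would be thrown away by the crop.
    """
    w, h = img.size
    aspect = w / h
    # closest predefined aspect ratio: argmin(abs(aspect_ratio - predefined_aspect_ratios))
    tw, th = min(PREDEFINED_RESOLUTIONS, key=lambda r: abs(r[0] / r[1] - aspect))
    # resize, keeping the aspect ratio, so that both sides are >= the target resolution
    scale = max(tw / w, th / h)
    rw, rh = math.ceil(w * scale), math.ceil(h * scale)
    # discard examples that would lose more than `max_loss` of their area in the crop
    if 1 - (tw * th) / (rw * rh) > max_loss:
        return None
    resized = img.resize((rw, rh), Image.BICUBIC)
    # random crop to the associated resolution
    left = random.randint(0, rw - tw)
    top = random.randint(0, rh - th)
    return resized.crop((left, top, left + tw, top + th)), (tw, th)
```
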
### Speeds, Sizes, Times

- Dataset size: 100k image-caption pairs, before filtering.
  - I didn't wait for the whole dataset to be downloaded; I copied the first 10 tar files and their index files to a new folder called `aesthetics6plus-small`, with 100k image-caption pairs in total. The full dataset is a lot bigger.

- Hardware: 1 RTX 3090 GPU

- Optimizer: 8bit Adam

- Batch size: 32
  - actual batch size: 2
  - gradient_accumulation_steps: 16
  - effective batch size: 32

- Learning rate: warmup to 2e-6 over 500 steps, then kept constant at 2e-6

- Training steps: 6k
- Epoch size (approximate): 32 * 6k / 100k = 1.92 (not accounting for the filtering)
  - Each example is seen about 1.92 times on average.

- Training time: approximately 1 day

## Results

More information needed

# Model Card Authors

Jonathan Chang


# How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler, UNet2DConditionModel

def use_DPM_solver(pipe):
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-1/unet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
)
# for v2-base, use the following lines instead:
# pipe = StableDiffusionPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-2-base",
#     unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-base/unet", torch_dtype=torch.float16),
#     torch_dtype=torch.float16,
# )
pipe = use_DPM_solver(pipe).to("cuda")

prompt = "a professional photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")
```
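Because the UNet is finetuned for the aspect ratios listed above, you will usually want to pass an explicit `width`/`height` (the standard diffusers pipeline arguments) matching one of the finetuned resolutions, rather than relying on the square default. For example, continuing from the code above:

```python
# generate a 3:2 landscape image at one of the finetuned resolutions (768x512)
image = pipe(prompt, width=768, height=512).images[0]
image.save("astronaut_rides_horse_landscape.png")
```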