license_link: LICENSE.md
pipeline_tag: text-to-image
---

## 🔥 Flux-Mini

A 3.2B MMDiT distilled from Flux-dev for efficient text-to-image generation

<div align="center">

[![Static Badge](https://img.shields.io/badge/Github-Repo-blue)](https://github.com/TencentARC/FluxKits)
[![Static Badge](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/TencentARC/flux-mini)
[![Static Badge](https://img.shields.io/badge/%F0%9F%A4%97%20Gradio%20Demo-Huggingface-orange)](https://huggingface.co/spaces/TencentARC/Flux-Mini)

</div>

<div align="center">
<img src="flux_distill-flux-mini-teaser.jpg" width="800" alt="Teaser image">
</div>

Nowadays, text-to-image (T2I) models are growing stronger but larger, which limits their practical applicability, especially on consumer-level devices. To bridge this gap, we distilled the **12B** `Flux-dev` model into a **3.2B** `Flux-mini` model, aiming to preserve its strong image generation capabilities. Specifically, we prune the original `Flux-dev` by reducing its depth from `19 + 38` (the numbers of double blocks and single blocks) to `5 + 10`. The pruned model is then tuned with denoising and feature alignment objectives on a curated image-text dataset.

We empirically found that different blocks have different impacts on generation quality, so we initialize the student model with several of the most important blocks. The distillation process uses three objectives: a denoising loss, an output alignment loss, and a feature alignment loss. The feature alignment loss encourages the output of `block-x` in the student model to match that of `block-4x` in the teacher model. Distillation is performed for `90k steps` on `512x512` Laion images recaptioned with `Qwen-VL` in the first stage, and for another `90k steps` on `1024x1024` images generated by `Flux-schnell` from prompts in `JourneyDB`.
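
To make these objectives concrete, here is a minimal PyTorch sketch of how the combined loss could be assembled. Everything in it is an illustrative assumption: the model interface (`return_features=True`), the equal loss weights, and the argument names are hypothetical, not the actual FluxKits implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, noisy_latents, t, text_emb, target):
    """Hypothetical sketch of the three distillation objectives."""
    # Student forward pass, keeping intermediate block outputs.
    s_out, s_feats = student(noisy_latents, t, text_emb, return_features=True)
    with torch.no_grad():  # the teacher is frozen during distillation
        t_out, t_feats = teacher(noisy_latents, t, text_emb, return_features=True)

    # 1) Denoise loss: regress the ground-truth denoising target.
    denoise = F.mse_loss(s_out, target)
    # 2) Output alignment loss: match the teacher's final prediction.
    out_align = F.mse_loss(s_out, t_out)
    # 3) Feature alignment loss: student block-x mimics teacher block-4x,
    #    mapping the 5 student double blocks onto the teacher's 19.
    feat_align = torch.stack(
        [F.mse_loss(s_feats[x], t_feats[4 * x]) for x in range(len(s_feats))]
    ).mean()

    # Equal weights are an assumption; the real recipe may weight terms differently.
    return denoise + out_align + feat_align
```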

🔥🔥 Nonetheless, with limited computing and data resources, the capability of our Flux-mini is still limited in certain domains. To facilitate the development of Flux-based models, we have open-sourced the code to distill Flux in [this repo](https://github.com/TencentARC/FluxKits). **We invite everyone interested in this project to collaborate with us and build a more applicable and powerful text-to-image model!**

### ⏰ Timeline

**[2024.11.26]** We are delighted to release the first version of Flux-Mini!

### ⚡️ Efficiency Comparison

We compared our Flux-Mini with Flux-Dev on a single `H20` GPU with `BF16` precision, `batch-size=1`, `deepspeed stage=2`, and `gradient_checkpoint=True`. For inference, we use `num_steps=50`. The costs of T5, CLIP, and VAE are included. `OOM` means out-of-memory.

| Resolution | Training Strategy | Model | Training Speed (s/img) | Training Memory (GB) | Inference Speed (s/img) | Inference Memory (GB) |
|-------|------|---------|---------|---------|---------|---------|
| 512 | LoRA (r=16) | Flux-dev | 1.10 | 35.91 | 11.11 | 35.11 |
| 512 | LoRA (r=16) | Flux-Mini | 0.33 | 19.06 | 3.07 | 18.49 |
| 512 | Full Finetune | Flux-dev | OOM | OOM | 11.11 | 35.11 |
| 512 | Full Finetune | Flux-Mini | 0.57 | 83.70 | 3.07 | 18.49 |
| 1024 | LoRA (r=16) | Flux-dev | 2.93 | 38.03 | 38.26 | 42.24 |
| 1024 | LoRA (r=16) | Flux-Mini | 1.05 | 22.21 | 10.31 | 25.61 |
| 1024 | Full Finetune | Flux-dev | OOM | OOM | 38.26 | 42.24 |
| 1024 | Full Finetune | Flux-Mini | 1.30 | 83.71 | 10.31 | 25.61 |
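
The training-side numbers above correspond to a small DeepSpeed setup. The sketch below is an assumed minimal configuration mirroring those settings (BF16, ZeRO stage 2, micro-batch 1); the stand-in model is illustrative, not the FluxKits config, and it is meant to be run under the `deepspeed` launcher.

```python
import torch
import deepspeed

# Assumed config mirroring the benchmark: BF16, ZeRO stage 2, micro-batch 1.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

model = torch.nn.Linear(64, 64)  # stand-in for the Flux-Mini transformer
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# Gradient checkpointing (`gradient_checkpoint=True` above) is enabled on the
# transformer itself, e.g. via torch.utils.checkpoint inside its blocks.
```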
 
 
### ⛅ Limitations

Compared with advanced text-to-image models, our model was trained with limited computing resources and synthetic data of mediocre quality, so its generation capability is still limited in certain categories.

The current model handles common subjects reasonably well, such as human and animal faces, landscapes, and fantasy or abstract scenes. Unfortunately, it still falls short in many scenarios, including but not limited to:
* Fine-grained details, such as human and animal structures
* Typography
* Perspective and geometric structure
* Dynamics and motion
* Commonsense knowledge, e.g., brand logos
* Physical plausibility
* Cultural diversity

Since our model is trained with prompts from JourneyDB, we encourage users to apply **similar prompt formats** (compositions of nouns and adjectives) to achieve the best quality. For example: "profile of sad Socrates, full body, high detail, dramatic scene, Epic dynamic action, wide angle, cinematic, hyper-realistic, concept art, warm muted tones as painted by Bernie Wrightson, Frank Frazetta."
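
To try this prompt style end to end, a minimal inference sketch is below. Loading the checkpoint through diffusers' `FluxPipeline` is an assumption on our part (the FluxKits repo ships its own inference scripts); `num_inference_steps=50` matches the benchmark setting above.

```python
import torch
from diffusers import FluxPipeline

# Assumption: the checkpoint is diffusers-compatible; otherwise use the
# scripts from https://github.com/TencentARC/FluxKits.
pipe = FluxPipeline.from_pretrained("TencentARC/flux-mini", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = (
    "profile of sad Socrates, full body, high detail, dramatic scene, "
    "epic dynamic action, wide angle, cinematic, hyper-realistic, concept art"
)
image = pipe(prompt, height=1024, width=1024, num_inference_steps=50).images[0]
image.save("socrates.png")
```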

We welcome everyone in the community to collaborate and open PRs for this model.

## 💻 Flux-NPU

The widespread development of NPUs has provided extra device options for model training and inference. To facilitate the usage of Flux, we provide a codebase that runs the training and inference code of FLUX on NPUs.

Please find more details in [this folder](https://github.com/TencentARC/FluxKits).
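
As a rough idea of what targeting an Ascend NPU looks like from PyTorch, the snippet below uses the `torch_npu` plugin to place a module on the `npu` device. This is a minimal assumed sketch; the actual integration lives in the FluxKits codebase.

```python
import torch
import torch_npu  # Ascend plugin; registers the "npu" device with PyTorch

assert torch.npu.is_available(), "no Ascend NPU visible"
device = torch.device("npu:0")

# Stand-in module; in practice this would be the Flux-Mini transformer.
model = torch.nn.Linear(64, 64).to(device, dtype=torch.bfloat16)
x = torch.randn(1, 64, device=device, dtype=torch.bfloat16)
with torch.no_grad():
    print(model(x).shape)
```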

### ⚡️ Efficiency Comparison on NPU

We compared our Flux-Mini with Flux-Dev on a single `Ascend 910B` NPU with `BF16` precision, `batch-size=1`, `deepspeed stage=2`, and `gradient_checkpoint=True`. For inference, we use `num_steps=50`. The costs of T5, CLIP, and VAE are included. `OOM` means out-of-memory.

| Resolution | Training Strategy | Model | Training Speed (s/img) | Training Memory (GB) | Inference Speed (s/img) | Inference Memory (GB) |
|-------|------|---------|---------|---------|---------|---------|
| 512 | LoRA (r=16) | Flux-dev | 1.07 | 38.45 | 11.00 | 58.62 |
| 512 | LoRA (r=16) | Flux-Mini | 0.37 | 20.64 | 3.26 | 19.48 |
| 512 | Full Finetune | Flux-dev | OOM | OOM | 11.00 | 58.62 |
| 512 | Full Finetune | Flux-Mini | OOM | OOM | 3.26 | 19.48 |
| 1024 | LoRA (r=16) | Flux-dev | 3.01 | 44.69 | OOM | OOM |
| 1024 | LoRA (r=16) | Flux-Mini | 1.06 | 25.84 | 10.60 | 27.76 |
| 1024 | Full Finetune | Flux-dev | OOM | OOM | OOM | OOM |
| 1024 | Full Finetune | Flux-Mini | OOM | OOM | 10.60 | 27.76 |

## 🐾 Disclaimer

Users are granted the freedom to create images using our model and tools, but they are expected to comply with local laws and use them responsibly. The developers do not assume any responsibility for potential misuse by users.

## 👏 Acknowledgements

We thank the authors of the following repos for their excellent contributions!

- [Flux](https://github.com/black-forest-labs/flux)
- [x-flux](https://github.com/XLabs-AI/x-flux)
- [MLLM-NPU](https://github.com/TencentARC/mllm-npu)

## 🔎 License

Our Flux-Mini model weights follow the [Flux-Dev non-commercial license](https://github.com/black-forest-labs/flux/blob/main/model_licenses/LICENSE-FLUX1-dev).

The rest of the code follows the Apache-2.0 License.