---
license: other
license_name: flux-1-dev-non-commercial-license
license_link: LICENSE.md
pipeline_tag: text-to-image
---

# Flux-Mini

A 3.2B MMDiT distilled from Flux-dev for efficient text-to-image generation

GitHub: https://github.com/TencentARC/FluxKits

<div align="center">
<img src="flux_distill-flux-mini-teaser.jpg" width="800" alt="Teaser image">
</div>

Text-to-image (T2I) models keep growing stronger but also larger, which limits their practical use, especially on consumer-level devices. 
To bridge this gap, we distill the **12B** `Flux-dev` model into a **3.2B** `Flux-mini` model, aiming to preserve its strong image generation capability. 
Specifically, we prune the original `Flux-dev` by reducing its depth from `19 + 38` (the numbers of double blocks and single blocks) to `5 + 10`. 
The pruned model is then tuned with denoising and feature-alignment objectives on a curated image-text dataset.
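For concreteness, the depth pruning can be sketched with diffusers' `FluxTransformer2DModel`, whose `num_layers` and `num_single_layers` arguments set the double- and single-block counts. The kept-block indices below are illustrative placeholders, not the ones used for the released model; as described next, the actual initialization picks the empirically most important blocks, and the FluxKits repo is the authoritative reference.

```python
import torch
from diffusers import FluxTransformer2DModel

# Teacher: the full 19 + 38 block Flux-dev transformer.
teacher = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Student: same width, but only 5 double blocks and 10 single blocks.
student = FluxTransformer2DModel(num_layers=5, num_single_layers=10, guidance_embeds=True)

# Placeholder indices; the real selection is based on empirical block importance.
double_keep = [0, 4, 9, 14, 18]
single_keep = [0, 4, 8, 12, 16, 21, 25, 29, 33, 37]

# Initialize each student block from the corresponding kept teacher block.
for i, j in enumerate(double_keep):
    student.transformer_blocks[i].load_state_dict(
        teacher.transformer_blocks[j].state_dict()
    )
for i, j in enumerate(single_keep):
    student.single_transformer_blocks[i].load_state_dict(
        teacher.single_transformer_blocks[j].state_dict()
    )

# Non-block modules (embedders, final projection) are copied one-to-one.
for name in ["x_embedder", "context_embedder", "time_text_embed", "norm_out", "proj_out"]:
    getattr(student, name).load_state_dict(getattr(teacher, name).state_dict())
```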

We empirically found that different blocks affect generation quality to different degrees, so we initialize the student model with the most important teacher blocks. 
The distillation uses three objectives: a denoising loss, an output alignment loss, and a feature alignment loss. 
The feature alignment loss encourages the output of `block-x` in the student model to match that of `block-4x` in the teacher model. 
Distillation runs in two stages: first on `512x512` LAION images recaptioned with `Qwen-VL` for `90k steps`, 
then on `1024x1024` images generated by `Flux` from the prompts in `JourneyDB` for another `90k steps`.
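In plain PyTorch, the combined objective might look like the sketch below. The loss weights `w_out`/`w_feat` and the mechanism for capturing intermediate features (e.g., forward hooks) are assumptions for illustration, not values from the actual training recipe; only the three loss terms and the `block-x` to `block-4x` mapping come from the description above.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_out, teacher_out, target,
                 student_feats, teacher_feats,
                 w_out=1.0, w_feat=0.5):
    """student_feats[x] / teacher_feats[4*x] are intermediate block outputs,
    keyed by block index and captured e.g. with forward hooks.
    The weights w_out / w_feat are illustrative assumptions."""
    # 1) Denoising loss: the student regresses the usual diffusion target.
    loss_denoise = F.mse_loss(student_out, target)
    # 2) Output alignment: the student mimics the teacher's final prediction.
    loss_output = F.mse_loss(student_out, teacher_out.detach())
    # 3) Feature alignment: student block-x matches teacher block-4x.
    loss_feat = sum(
        F.mse_loss(student_feats[x], teacher_feats[4 * x].detach())
        for x in student_feats
    ) / len(student_feats)
    return loss_denoise + w_out * loss_output + w_feat * loss_feat
```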

## Limitations

With limited compute and data resources, the capability of our Flux-mini remains limited in certain domains. 
To facilitate the development of Flux-based models, we open-sourced the code to distill Flux at [this link](https://github.com/TencentARC/FluxKits). 
We invite everyone interested in this project to collaborate on building a more practical and powerful text-to-image model!  

The current model handles common subjects well, such as human/animal faces, landscapes, and fantasy and abstract scenes.  
Unfortunately, it still struggles in many scenarios, including but not limited to:
* Fine-grained details, such as text and human/animal structures
* Perspective and geometric structure
* Dynamics and motion
* Commonsense knowledge, e.g., brand logos
* Physical plausibility
* Cultural diversity

Since our model is trained with prompts from JourneyDB, we encourage users to prompt it with similar formats (compositions of nouns and adjectives) to achieve the best quality. 
For example: "profile of sad Socrates, full body, high detail, dramatic scene, Epic dynamic action, wide angle, cinematic, hyper-realistic, concept art, warm muted tones as painted by Bernie Wrightson, Frank Frazetta."
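If the checkpoint is compatible with diffusers' `FluxPipeline`, inference might look like the minimal sketch below. The repository id and the sampling settings are assumptions; the FluxKits repo is the authoritative reference for loading and running the model.

```python
import torch
from diffusers import FluxPipeline

# Assumed repository id; check the FluxKits repo for the official loading code.
pipe = FluxPipeline.from_pretrained("TencentARC/flux-mini", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# JourneyDB-style prompt: compositions of nouns and adjectives.
prompt = ("profile of sad Socrates, full body, high detail, dramatic scene, "
          "Epic dynamic action, wide angle, cinematic, hyper-realistic, "
          "concept art, warm muted tones as painted by Bernie Wrightson, "
          "Frank Frazetta")

# Sampling settings below are illustrative defaults, not tuned values.
image = pipe(prompt, height=1024, width=1024,
             guidance_scale=3.5, num_inference_steps=50).images[0]
image.save("socrates.png")
```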

Thank you for your attention! We will continue to improve our model and release new versions in the future.