---
license: other
license_name: flux-1-dev-non-commercial-license
license_link: LICENSE.md
pipeline_tag: text-to-image
---

# Flux-Mini

A 3.2B MMDiT distilled from Flux-dev for efficient text-to-image generation

GitHub: https://github.com/TencentARC/FluxKits

<div align="center">
<img src="flux_distill-flux-mini-teaser.jpg" width="800" alt="Teaser image">
</div>

Nowadays, text-to-image (T2I) models are growing stronger but larger, which limits their practical applicability, especially on consumer-level devices.
To bridge this gap, we distilled the **12B** `Flux-dev` model into a **3.2B** `Flux-mini` model, aiming to preserve its strong image generation capability.
Specifically, we prune the original `Flux-dev` by reducing its depth from `19 + 38` (the numbers of double blocks and single blocks) to `5 + 10`.
The pruned model is then tuned with denoising and feature-alignment objectives on a curated image-text dataset.
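To make the depth pruning concrete, here is a minimal sketch of how a `5 + 10` student could be initialized from the `19 + 38` teacher blocks. The toy block class, the helper `init_student_from_teacher`, and the keep-indices are illustrative assumptions, not the actual FluxKits code.

```python
import copy
import torch.nn as nn


class ToyBlock(nn.Module):
    """Stand-in for a Flux double/single transformer block (illustrative only)."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.proj(x)


def init_student_from_teacher(teacher_blocks, keep_indices):
    """Copy the selected teacher blocks to initialize a shallower student."""
    return nn.ModuleList(copy.deepcopy(teacher_blocks[i]) for i in keep_indices)


# Teacher depth as in Flux-dev: 19 double blocks and 38 single blocks.
teacher_double = nn.ModuleList(ToyBlock() for _ in range(19))
teacher_single = nn.ModuleList(ToyBlock() for _ in range(38))

# Student depth as in Flux-mini: 5 + 10. The indices below are a hypothetical
# "most important blocks" selection, not the one used for the released model.
student_double = init_student_from_teacher(teacher_double, [0, 4, 9, 14, 18])
student_single = init_student_from_teacher(
    teacher_single, [0, 4, 8, 12, 16, 20, 24, 28, 32, 37]
)
```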
We empirically found that different blocks have different impacts on generation quality, so we initialize the student model with several of the most important blocks.
The distillation process uses three objectives: a denoising loss, an output alignment loss, and a feature alignment loss.
The feature alignment loss encourages the output of `block-x` in the student model to match that of `block-4x` in the teacher model.
Distillation is performed in two stages: first for `90k` steps on `512x512` LAION images recaptioned with `Qwen-VL`, then for another `90k` steps on `1024x1024` images generated by `Flux` from prompts in `JourneyDB`.
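Schematically, the three objectives can be combined as in the sketch below. It uses toy tensors, plain MSE terms, a zero-indexed `block-x` to `block-4x` mapping, and placeholder loss weights, so it illustrates the structure of the objective rather than the exact training recipe.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_feats, teacher_feats,
                      student_out, teacher_out,
                      noise_pred, noise_target,
                      w_denoise=1.0, w_out=1.0, w_feat=1.0):
    """Schematic Flux-mini distillation objective (weights are placeholders).

    student_feats[i] is the output of student block i; it is aligned with
    teacher_feats[4 * i], i.e. student block-x mimics teacher block-4x.
    """
    # 1) Denoising loss against the regular training target.
    loss_denoise = F.mse_loss(noise_pred, noise_target)

    # 2) Output alignment: match the teacher's final prediction.
    loss_out = F.mse_loss(student_out, teacher_out)

    # 3) Feature alignment: student block-x vs. teacher block-4x.
    loss_feat = sum(
        F.mse_loss(s, teacher_feats[4 * i]) for i, s in enumerate(student_feats)
    ) / len(student_feats)

    return w_denoise * loss_denoise + w_out * loss_out + w_feat * loss_feat


# Toy example with random tensors, just to show the call signature.
d = 64
student_feats = [torch.randn(2, 16, d) for _ in range(5)]
teacher_feats = [torch.randn(2, 16, d) for _ in range(19)]
loss = distillation_loss(
    student_feats, teacher_feats,
    student_out=torch.randn(2, 16, d), teacher_out=torch.randn(2, 16, d),
    noise_pred=torch.randn(2, 16, d), noise_target=torch.randn(2, 16, d),
)
```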
## Limitations

With limited computing and data resources, the capability of our Flux-mini is still limited in certain domains.
To facilitate the development of Flux-based models, we have open-sourced the code for distilling Flux at [this link](https://github.com/TencentARC/FluxKits).
We invite everyone interested in this project to collaborate with us in building a more practical and powerful text-to-image model!
The current model handles common subjects reasonably well, such as human and animal faces, landscapes, and fantasy or abstract scenes.
Unfortunately, it still struggles in many scenarios, including but not limited to:

* Fine-grained details, such as text and human/animal anatomy
* Perspective and geometric structure
* Dynamics and motion
* Commonsense knowledge, e.g., brand logos
* Physical plausibility
* Cultural diversity

Since our model is trained with prompts from JourneyDB, we encourage users to apply similar prompt formats (compositions of nouns and adjectives) to achieve the best quality.
For example: "profile of sad Socrates, full body, high detail, dramatic scene, Epic dynamic action, wide angle, cinematic, hyper-realistic, concept art, warm muted tones as painted by Bernie Wrightson, Frank Frazetta."
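If the released weights follow the standard `diffusers` Flux checkpoint layout, inference with such a prompt might look roughly like the sketch below. The repository id and the generation settings are assumptions; please refer to the FluxKits repository for the authoritative usage instructions.

```python
import torch
from diffusers import FluxPipeline

# Hypothetical repo id; check the FluxKits README for the official loading code.
pipe = FluxPipeline.from_pretrained("TencentARC/flux-mini", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = (
    "profile of sad Socrates, full body, high detail, dramatic scene, "
    "Epic dynamic action, wide angle, cinematic, hyper-realistic, concept art, "
    "warm muted tones as painted by Bernie Wrightson, Frank Frazetta."
)

image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,           # illustrative settings, not tuned values
    num_inference_steps=28,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux_mini_sample.png")
```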
Thank you for your attention! We will continue to improve our model and release new versions in the future.