---
license: agpl-3.0
---

# SD1 Style Components (experimental)

Style control for Stable Diffusion 1.x anime models

## What is this?

It is IP-Adapter, but for (anime) styles. Instead of CLIP image embeddings, image generation is conditioned on 30-dimensional style embeddings, which can be either extracted from one or more images or created manually.

## Why?

Currently, the main means of style control is through artist tags, a method that reasonably raises concerns of style plagiarism.

By breaking styles down into interpretable components that are present across all artists, direct copying of styles can be avoided.

Furthermore, new styles can easily be created by manipulating the magnitudes of the style components, offering more controllability than stacking artist tags or LoRAs (a small sketch of such manipulation follows below).
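
As a minimal sketch of what manipulating component magnitudes could look like (the vectors here are random placeholders standing in for embeddings extracted from real reference images, and the component indices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder 30-d style embeddings; in practice these would be
# extracted from reference images with the style embedding model.
style_a = rng.normal(size=30)
style_b = rng.normal(size=30)

# Blend two styles by averaging their components.
blended = 0.7 * style_a + 0.3 * style_b

# Create a new style by editing individual component magnitudes.
custom = style_a.copy()
custom[2] *= 2.0   # exaggerate one component
custom[7] = 0.0    # switch another off entirely
```

Any of the resulting vectors can then be used as the style condition for generation.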

Additionally, this can potentially be useful for general-purpose training, as training with a style condition may reduce style leakage into concepts.

This also serves as a demonstration that image models can be conditioned on arbitrary tensors other than text or images.

Hopefully, more people will come to understand that it is not necessary to force conditions that are inherently numerical (aesthetic scores, dates, ...) into text-form tags.

## How do I use it?

Currently, a [Colab notebook](https://colab.research.google.com/drive/1AKXiHHBAnzbtKyToN6WdzOov-niJudcL?usp=sharing) with a Gradio interface is available.

As this is only an experimental preview, proper support for popular web UIs will not be added before the models reach a more stable state.

## Technical details

First, a style embedding model is created by Supervised Contrastive Learning on an [artists dataset](https://huggingface.co/datasets/gustproof/artists/blob/main/artists.zip).
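
A minimal sketch of the supervised contrastive objective (SupCon, Khosla et al. 2020) with artist identity as the label; the embedding width and batch below are toy placeholders, not the actual training configuration:

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss: embeddings of images by the same
    artist are pulled together, different artists pushed apart."""
    sim = features @ features.T / temperature                     # (N, N)
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float("-inf"))               # exclude self-pairs
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask  # same-artist pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over each anchor's positives.
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()

# Toy batch: 8 images by 4 artists, 128-d L2-normalized embeddings.
feats = F.normalize(torch.randn(8, 128), dim=1)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supcon_loss(feats, labels))
```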

Then, the first 30 principal components of the learned embeddings are extracted via PCA. Finally, a modified IP-Adapter is trained on anime-final-pruned using the same dataset with WD1.4 tags and the projected 30-d embeddings. The training resolution is 576×576 with variable aspect ratios.
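
A sketch of these two steps under stated assumptions: the PCA step is shown with scikit-learn on placeholder encoder outputs, and the projection module follows IP-Adapter's `ImageProjModel` design (linear projection, reshape into tokens, LayerNorm) with the CLIP image embedding replaced by the 30-d style vector. The encoder width, token count, and SD1.x cross-attention width of 768 are assumptions, not values confirmed by this repo.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

# Step 1: reduce the learned style embeddings to their first 30
# principal components. Random data stands in for encoder outputs.
embeddings = np.random.randn(10_000, 128).astype(np.float32)
pca = PCA(n_components=30).fit(embeddings)
style_30d = pca.transform(embeddings)                # (10000, 30) conditions

# Step 2: project a 30-d style embedding to extra cross-attention
# tokens, mirroring IP-Adapter's ImageProjModel (token count assumed).
class StyleProjModel(nn.Module):
    def __init__(self, style_dim=30, cross_attention_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(style_dim, cross_attention_dim * num_tokens)
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, style_embeds):                 # (B, 30)
        tokens = self.proj(style_embeds)
        tokens = tokens.reshape(-1, self.num_tokens, tokens.shape[-1] // self.num_tokens)
        return self.norm(tokens)                     # (B, num_tokens, 768)

tokens = StyleProjModel()(torch.from_numpy(style_30d[:2].astype(np.float32)))
print(tokens.shape)                                  # torch.Size([2, 4, 768])
```

As in IP-Adapter, the projected tokens would then be consumed by decoupled cross-attention layers alongside the text tokens.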

## Acknowledgements

This is largely inspired by [Inserting Anybody in Diffusion Models via Celeb Basis](http://arxiv.org/abs/2306.00926) and [IP-Adapter](https://github.com/tencent-ailab/IP-Adapter). Training and inference code is modified from IP-Adapter ([license](https://github.com/tencent-ailab/IP-Adapter/blob/main/LICENSE)).