---
pipeline_tag: text-to-image
license: other
license_name: stable-cascade-nc-community
license_link: LICENSE
---

# SoteDiffusion Cascade

Anime finetune of Stable Cascade.  
Currently in a very early stage of training.  
No commercial use thanks to StabilityAI.  

<style>
.image {
    float: left;
    margin-left: 10px;
}
</style>

<table>
<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/VmLquGa3lkMXBTZ-j1QCj.png" width="320">
<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/QgBBkTJ3XeUf6bc_NiJ_r.png" width="320">
</table>

## Code Example

```shell
pip install diffusers transformers accelerate
```

```python
import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

prompt = "newest, 1girl, solo, cat ears, looking at viewer, blush, light smile,"
negative_prompt = "very displeasing, worst quality, monochrome, sketch, fat, child,"

prior = StableCascadePriorPipeline.from_pretrained("Disty0/sote-diffusion-cascade_alpha0", torch_dtype=torch.float16)
decoder = StableCascadeDecoderPipeline.from_pretrained("Disty0/sote-diffusion-cascade-decoder_alpha0", torch_dtype=torch.float16)

prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=7.0,
    num_images_per_prompt=1,
    num_inference_steps=40
)

decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=1.5,
    output_type="pil",
    num_inference_steps=10
).images[0]
decoder_output.save("cascade.png")
```


## Training Status:

**Alpha0 Release**: This release resets the training and enables Text Encoder training.  


**GPU used for training**: 1x AMD RX 7900 XTX 24GB  

| dataset name | training done | remaining |
|---|---|---|
| **newest** | 000 | 230 |
| **recent** | 000 | 206 |
| **mid** | 000 | 201 |
| **early** | 000 | 055 |
| **oldest** | 000 | 016 |
| **pixiv** | 000 | 074 |
| **visual novel cg** | 000 | 070 |
| **anime wallpaper** | 000 | 013 |
| **Total** | 8 | 865 |

**Note**: Chunk numbering starts at 0 and each chunk contains 8000 images.  


## Dataset:

**GPU used for captioning**: 1x Intel ARC A770 16GB  
**Model used for captioning**: SmilingWolf/wd-swinv2-tagger-v3  
**Command:**  
```
python /mnt/DataSSD/AI/Apps/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py --model_dir "/mnt/DataSSD/AI/models/wd14_tagger_model" --repo_id "SmilingWolf/wd-swinv2-tagger-v3" --recursive --remove_underscore --use_rating_tags --character_tags_first --character_tag_expand --append_tags --onnx --caption_separator ", " --general_threshold 0.35 --character_threshold 0.50 --batch_size 4 --caption_extension ".txt" ./
```


| dataset name | total images | total chunks |
|---|---|---|
| **newest** | 1.843.053 | 231 |
| **recent** | 1.652.420 | 207 |
| **mid** | 1.609.608 | 202 |
| **early** | 442.368 | 056 |
| **oldest** | 128.311 | 017 |
| **pixiv** | 594.046 | 075 |
| **visual novel cg** | 560.903 | 071 |
| **anime wallpaper** | 106.882 | 014 |
| **Total** | 6.937.591 | 873 |

**Note**: The smallest image size is 1280x600 (768.000 pixels).
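
The chunk counts follow from the image totals and the 8000 images per chunk mentioned in the training status note. A minimal sketch of that arithmetic, assuming each dataset is cut into full chunks plus one final partial chunk; the helper name `chunk_count` is hypothetical and not part of the training scripts:

```python
IMAGES_PER_CHUNK = 8000  # from the training status note

# Image totals copied from the dataset table.
image_totals = {
    "newest": 1_843_053,
    "recent": 1_652_420,
    "mid": 1_609_608,
    "early": 442_368,
    "oldest": 128_311,
    "pixiv": 594_046,
    "visual novel cg": 560_903,
    "anime wallpaper": 106_882,
}


def chunk_count(total_images: int, chunk_size: int = IMAGES_PER_CHUNK) -> int:
    """Number of chunks needed, counting a final partial chunk (assumed)."""
    return -(-total_images // chunk_size)  # integer ceiling division


chunks = {name: chunk_count(total) for name, total in image_totals.items()}
total_chunks = sum(chunks.values())

for name, count in chunks.items():
    print(f"{name}: {count}")
print(f"Total: {total_chunks}")  # 873, matching the Total row
```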


## Tags:

```
aesthetic tags, quality tags, date tags, custom tags, rating tags, character tags, rest of the tags
```

### Date:
| tag | date |
|---|---|
| **newest** | 2022 to 2024 |
| **recent** | 2019 to 2021 |
| **mid** | 2015 to 2018 |
| **early** | 2011 to 2014 |
| **oldest** | 2005 to 2010 |
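
When writing prompts, the date table can be turned into a small lookup. This is only a sketch: the ranges are copied from the table and treated as inclusive, and the function name `date_tag` is hypothetical.

```python
from typing import Optional

# (tag, first year, last year), copied from the date table; both ends inclusive.
DATE_TAGS = [
    ("newest", 2022, 2024),
    ("recent", 2019, 2021),
    ("mid", 2015, 2018),
    ("early", 2011, 2014),
    ("oldest", 2005, 2010),
]


def date_tag(year: int) -> Optional[str]:
    """Return the date tag for a year, or None if the table does not cover it."""
    for tag, first, last in DATE_TAGS:
        if first <= year <= last:
            return tag
    return None


print(date_tag(2020))  # recent
```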

### Aesthetic Tags:

**Model used**: shadowlilac/aesthetic-shadow-v2

| score greater than | tag |
|---|---|
| **0.90** | extremely aesthetic |
| **0.80** | very aesthetic |
| **0.70** | aesthetic |
| **0.50** | slightly aesthetic |
| **0.40** | not displeasing |
| **0.30** | not aesthetic |
| **0.20** | slightly displeasing |
| **0.10** | displeasing |
| **rest of them** | very displeasing |

### Quality Tags:

**Model used**: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth


| score greater than | tag |
|---|---|
| **0.980** | best quality |
| **0.900** | high quality |
| **0.750** | great quality |
| **0.500** | medium quality |
| **0.250** | normal quality |
| **0.125** | bad quality |
| **0.025** | low quality |
| **rest of them** | worst quality |
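
Both score tables read as a descending list of cutoffs: the first cutoff that the score exceeds decides the tag. A minimal sketch of that lookup, assuming a strict greater-than comparison as the table header suggests. `score_to_tag` and the constant names are hypothetical and not taken from the actual tagging scripts, and computing the scores with the listed models is not shown.

```python
# (cutoff, tag) pairs copied from the tables, highest cutoff first.
AESTHETIC_TAGS = [
    (0.90, "extremely aesthetic"),
    (0.80, "very aesthetic"),
    (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"),
    (0.40, "not displeasing"),
    (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"),
    (0.10, "displeasing"),
]
AESTHETIC_FALLBACK = "very displeasing"

QUALITY_TAGS = [
    (0.980, "best quality"),
    (0.900, "high quality"),
    (0.750, "great quality"),
    (0.500, "medium quality"),
    (0.250, "normal quality"),
    (0.125, "bad quality"),
    (0.025, "low quality"),
]
QUALITY_FALLBACK = "worst quality"


def score_to_tag(score: float, thresholds: list, fallback: str) -> str:
    """Return the tag of the first cutoff the score is strictly greater than."""
    for cutoff, tag in thresholds:
        if score > cutoff:
            return tag
    return fallback


print(score_to_tag(0.85, AESTHETIC_TAGS, AESTHETIC_FALLBACK))  # very aesthetic
print(score_to_tag(0.99, QUALITY_TAGS, QUALITY_FALLBACK))      # best quality
```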

## Rating Tags
- general
- sensitive
- questionable
- explicit

## Custom Tags:

| dataset name | custom tag |
|---|---|
| **booru** | date, |
| **pixiv** | art by Display_Name, |
| **visual novel cg** | Full_VN_Name (short_3_letter_name), visual novel cg, |
| **anime wallpaper** | date, anime wallpaper, |

## Training Params:

**Software used**: Kohya SD-Scripts with Stable Cascade branch  
**Base model**: Disty0/sote-diffusion-cascade_pre-alpha0  

### Command:
```
accelerate launch  --mixed_precision fp16 --num_cpu_threads_per_process 1 stable_cascade_train_stage_c.py \
--mixed_precision fp16 \
--save_precision fp16 \
--full_fp16 \
--sdpa \
--gradient_checkpointing \
--train_text_encoder \
--resolution "1024,1024" \
--train_batch_size 2 \
--adaptive_loss_weight \
--learning_rate 4e-6 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--optimizer_type adafactor \
--optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
--max_grad_norm 0 \
--token_warmup_min 1 \
--token_warmup_step 0 \
--shuffle_caption \
--caption_dropout_rate 0 \
--caption_tag_dropout_rate 0 \
--caption_dropout_every_n_epochs 0 \
--dataset_repeats 1 \
--save_state \
--save_every_n_steps 2048 \
--sample_every_n_steps 512 \
--max_token_length 225 \
--max_train_epochs 1 \
--caption_extension ".txt" \
--max_data_loader_n_workers 2 \
--persistent_data_loader_workers \
--enable_bucket \
--min_bucket_reso 256 \
--max_bucket_reso 4096 \
--bucket_reso_steps 64 \
--bucket_no_upscale \
--log_with tensorboard \
--output_name sotediffusion-sc_3b \
--train_data_dir /mnt/DataSSD/AI/anime_image_dataset/combined/combined-0000 \
--in_json /mnt/DataSSD/AI/anime_image_dataset/combined/combined-0000.json \
--output_dir /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-0 \
--logging_dir /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-0/logs \
--resume /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-step00020480-state \
--stage_c_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-step00020480.safetensors \
--text_model_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-step00020480_text_model.safetensors \
--effnet_checkpoint_path /mnt/DataSSD/AI/models/sd-cascade/effnet_encoder.safetensors \
--previewer_checkpoint_path /mnt/DataSSD/AI/models/sd-cascade/previewer.safetensors \
--sample_prompts /mnt/DataSSD/AI/SoteDiffusion/StableCascade/config/sotediffusion-prompt.txt
```


## Limitations and Bias

### Bias

- This model is intended for anime illustrations.  
  Realistic capabilities are not tested at all.  
- Still underbaked.  

### Limitations
- Can fall back to a realistic style.  
  Add the "realistic" tag to the negative prompt when this happens.  
- Eyes in far shots are still bad due to the heavy latent compression.