	---
	pipeline_tag: text-to-image
	license: other
	license_name: stable-cascade-nc-community
	license_link: LICENSE
	---

	# SoteDiffusion Cascade

	Anime finetune of Stable Cascade.
	It is currently at a very early stage of training.
	Commercial use is not permitted because of the StabilityAI non-commercial license.

	<style>
	.image {
	float: left;
	margin-left: 10px;
	}
	</style>

	<table>
	<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/VmLquGa3lkMXBTZ-j1QCj.png" width="320">
	<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/QgBBkTJ3XeUf6bc_NiJ_r.png" width="320">
	</table>

	## Code Example

	```shell
	pip install diffusers transformers accelerate
	```

	```python
	import torch
	from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

	prompt = "newest, 1girl, solo, cat ears, looking at viewer, blush, light smile,"
	negative_prompt = "very displeasing, worst quality, monochrome, sketch, fat, child,"

	prior = StableCascadePriorPipeline.from_pretrained("Disty0/sote-diffusion-cascade_alpha0", torch_dtype=torch.float16)
	decoder = StableCascadeDecoderPipeline.from_pretrained("Disty0/sote-diffusion-cascade-decoder_alpha0", torch_dtype=torch.float16)

	# Stage C (prior): turn the text prompt into image embeddings.
	prior.enable_model_cpu_offload()
	prior_output = prior(
	    prompt=prompt,
	    height=1024,
	    width=1024,
	    negative_prompt=negative_prompt,
	    guidance_scale=7.0,
	    num_images_per_prompt=1,
	    num_inference_steps=40,
	)

	# Stage B + A (decoder): turn the image embeddings into the final image.
	decoder.enable_model_cpu_offload()
	decoder_output = decoder(
	    image_embeddings=prior_output.image_embeddings,
	    prompt=prompt,
	    negative_prompt=negative_prompt,
	    guidance_scale=1.5,
	    output_type="pil",
	    num_inference_steps=10,
	).images[0]
	decoder_output.save("cascade.png")
	```


	## Training Status:

	Alpha0 Release: This release resets the training and enables Text Encoder training.


	GPU used for training: 1x AMD RX 7900 XTX 24GB

	| dataset name | training done | remaining |
	|---|---|---|
	| newest | 000 | 230 |
	| recent | 000 | 206 |
	| mid | 000 | 201 |
	| early | 000 | 055 |
	| oldest | 000 | 016 |
	| pixiv | 000 | 074 |
	| visual novel cg | 000 | 070 |
	| anime wallpaper | 000 | 013 |
	| Total | 8 | 865 |

	Note: chunk numbering starts from 0 and there are 8000 images per chunk.


	## Dataset:

	GPU used for captioning: 1x Intel ARC A770 16GB
	Model used for captioning: SmilingWolf/wd-swinv2-tagger-v3
	Command:
	```
	python /mnt/DataSSD/AI/Apps/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py --model_dir "/mnt/DataSSD/AI/models/wd14_tagger_model" --repo_id "SmilingWolf/wd-swinv2-tagger-v3" --recursive --remove_underscore --use_rating_tags --character_tags_first --character_tag_expand --append_tags --onnx --caption_separator ", " --general_threshold 0.35 --character_threshold 0.50 --batch_size 4 --caption_extension ".txt" ./
	```


	| dataset name | total images | total chunks |
	|---|---|---|
	| newest | 1.843.053 | 231 |
	| recent | 1.652.420 | 207 |
	| mid | 1.609.608 | 202 |
	| early | 442.368 | 056 |
	| oldest | 128.311 | 017 |
	| pixiv | 594.046 | 075 |
	| visual novel cg | 560.903 | 071 |
	| anime wallpaper | 106.882 | 014 |
	| Total | 6.937.591 | 873 |

	Note: The smallest image size is 1280x600 (768.000 pixels).
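
The chunk counts in the table follow from the image counts and the 8000 images per chunk noted under Training Status. A minimal sketch of that arithmetic, assuming the last chunk of each dataset is simply left partially filled (`chunk_count` is a hypothetical helper, not part of the dataset tooling):

```python
import math

IMAGES_PER_CHUNK = 8000  # from the note under Training Status


def chunk_count(total_images):
    """Number of chunks needed, counting a partially filled last chunk."""
    return math.ceil(total_images / IMAGES_PER_CHUNK)


print(chunk_count(1_652_420))  # recent -> 207
print(chunk_count(128_311))    # oldest -> 17
```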


	## Tags:

	```
	aesthetic tags, quality tags, date tags, custom tags, rating tags, character tags, rest of the tags
	```
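
The ordering above can be sketched as a small helper that assembles a prompt. This is only an illustration: `build_prompt` and its parameter names are hypothetical and not part of the training scripts. It just joins the tag groups in the documented order and skips any that are empty.

```python
def build_prompt(aesthetic="", quality="", date="", custom=(),
                 rating="", characters=(), rest=()):
    """Join tag groups in the documented order, skipping empty ones."""
    groups = [aesthetic, quality, date, *custom, rating, *characters, *rest]
    return ", ".join(tag for tag in groups if tag)


prompt = build_prompt(
    aesthetic="very aesthetic",
    quality="best quality",
    date="newest",
    rating="general",
    rest=["1girl", "solo", "cat ears"],
)
print(prompt)
# very aesthetic, best quality, newest, general, 1girl, solo, cat ears
```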

	### Date:
	| tag | date |
	|---|---|
	| newest | 2022 to 2024 |
	| recent | 2019 to 2021 |
	| mid | 2015 to 2018 |
	| early | 2011 to 2014 |
	| oldest | 2005 to 2010 |
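
As a rough sketch, the date table can be read as a lookup from upload year to tag. The helper name `date_tag` is hypothetical, and returning `None` for years outside the listed ranges is an assumption, since the table does not say how such images are handled.

```python
# (first year, last year, tag), copied from the table above.
DATE_TAGS = [
    (2022, 2024, "newest"),
    (2019, 2021, "recent"),
    (2015, 2018, "mid"),
    (2011, 2014, "early"),
    (2005, 2010, "oldest"),
]


def date_tag(year):
    """Return the date tag for a year, or None if no range matches."""
    for first, last, tag in DATE_TAGS:
        if first <= year <= last:
            return tag
    return None  # assumption: years outside the table get no date tag


print(date_tag(2020))  # recent
```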

	### Aesthetic Tags:

	Model used: shadowlilac/aesthetic-shadow-v2

	| score greater than | tag |
	|---|---|
	| 0.90 | extremely aesthetic |
	| 0.80 | very aesthetic |
	| 0.70 | aesthetic |
	| 0.50 | slightly aesthetic |
	| 0.40 | not displeasing |
	| 0.30 | not aesthetic |
	| 0.20 | slightly displeasing |
	| 0.10 | displeasing |
	| rest of them | very displeasing |

	### Quality Tags:

	Model used: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth


	| score greater than | tag |
	|---|---|
	| 0.980 | best quality |
	| 0.900 | high quality |
	| 0.750 | great quality |
	| 0.500 | medium quality |
	| 0.250 | normal quality |
	| 0.125 | bad quality |
	| 0.025 | low quality |
	| rest of them | worst quality |
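
Both score tables (aesthetic and quality) work the same way: the first threshold the score exceeds decides the tag. A minimal sketch of that mapping follows. The helper `score_to_tag` is hypothetical, the strict `>` comparison is an assumption based on the "score greater than" column header, and the scoring models themselves are not run here.

```python
# Thresholds copied from the two tables above, highest first.
AESTHETIC_TAGS = [
    (0.90, "extremely aesthetic"),
    (0.80, "very aesthetic"),
    (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"),
    (0.40, "not displeasing"),
    (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"),
    (0.10, "displeasing"),
]
QUALITY_TAGS = [
    (0.980, "best quality"),
    (0.900, "high quality"),
    (0.750, "great quality"),
    (0.500, "medium quality"),
    (0.250, "normal quality"),
    (0.125, "bad quality"),
    (0.025, "low quality"),
]


def score_to_tag(score, thresholds, fallback):
    """Return the tag of the first threshold the score is greater than."""
    for threshold, tag in thresholds:
        if score > threshold:
            return tag
    return fallback  # the "rest of them" row


print(score_to_tag(0.85, AESTHETIC_TAGS, "very displeasing"))  # very aesthetic
print(score_to_tag(0.60, QUALITY_TAGS, "worst quality"))       # medium quality
```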

	### Rating Tags:
	- general
	- sensitive
	- questionable
	- explicit

	### Custom Tags:

	| dataset name | custom tag |
	|---|---|
	| booru | date, |
	| pixiv | art by Display_Name, |
	| visual novel cg | Full_VN_Name (short_3_letter_name), visual novel cg, |
	| anime wallpaper | date, anime wallpaper, |

	## Training Params:

	Software used: Kohya SD-Scripts with Stable Cascade branch
	Base model: Disty0/sote-diffusion-cascade_pre-alpha0

	### Command:
	```
	accelerate launch --mixed_precision fp16 --num_cpu_threads_per_process 1 stable_cascade_train_stage_c.py \
	--mixed_precision fp16 \
	--save_precision fp16 \
	--full_fp16 \
	--sdpa \
	--gradient_checkpointing \
	--train_text_encoder \
	--resolution "1024,1024" \
	--train_batch_size 2 \
	--adaptive_loss_weight \
	--learning_rate 4e-6 \
	--lr_scheduler constant_with_warmup \
	--lr_warmup_steps 100 \
	--optimizer_type adafactor \
	--optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
	--max_grad_norm 0 \
	--token_warmup_min 1 \
	--token_warmup_step 0 \
	--shuffle_caption \
	--caption_dropout_rate 0 \
	--caption_tag_dropout_rate 0 \
	--caption_dropout_every_n_epochs 0 \
	--dataset_repeats 1 \
	--save_state \
	--save_every_n_steps 2048 \
	--sample_every_n_steps 512 \
	--max_token_length 225 \
	--max_train_epochs 1 \
	--caption_extension ".txt" \
	--max_data_loader_n_workers 2 \
	--persistent_data_loader_workers \
	--enable_bucket \
	--min_bucket_reso 256 \
	--max_bucket_reso 4096 \
	--bucket_reso_steps 64 \
	--bucket_no_upscale \
	--log_with tensorboard \
	--output_name sotediffusion-sc_3b \
	--train_data_dir /mnt/DataSSD/AI/anime_image_dataset/combined/combined-0000 \
	--in_json /mnt/DataSSD/AI/anime_image_dataset/combined/combined-0000.json \
	--output_dir /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-0 \
	--logging_dir /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-0/logs \
	--resume /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-step00020480-state \
	--stage_c_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-step00020480.safetensors \
	--text_model_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-step00020480_text_model.safetensors \
	--effnet_checkpoint_path /mnt/DataSSD/AI/models/sd-cascade/effnet_encoder.safetensors \
	--previewer_checkpoint_path /mnt/DataSSD/AI/models/sd-cascade/previewer.safetensors \
	--sample_prompts /mnt/DataSSD/AI/SoteDiffusion/StableCascade/config/sotediffusion-prompt.txt
	```


	## Limitations and Bias

	### Bias

	- This model is intended for anime illustrations. Realistic capabilities have not been tested at all.
	- The model is still underbaked (undertrained).

	### Limitations
	- Can fall back to a realistic style. Add the "realistic" tag to the negative prompt when this happens.
	- Eyes in far shots are still bad because of the heavy latent compression.
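
The workaround for the realistic fallback amounts to appending one tag to the negative prompt. A small sketch, assuming comma-separated tags as in the Code Example (`add_negative_tag` is a hypothetical helper, not part of diffusers):

```python
def add_negative_tag(negative_prompt, tag="realistic"):
    """Append a tag to a comma-separated negative prompt if it is missing."""
    tags = [t.strip() for t in negative_prompt.split(",") if t.strip()]
    if tag not in tags:
        tags.append(tag)
    return ", ".join(tags)


negative_prompt = "very displeasing, worst quality, monochrome, sketch, fat, child,"
print(add_negative_tag(negative_prompt))
# very displeasing, worst quality, monochrome, sketch, fat, child, realistic
```

The result can be passed as `negative_prompt` to both the prior and the decoder pipelines.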