
Stable Diffusion XL

Stable Diffusion XL was proposed in SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.

The abstract from the paper is:

We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL includes a three times larger UNet backbone: the increase in model parameters comes from using more attention blocks and a larger cross-attention context for SDXL's second text encoder. We design multiple novel conditioning schemes for multiple aspect ratios. We also introduce a refinement model that improves the visual quality of samples generated by SDXL via a post-hoc image-to-image technique. SDXL shows improved performance over previous versions of Stable Diffusion and achieves results competitive with black-box state-of-the-art image generators.

Tips

  • Stable Diffusion XL works especially well with images between 768 and 1024.
  • Stable Diffusion XL can pass a different prompt to each of the text encoders it was trained on, as shown below. We can even pass different parts of the same prompt to the text encoders.
  • Stable Diffusion XL output images can be improved by making use of a refiner, as shown below.

Available checkpoints:

Usage example

SDXL์„ ์‚ฌ์šฉํ•˜๊ธฐ ์ „์— transformers, accelerate, safetensors ์™€ invisible_watermark๋ฅผ ์„ค์น˜ํ•˜์„ธ์š”. ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์„ค์น˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

pip install transformers
pip install accelerate
pip install safetensors
pip install invisible-watermark>=0.2.0

Watermarker

We recommend adding an invisible watermark to images generated with Stable Diffusion XL, which can help downstream applications identify whether an image was machine-generated. To do so, please install the invisible_watermark library:

pip install invisible-watermark>=0.2.0

Once the invisible-watermark library is installed, the watermarker will be used by default.

์ƒ์„ฑ ๋˜๋Š” ์•ˆ์ „ํ•˜๊ฒŒ ์ด๋ฏธ์ง€๋ฅผ ๋ฐฐํฌํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค๋ฅธ ๊ทœ์ •์ด ์žˆ๋‹ค๋ฉด, ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์›Œํ„ฐ๋งˆ์ปค๋ฅผ ๋น„ํ™œ์„ฑํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

pipe = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False)

Text-to-Image

You can use SDXL for text-to-image as follows:

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt).images[0]

Image-to-image

You can use SDXL for image-to-image as follows:

import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe = pipe.to("cuda")
url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png"

init_image = load_image(url).convert("RGB")
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, image=init_image).images[0]

Inpainting

You can use SDXL for inpainting as follows:

import torch
from diffusers import StableDiffusionXLInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")

prompt = "A majestic tiger sitting on a bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80).images[0]

Refining the image output

In addition to the base model checkpoint, Stable Diffusion XL includes a refiner checkpoint that is specialized in denoising low-noise-stage images to generate images of improved high-frequency quality. This refiner checkpoint can be used as a "second stage" pipeline after running the base checkpoint to improve image quality.

When using the refiner, you can easily:

  • 1.) use the base and refiner models as an ensemble of expert denoisers, as first proposed in eDiff-I, or
  • 2.) simply run the refiner in SDEdit fashion after the base model.

Note: The idea of using the SD-XL base & refiner as an ensemble of experts was first proposed by community contributors, which also helped shape the following diffusers implementation.

1.) Ensemble of Expert Denoisers

When using the base and refiner models as an ensemble of expert denoisers, the base model should serve as the expert for the high-noise diffusion stage and the refiner as the expert for the low-noise diffusion stage.

The advantage of 1.) over 2.) is that it requires fewer denoising steps overall and is therefore much faster. The disadvantage is that the output of the base model cannot be inspected; it will still be heavily noised.

To use the base and refiner models as an ensemble of expert denoisers, you need to define the span of timesteps over which each model should run its denoising steps: high-noise (i.e. the base model) and low-noise (i.e. the refiner model). The interval is set with the base model's denoising_end and the refiner model's denoising_start.

Both denoising_end and denoising_start should be passed as a float between 0 and 1. When passed, the end and start of denoising are defined as proportions of the discrete timesteps defined by the model's scheduler. Note that this value overrides strength if strength is also declared, since the number of denoising steps is then determined by the discrete timesteps the model was trained on and the declared fractional cutoff.
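
As an illustrative sketch (not the actual diffusers internals), here is how a fractional cutoff such as `denoising_end=0.8` splits the work, assuming a scheduler with 1000 training timesteps (the value used by the SDXL models):

```python
# Hypothetical illustration of how a fractional cutoff divides the denoising
# work between base and refiner. This mirrors the idea described above, not
# the exact diffusers implementation.
num_train_timesteps = 1000   # discrete timesteps the model was trained on
num_inference_steps = 40     # total inference steps for the run
denoising_end = 0.8          # base model covers the first 80% of denoising

# Inference steps executed by the base model vs. the refiner:
base_steps = round(num_inference_steps * denoising_end)
refiner_steps = num_inference_steps - base_steps

# In timestep terms, the base model denoises from t=999 down to about
# t = (1 - denoising_end) * num_train_timesteps, where the refiner takes over.
handover_timestep = round((1 - denoising_end) * num_train_timesteps)

print(base_steps, refiner_steps, handover_timestep)  # 32 8 200
```

This is why setting the same fraction as denoising_end on the base pipeline and denoising_start on the refiner makes the two runs hand over seamlessly.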

Let's look at an example. First, import the two pipelines. Since the text encoders and variational autoencoder are the same, they don't need to be loaded again for the refiner.

from diffusers import DiffusionPipeline
import torch

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
base.to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")

Now we define the number of inference steps and the point at which the run switches out of the high-noise denoising stage (i.e. the base model).

n_steps = 40
high_noise_frac = 0.8

The Stable Diffusion XL base model was trained on timesteps 0-999, and the Stable Diffusion XL refiner was finetuned from the base model on the low-noise timesteps 0-199 inclusive, so we use the base model for the first 800 timesteps (high noise) and the refiner for the last 200 timesteps (low noise). Therefore high_noise_frac is set to 0.8, so that all steps 200-999 (the first 80% of the denoising timesteps) are performed by the base model and steps 0-199 (the last 20% of the denoising timesteps) are performed by the refiner model.

Remember, the denoising procedure starts at high-value (high-noise) timesteps and ends at low-value (low-noise) timesteps.

Let's now run the two pipelines. Set denoising_end and denoising_start to the same value and keep num_inference_steps constant. Also remember that the output of the base model should be in latent space:

prompt = "A majestic lion jumping from a big stone at night"

image = base(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_end=high_noise_frac,
    output_type="latent",
).images
image = refiner(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_start=high_noise_frac,
    image=image,
).images[0]

Let's have a look at the images.

(Image comparison: original base-model image vs. ensemble-of-denoisers result)

๋™์ผํ•œ 40 ๋‹จ๊ณ„์—์„œ base ๋ชจ๋ธ์„ ์‹คํ–‰ํ•œ๋‹ค๋ฉด, ์ด๋ฏธ์ง€์˜ ๋””ํ…Œ์ผ(์˜ˆ: ์‚ฌ์ž์˜ ๋ˆˆ๊ณผ ์ฝ”)์ด ๋–จ์–ด์กŒ์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค:

The ensemble approach works well on all available schedulers!

2.) Refining the image output from a fully denoised base image

In standard [StableDiffusionImg2ImgPipeline] fashion, the fully denoised image generated by the base model can be further improved using the refiner checkpoint.

For this, simply run the refiner as an image-to-image pipeline after the usual "base" text-to-image pipeline. You can keep the base model's output in latent space.

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=pipe.text_encoder_2,
    vae=pipe.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

image = pipe(prompt=prompt, output_type="latent").images[0]
image = refiner(prompt=prompt, image=image[None, :]).images[0]
(Image comparison: original image vs. refined image)

The refiner also works well in an inpainting setting; set one up using the [StableDiffusionXLInpaintPipeline] class, as shown below.

To use the refiner for inpainting in the ensemble-of-denoisers setting, do the following:

from diffusers import StableDiffusionXLInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=pipe.text_encoder_2,
    vae=pipe.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")

prompt = "A majestic tiger sitting on a bench"
num_inference_steps = 75
high_noise_frac = 0.7

image = pipe(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=num_inference_steps,
    denoising_end=high_noise_frac,
    output_type="latent",
).images
image = refiner(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    num_inference_steps=num_inference_steps,
    denoising_start=high_noise_frac,
).images[0]

To use the refiner for inpainting in the standard SDEdit setting, remove denoising_end and denoising_start and choose a smaller number of inference steps for the refiner.
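
A minimal sketch of that SDEdit-style variant, reusing `pipe`, `refiner`, `prompt`, `init_image`, and `mask_image` from the snippet above (the step counts are illustrative assumptions, not recommended values):

```python
# SDEdit-style refinement: let the base inpainting pipeline denoise fully,
# then pass the finished image through the refiner with a smaller number of
# inference steps. Neither call sets denoising_end or denoising_start.
image = pipe(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=75,
).images[0]
image = refiner(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    num_inference_steps=30,  # assumed: fewer steps for the refinement pass
).images[0]
```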

๋‹จ๋… ์ฒดํฌํฌ์ธํŠธ ํŒŒ์ผ / ์›๋ž˜์˜ ํŒŒ์ผ ํ˜•์‹์œผ๋กœ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

You can load the original-format files into diffusers format using [~diffusers.loaders.FromSingleFileMixin.from_single_file]:

from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch

pipe = StableDiffusionXLPipeline.from_single_file(
    "./sd_xl_base_1.0.safetensors", torch_dtype=torch.float16
)
pipe.to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_single_file(
    "./sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16
)
refiner.to("cuda")

๋ชจ๋ธ offloading์„ ํ†ตํ•ด ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”ํ•˜๊ธฐ

If you run into out-of-memory errors, we recommend using [StableDiffusionXLPipeline.enable_model_cpu_offload]:

- pipe.to("cuda")
+ pipe.enable_model_cpu_offload()

and

- refiner.to("cuda")
+ refiner.enable_model_cpu_offload()

Speeding up inference with torch.compile

You can speed up inference by using torch.compile, which yields a speed-up of approximately 20%.

+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)

Running with torch < 2.0

Note: If you want to run Stable Diffusion XL with a torch version below 2.0, please use xformers attention:

pip install xformers
+ pipe.enable_xformers_memory_efficient_attention()
+ refiner.enable_xformers_memory_efficient_attention()

StableDiffusionXLPipeline

[[autodoc]] StableDiffusionXLPipeline
    - all
    - __call__

StableDiffusionXLImg2ImgPipeline

[[autodoc]] StableDiffusionXLImg2ImgPipeline
    - all
    - __call__

StableDiffusionXLInpaintPipeline

[[autodoc]] StableDiffusionXLInpaintPipeline
    - all
    - __call__

Passing a different prompt to each text encoder

Stable Diffusion XL was trained on two text encoders. The default behavior is to pass the same prompt to each encoder, but some users have noted that quality can be improved by passing a different prompt to each text encoder. To do so, pass prompt_2 and negative_prompt_2 in addition to prompt and negative_prompt. By doing so, the original prompt (prompt) and negative prompt (negative_prompt) are passed to text_encoder (OpenAI CLIP-ViT/L-14 in official SDXL 0.9/1.0), and prompt_2 and negative_prompt_2 are passed to text_encoder_2 (OpenCLIP-ViT/bigG-14 in official SDXL 0.9/1.0).

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

# prompt is passed to OAI CLIP-ViT/L-14
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# prompt_2 is passed to OpenCLIP-ViT/bigG-14
prompt_2 = "monet painting"
image = pipe(prompt=prompt, prompt_2=prompt_2).images[0]