Accelerated PyTorch 2.0 support in Diffusers

Starting from version 0.13.0, Diffusers supports the latest optimizations from PyTorch 2.0. These include:

  1. Support for accelerated transformers using memory-efficient attention - no extra dependencies such as xformers are required
  2. Support for torch.compile, which compiles individual models for an additional performance boost

Installation

To use the accelerated attention implementation and torch.compile(), make sure that the latest version of PyTorch 2.0 is installed from pip and that diffusers is at version 0.13.0 or later. As explained below, when PyTorch 2.0 is available, diffusers uses the optimized attention processor (AttnProcessor2_0).

pip install --upgrade torch diffusers
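
The version requirements above can be checked from Python before running a pipeline. The following is a minimal sketch, not part of Diffusers: the helper names `major_minor`, `meets_minimum`, and `check_environment` are hypothetical, and the comparison only looks at the major and minor version components.

```python
import re
from importlib import metadata


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as '2.0.1+cu118' or '0.17.0.dev0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when `installed` is at least `minimum` (major.minor only)."""
    return major_minor(installed) >= major_minor(minimum)


def check_environment() -> dict[str, str]:
    """Report, per package, whether the installed version satisfies the minimum."""
    # Minimums taken from the text above: PyTorch 2.0 and diffusers 0.13.0.
    minimums = {"torch": "2.0", "diffusers": "0.13"}
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"needs >= {minimum}"
        report[package] = f"{installed} ({status})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

Because the check reads package metadata instead of importing torch, it also works in environments where the packages are missing.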

Using accelerated transformers and torch.compile

  1. Accelerated transformers implementation

    PyTorch 2.0 includes an optimized, memory-efficient attention implementation through the torch.nn.functional.scaled_dot_product_attention function, which automatically enables several optimizations depending on the inputs and the GPU type. It is similar to memory_efficient_attention from xFormers, but it is built natively into PyTorch.

    These optimizations are enabled by default in Diffusers when PyTorch 2.0 is installed and torch.nn.functional.scaled_dot_product_attention is available. To use them, install torch 2.0 and use the pipeline as usual. For example:

    import torch
    from diffusers import DiffusionPipeline
    
    pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    
    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt).images[0]
    

    이λ₯Ό λͺ…μ‹œμ μœΌλ‘œ ν™œμ„±ν™”ν•˜λ €λ©΄(ν•„μˆ˜λŠ” μ•„λ‹˜) μ•„λž˜μ™€ 같이 μˆ˜ν–‰ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

    import torch
    from diffusers import DiffusionPipeline
    from diffusers.models.attention_processor import AttnProcessor2_0
    
    pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
    # Explicitly select the PyTorch 2.0 attention processor (already the default when available).
    pipe.unet.set_attn_processor(AttnProcessor2_0())
    
    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt).images[0]
    

    이 μ‹€ν–‰ 과정은 xFormers만큼 λΉ λ₯΄κ³  λ©”λͺ¨λ¦¬μ μœΌλ‘œ νš¨μœ¨μ μ΄μ–΄μ•Ό ν•©λ‹ˆλ‹€. μžμ„Έν•œ λ‚΄μš©μ€ λ²€μΉ˜λ§ˆν¬μ—μ„œ ν™•μΈν•˜μ„Έμš”.

    If you want to make the pipeline more deterministic, or need to convert a fine-tuned model to another format such as Core ML, you can revert to the vanilla attention processor (AttnProcessor). To use the vanilla attention processor, call the [~diffusers.UNet2DConditionModel.set_default_attn_processor] function:

    import torch
    from diffusers import DiffusionPipeline
    
    pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
    pipe.unet.set_default_attn_processor()
    
    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt).images[0]
    
  2. torch.compile

    좔가적인 속도 ν–₯상을 μœ„ν•΄ μƒˆλ‘œμš΄ torch.compile κΈ°λŠ₯을 μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€. νŒŒμ΄ν”„λΌμΈμ˜ UNet은 일반적으둜 계산 λΉ„μš©μ΄ κ°€μž₯ 크기 λ•Œλ¬Έμ— λ‚˜λ¨Έμ§€ ν•˜μœ„ λͺ¨λΈ(ν…μŠ€νŠΈ 인코더와 VAE)은 κ·ΈλŒ€λ‘œ 두고 unet을 torch.compile둜 λž˜ν•‘ν•©λ‹ˆλ‹€. μžμ„Έν•œ λ‚΄μš©κ³Ό λ‹€λ₯Έ μ˜΅μ…˜μ€ torch 컴파일 λ¬Έμ„œλ₯Ό μ°Έμ‘°ν•˜μ„Έμš”.

    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images
    

    Depending on the GPU type, compile() can yield an _additional speed-up_ of **5% - 300%** on top of the accelerated transformer optimizations. Note, however, that compilation tends to bring larger performance gains on more recent GPU architectures such as Ampere (A100, 3090), Ada (4090), and Hopper (H100).

    Compilation takes some time to complete, so it is best suited to situations where you prepare the pipeline once and then run the same type of inference many times. Calling the compiled pipeline on a different image size triggers compilation again, which can be expensive.
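
To see the one-time compilation cost described above, it helps to time the first call separately from later calls. The following is a minimal sketch using only the standard library; `time_calls` and `make_fake_pipeline` are hypothetical helpers, and the fake pipeline merely simulates a slow first call with `time.sleep` so that the example runs without a GPU.

```python
import time
from statistics import mean
from typing import Callable


def time_calls(fn: Callable[[], object], repeats: int = 3) -> dict[str, float]:
    """Time `fn` once as a warmup call, then `repeats` more times.

    With a compiled model, the first call includes compilation, so it is
    reported separately from the steady-state average.
    """
    start = time.perf_counter()
    fn()
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        later.append(time.perf_counter() - start)
    return {"first_call_s": first, "steady_state_s": mean(later)}


def make_fake_pipeline(compile_cost_s: float = 0.2, step_cost_s: float = 0.01):
    """Stand-in for a compiled pipeline: slow on the first call only."""
    state = {"compiled": False}

    def run():
        if not state["compiled"]:
            time.sleep(compile_cost_s)  # simulated one-time compilation
            state["compiled"] = True
        time.sleep(step_cost_s)  # simulated inference

    return run


if __name__ == "__main__":
    print(time_calls(make_fake_pipeline()))
```

With a real pipeline you would pass something like `lambda: pipe(prompt)` as `fn`. On a GPU it is usually also necessary to call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels run asynchronously.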

벀치마크

We ran a comprehensive benchmark of PyTorch 2.0's efficient attention implementation and torch.compile across different GPUs and batch sizes for five of the most used pipelines. We used diffusers 0.17.0.dev0, which ensures that torch.compile() is leveraged optimally.

Benchmarking code

Stable Diffusion text-to-image

from diffusers import DiffusionPipeline
import torch

path = "runwayml/stable-diffusion-v1-5"

run_compile = True  # Set True / False

pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)

if run_compile:
    print("Run torch compile")
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

prompt = "ghibli style, a fantasy landscape with castles"

for _ in range(3):
    images = pipe(prompt=prompt).images

Stable Diffusion image-to-image

from diffusers import StableDiffusionImg2ImgPipeline
import requests
import torch
from PIL import Image
from io import BytesIO

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((512, 512))

path = "runwayml/stable-diffusion-v1-5"

run_compile = True  # Set True / False

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)

if run_compile:
    print("Run torch compile")
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

prompt = "ghibli style, a fantasy landscape with castles"

for _ in range(3):
    image = pipe(prompt=prompt, image=init_image).images[0]

Stable Diffusion - inpainting

from diffusers import StableDiffusionInpaintPipeline
import requests
import torch
from PIL import Image
from io import BytesIO

def download_image(url):
    response = requests.get(url)
    return Image.open(BytesIO(response.content)).convert("RGB")


img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

path = "runwayml/stable-diffusion-inpainting"

run_compile = True  # Set True / False

pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)

if run_compile:
    print("Run torch compile")
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

prompt = "ghibli style, a fantasy landscape with castles"

for _ in range(3):
    image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]

ControlNet

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import requests
import torch
from PIL import Image
from io import BytesIO

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((512, 512))

path = "runwayml/stable-diffusion-v1-5"

run_compile = True  # Set True / False
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    path, controlnet=controlnet, torch_dtype=torch.float16
)

pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)
pipe.controlnet.to(memory_format=torch.channels_last)

if run_compile:
    print("Run torch compile")
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True)

prompt = "ghibli style, a fantasy landscape with castles"

for _ in range(3):
    image = pipe(prompt=prompt, image=init_image).images[0]

IF text-to-image + upscaling

from diffusers import DiffusionPipeline
import torch

run_compile = True  # Set True / False

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16)
pipe.to("cuda")
pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16)
pipe_2.to("cuda")
pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16)
pipe_3.to("cuda")


pipe.unet.to(memory_format=torch.channels_last)
pipe_2.unet.to(memory_format=torch.channels_last)
pipe_3.unet.to(memory_format=torch.channels_last)

if run_compile:
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    pipe_2.unet = torch.compile(pipe_2.unet, mode="reduce-overhead", fullgraph=True)
    pipe_3.unet = torch.compile(pipe_3.unet, mode="reduce-overhead", fullgraph=True)

prompt = "the blue hulk"

prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
neg_prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)

for _ in range(3):
    image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
    image_2 = pipe_2(image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
    image_3 = pipe_3(prompt=prompt, image=image, noise_level=100).images

To give an overview of the speed-ups that can be obtained with PyTorch 2.0 and torch.compile(), the following chart shows the relative speed-up for the Stable Diffusion text-to-image pipeline across five different GPU families (with a batch size of 4):

t2i_speedup

To give you an even better idea of how this speed-up holds for the other pipelines presented above, consider the following plot that shows the benchmarking numbers from an A100 across three different batch sizes (with PyTorch 2.0 nightly and torch.compile()):

a100_numbers

(The benchmark metric in the charts above is the **number of iterations per second (iterations/second)**)
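
An iterations/second figure can be turned into an approximate time per pipeline call. The following is a rough sketch with a hypothetical helper `seconds_per_run`. It assumes that one iteration corresponds to one denoising step, that 50 steps are used, and it ignores text encoding and VAE decoding. The two sample values are the A100, batch size 1, text-to-image figures for torch 2.0 without and with compilation.

```python
def seconds_per_run(iterations_per_second: float, num_inference_steps: int = 50) -> float:
    """Estimate the denoising time of one pipeline call from an it/s figure."""
    if iterations_per_second <= 0:
        raise ValueError("iterations_per_second must be positive")
    return num_inference_steps / iterations_per_second


if __name__ == "__main__":
    # 21.66 and 44.03 it/s: A100, batch size 1, SD txt2img, torch 2.0.
    for label, its in [("no compile", 21.66), ("compile", 44.03)]:
        print(f"{label}: {seconds_per_run(its):.2f} s for 50 steps")
```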

In the interest of transparency, however, we publish all of the benchmarking numbers!

The following tables report the results in terms of the number of iterations processed per second.

A100 (batch size: 1)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 21.66 | 23.13 | 44.03 | 49.74 |
| SD - img2img | 21.81 | 22.40 | 43.92 | 46.32 |
| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 |
| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 |
| IF | 20.21 / 13.84 / 24.00 | 20.12 / 13.70 / 24.03 | ❌ | 97.34 / 27.23 / 111.66 |

A100 (batch size: 4)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 11.6 | 13.12 | 14.62 | 17.27 |
| SD - img2img | 11.47 | 13.06 | 14.66 | 17.25 |
| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 |
| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 |
| IF | 25.02 | 18.04 | ❌ | 48.47 |

A100 (batch size: 16)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 3.04 | 3.6 | 3.83 | 4.68 |
| SD - img2img | 2.98 | 3.58 | 3.83 | 4.67 |
| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 |
| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 |
| IF | 8.78 | 9.82 | ❌ | 16.77 |

V100 (batch size: 1)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 18.99 | 19.14 | 20.95 | 22.17 |
| SD - img2img | 18.56 | 19.18 | 20.95 | 22.11 |
| SD - inpaint | 19.14 | 19.06 | 21.08 | 22.20 |
| SD - controlnet | 13.48 | 13.93 | 15.18 | 15.88 |
| IF | 20.01 / 9.08 / 23.34 | 19.79 / 8.98 / 24.10 | ❌ | 55.75 / 11.57 / 57.67 |

V100 (batch size: 4)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 5.96 | 5.89 | 6.83 | 6.86 |
| SD - img2img | 5.90 | 5.91 | 6.81 | 6.82 |
| SD - inpaint | 5.99 | 6.03 | 6.93 | 6.95 |
| SD - controlnet | 4.26 | 4.29 | 4.92 | 4.93 |
| IF | 15.41 | 14.76 | ❌ | 22.95 |

V100 (batch size: 16)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 1.66 | 1.66 | 1.92 | 1.90 |
| SD - img2img | 1.65 | 1.65 | 1.91 | 1.89 |
| SD - inpaint | 1.69 | 1.69 | 1.95 | 1.93 |
| SD - controlnet | 1.19 | 1.19 | OOM after warmup | 1.36 |
| IF | 5.43 | 5.29 | ❌ | 7.06 |

T4 (batch size: 1)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 6.9 | 6.95 | 7.3 | 7.56 |
| SD - img2img | 6.84 | 6.99 | 7.04 | 7.55 |
| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 |
| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 |
| IF | 17.42 / 2.47 / 18.52 | 16.96 / 2.45 / 18.69 | ❌ | 24.63 / 2.47 / 23.39 |

T4 (batch size: 4)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 1.79 | 1.79 | 2.03 | 1.99 |
| SD - img2img | 1.77 | 1.77 | 2.05 | 2.04 |
| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 |
| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 |
| IF | 5.79 | 5.61 | ❌ | 7.39 |

T4 (batch size: 16)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 2.34s | 2.30s | OOM after 2nd iteration | 1.99s |
| SD - img2img | 2.35s | 2.31s | OOM after warmup | 2.00s |
| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s |
| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup |
| IF * | 1.44 | 1.44 | ❌ | 1.94 |

RTX 3090 (batch size: 1)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 22.56 | 22.84 | 23.84 | 25.69 |
| SD - img2img | 22.25 | 22.61 | 24.1 | 25.83 |
| SD - inpaint | 22.22 | 22.54 | 24.26 | 26.02 |
| SD - controlnet | 16.03 | 16.33 | 17.38 | 18.56 |
| IF | 27.08 / 9.07 / 31.23 | 26.75 / 8.92 / 31.47 | ❌ | 68.08 / 11.16 / 65.29 |

RTX 3090 (batch size: 4)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 6.46 | 6.35 | 7.29 | 7.3 |
| SD - img2img | 6.33 | 6.27 | 7.31 | 7.26 |
| SD - inpaint | 6.47 | 6.4 | 7.44 | 7.39 |
| SD - controlnet | 4.59 | 4.54 | 5.27 | 5.26 |
| IF | 16.81 | 16.62 | ❌ | 21.57 |

RTX 3090 (batch size: 16)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 1.7 | 1.69 | 1.93 | 1.91 |
| SD - img2img | 1.68 | 1.67 | 1.93 | 1.9 |
| SD - inpaint | 1.72 | 1.71 | 1.97 | 1.94 |
| SD - controlnet | 1.23 | 1.22 | 1.4 | 1.38 |
| IF | 5.01 | 5.00 | ❌ | 6.33 |

RTX 4090 (batch size: 1)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 40.5 | 41.89 | 44.65 | 49.81 |
| SD - img2img | 40.39 | 41.95 | 44.46 | 49.8 |
| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 |
| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 |
| IF | 69.71 / 18.78 / 85.49 | 69.13 / 18.80 / 85.56 | ❌ | 124.60 / 26.37 / 138.79 |

RTX 4090 (batch size: 4)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 12.62 | 12.84 | 15.32 | 15.59 |
| SD - img2img | 12.61 | 12.79 | 15.35 | 15.66 |
| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 |
| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 |
| IF | 31.88 | 31.14 | ❌ | 43.92 |

RTX 4090 (batch size: 16)

| Pipeline | torch 2.0 - no compile | torch nightly - no compile | torch 2.0 - compile | torch nightly - compile |
|---|---|---|---|---|
| SD - txt2img | 3.17 | 3.2 | 3.84 | 3.85 |
| SD - img2img | 3.16 | 3.2 | 3.84 | 3.85 |
| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 |
| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 |
| IF | 9.26 | 9.2 | ❌ | 13.31 |
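
The tables report raw throughput, so the relative gain from compilation has to be derived. The following is a small sketch of that calculation. The numbers are copied from the "SD - txt2img" rows at batch size 1 (torch 2.0, without and with compilation); the helper `relative_speedup` and the dictionary layout are illustrative assumptions, not part of the benchmark code.

```python
# (no compile, compile) iterations/second for SD - txt2img, batch size 1, torch 2.0.
TXT2IMG_BS1 = {
    "A100": (21.66, 44.03),
    "V100": (18.99, 20.95),
    "T4": (6.9, 7.3),
    "RTX 3090": (22.56, 23.84),
    "RTX 4090": (40.5, 44.65),
}


def relative_speedup(baseline: float, optimized: float) -> float:
    """Return the fractional gain of `optimized` over `baseline` (0.10 means +10%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return optimized / baseline - 1.0


SPEEDUPS = {
    gpu: relative_speedup(no_compile, compiled)
    for gpu, (no_compile, compiled) in TXT2IMG_BS1.items()
}

if __name__ == "__main__":
    for gpu, gain in sorted(SPEEDUPS.items(), key=lambda item: item[1], reverse=True):
        print(f"{gpu}: +{gain * 100:.1f}%")
```

The same function can be applied to any other pair of columns, for example to compare torch 2.0 against torch nightly.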

Notes

  • Follow this PR for more details on the environment used for conducting the benchmarks.
  • For the IF pipeline and batch sizes > 1, we only used a batch size > 1 in the first IF pipeline for text-to-image generation and NOT for upscaling. This means the two upscaling pipelines received a batch size of 1.

Thanks to Horace He from the PyTorch team for their support in improving our support of torch.compile() in Diffusers.

  • 벀치마크 μˆ˜ν–‰μ— μ‚¬μš©λœ ν™˜κ²½μ— λŒ€ν•œ μžμ„Έν•œ λ‚΄μš©μ€ 이 PR을 μ°Έμ‘°ν•˜μ„Έμš”.
  • IF νŒŒμ΄ν”„λΌμΈμ™€ 배치 크기 > 1의 경우 첫 번째 IF νŒŒμ΄ν”„λΌμΈμ—μ„œ text-to-image 생성을 μœ„ν•œ 배치 크기 > 1만 μ‚¬μš©ν–ˆμœΌλ©° μ—…μŠ€μΌ€μΌλ§μ—λŠ” μ‚¬μš©ν•˜μ§€ μ•Šμ•˜μŠ΅λ‹ˆλ‹€. 즉, 두 개의 μ—…μŠ€μΌ€μΌλ§ νŒŒμ΄ν”„λΌμΈμ΄ 배치 크기 1μž„μ„ μ˜λ―Έν•©λ‹ˆλ‹€.

Diffusersμ—μ„œ torch.compile() 지원을 κ°œμ„ ν•˜λŠ” 데 도움을 μ€€ PyTorch νŒ€μ˜ Horace Heμ—κ²Œ κ°μ‚¬λ“œλ¦½λ‹ˆλ‹€.