
Memory and speed

We present some techniques and ideas for optimizing 🤗 Diffusers inference for memory or speed. As a general rule, we recommend using xFormers for memory-efficient attention, so please see the recommended installation instructions and install it.

We'll discuss how the following settings affect performance and memory.

| | Latency | Speedup |
|---|---|---|
| original (no extra settings) | 9.50s | x1 |
| cuDNN auto-tuner | 9.37s | x1.01 |
| fp16 | 3.61s | x2.63 |
| Channels Last memory format | 3.30s | x2.88 |
| traced UNet | 3.21s | x2.96 |
| memory-efficient attention | 2.63s | x3.61 |

These results were obtained on an NVIDIA TITAN RTX by generating a single 512x512 image from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps.
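
As a quick sanity check, the speedup column can be reproduced from the latency column by dividing the baseline latency by each measured latency. The sketch below only uses the numbers from the table above; the dictionary and function names are illustrative.

```python
# Latencies copied from the table above (seconds per 512x512 image).
latencies = {
    "original": 9.50,
    "cuDNN auto-tuner": 9.37,
    "fp16": 3.61,
    "channels last": 3.30,
    "traced UNet": 3.21,
    "memory-efficient attention": 2.63,
}

baseline = latencies["original"]


def speedup(latency_s: float) -> float:
    """Speedup relative to the baseline, rounded to two decimals like the table."""
    return round(baseline / latency_s, 2)


speedups = {name: speedup(t) for name, t in latencies.items()}
for name, factor in speedups.items():
    print(f"{name:28s} x{factor}")
```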

Enable the cuDNN auto-tuner

NVIDIA cuDNN supports many algorithms for computing a convolution. The autotuner runs a short benchmark and selects the kernel with the best performance on the given hardware for the given input size.

Because we are using convolutional networks (other types are currently not supported), we can enable the cuDNN autotuner before launching inference with the following setting:

import torch

torch.backends.cudnn.benchmark = True

Use tf32 instead of fp32 (on Ampere and later CUDA devices)

On Ampere and later CUDA devices, matrix multiplications and convolutions can use TensorFloat32 (TF32) mode for faster but slightly less accurate computations. By default, PyTorch enables TF32 mode for convolutions but not for matrix multiplications. Unless your network requires full float32 precision, we recommend enabling it for matrix multiplications as well. This can significantly speed up computation, typically with a negligible loss of numerical accuracy. You can read more about it in the PyTorch notes on TF32. All you need to do is add the following before inference:

import torch

torch.backends.cuda.matmul.allow_tf32 = True
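
To get a feel for the "slightly less accurate" part, the sketch below emulates the reduced input precision of TF32 in pure Python. TF32 keeps the 8 exponent bits of float32 but only 10 of its 23 mantissa bits. The helper name is hypothetical, and it truncates the extra bits for simplicity, which is an assumption: real hardware may round differently, and accumulation still happens in float32.

```python
import struct


def to_tf32_truncated(x: float) -> float:
    """Emulate TF32 input precision by zeroing the low 13 mantissa bits of a float32.

    Assumption: truncation is used here for simplicity; hardware rounding may differ.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, and the top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = 3.14159265
as_fp32 = struct.unpack("<f", struct.pack("<f", value))[0]
as_tf32 = to_tf32_truncated(value)
relative_error = abs(as_fp32 - as_tf32) / abs(as_fp32)

print(f"float32: {as_fp32!r}")
print(f"tf32   : {as_tf32!r}")
print(f"relative error: {relative_error:.2e} (bound: {2 ** -10:.2e})")
```

With 10 mantissa bits, the relative error of a single truncated input stays below 2^-10, which is why the loss is usually negligible for inference.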

Half-precision weights

To save more GPU memory and get more speed, you can load and run the model weights directly in half precision. This involves loading the float16 version of the weights, which is stored in a branch named fp16, and telling PyTorch to use the float16 type when loading them.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,  # load and run the weights as float16
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]

Using [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) in any pipeline is strongly discouraged: it can produce black images and is always slower than using pure float16 precision.
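
The memory side of the trade-off is simple arithmetic: a float16 value takes 2 bytes instead of 4. The sketch below uses Python's `struct` module to show both the halved storage and the coarser rounding. The parameter count is a hypothetical round number chosen for illustration, not the size of any particular model.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store a Python float in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


bytes_fp32 = struct.calcsize("<f")  # IEEE 754 single precision
bytes_fp16 = struct.calcsize("<e")  # IEEE 754 half precision

n_params = 1_000_000  # hypothetical parameter count, for illustration only
fp32_mb = n_params * bytes_fp32 / 1e6
fp16_mb = n_params * bytes_fp16 / 1e6

x = 0.1
err_fp32 = abs(round_trip(x, "<f") - x)
err_fp16 = abs(round_trip(x, "<e") - x)

print(f"fp32: {fp32_mb} MB, rounding error on 0.1 = {err_fp32:.2e}")
print(f"fp16: {fp16_mb} MB, rounding error on 0.1 = {err_fp16:.2e}")
```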

Sliced attention for additional memory savings

For additional memory savings, you can use a sliced version of attention that performs the computation in steps instead of all at once.

Attention slicing is useful even with a batch size of 1, as long as the model uses more than one attention head. When there is more than one attention head, the *QK^T* attention matrix can be computed sequentially for each head, which can save a significant amount of memory.

To perform the attention computation sequentially over each head, call [~StableDiffusionPipeline.enable_attention_slicing] on your pipeline before inference, like this:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",

    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
image = pipe(prompt).images[0]

There is a small performance penalty, with inference about 10% slower, but this method lets you run Stable Diffusion with as little as 3.2 GB of VRAM!
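
A back-of-the-envelope calculation shows why slicing over heads helps. The sketch below estimates the size of the *QK^T* score matrix when all heads are materialized at once versus one head at a time. The sequence length (a 64x64 latent with one token per position) and the head count of 8 are assumptions chosen for illustration; actual layers differ in resolution and head count.

```python
MIB = 1024 ** 2


def attention_scores_bytes(seq_len, heads, batch, bytes_per_element=2):
    """Bytes needed to hold the QK^T score matrices (fp16 by default)."""
    return batch * heads * seq_len * seq_len * bytes_per_element


seq_len = 64 * 64  # assumed: one token per position of a 64x64 latent
heads = 8          # assumed head count, for illustration only
batch = 1

all_heads = attention_scores_bytes(seq_len, heads, batch)
one_head = attention_scores_bytes(seq_len, 1, batch)

print(f"all heads at once: {all_heads / MIB:.0f} MiB")
print(f"one head at a time: {one_head / MIB:.0f} MiB")
```

Under these assumptions the peak size of the score matrix drops by a factor equal to the number of heads, at the cost of running the heads one after another.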

더 큰 배치λ₯Ό μœ„ν•œ sliced VAE λ””μ½”λ“œ

μ œν•œλœ VRAMμ—μ„œ λŒ€κ·œλͺ¨ 이미지 배치λ₯Ό λ””μ½”λ”©ν•˜κ±°λ‚˜ 32개 μ΄μƒμ˜ 이미지가 ν¬ν•¨λœ 배치λ₯Ό ν™œμ„±ν™”ν•˜κΈ° μœ„ν•΄, 배치의 latent 이미지λ₯Ό ν•œ λ²ˆμ— ν•˜λ‚˜μ”© λ””μ½”λ”©ν•˜λŠ” 슬라이슀 VAE λ””μ½”λ“œλ₯Ό μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

이λ₯Ό [~StableDiffusionPipeline.enable_attention_slicing] λ˜λŠ” [~StableDiffusionPipeline.enable_xformers_memory_efficient_attention]κ³Ό κ²°ν•©ν•˜μ—¬ λ©”λͺ¨λ¦¬ μ‚¬μš©μ„ μΆ”κ°€λ‘œ μ΅œμ†Œν™”ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

VAE λ””μ½”λ“œλ₯Ό ν•œ λ²ˆμ— ν•˜λ‚˜μ”© μˆ˜ν–‰ν•˜λ €λ©΄ μΆ”λ‘  전에 νŒŒμ΄ν”„λΌμΈμ—μ„œ [~StableDiffusionPipeline.enable_vae_slicing]을 ν˜ΈμΆœν•©λ‹ˆλ‹€. 예λ₯Ό λ“€μ–΄:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",

    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_vae_slicing()
images = pipe([prompt] * 32).images

닀쀑 이미지 λ°°μΉ˜μ—μ„œ VAE λ””μ½”λ“œκ°€ μ•½κ°„μ˜ μ„±λŠ₯ ν–₯상이 μ΄λ£¨μ–΄μ§‘λ‹ˆλ‹€. 단일 이미지 λ°°μΉ˜μ—μ„œλŠ” μ„±λŠ₯ 영ν–₯은 μ—†μŠ΅λ‹ˆλ‹€.

Offloading to CPU with accelerate for memory savings

For additional memory savings, you can offload the weights to the CPU and load them onto the GPU only when performing the forward pass.

To perform CPU offloading, all you have to do is call [~StableDiffusionPipeline.enable_sequential_cpu_offload]:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",

    torch_dtype=torch.float16,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
image = pipe(prompt).images[0]

그러면 λ©”λͺ¨λ¦¬ μ†ŒλΉ„λ₯Ό 3GB 미만으둜 쀄일 수 μžˆμŠ΅λ‹ˆλ‹€.

참고둜 이 방법은 전체 λͺ¨λΈμ΄ μ•„λ‹Œ μ„œλΈŒλͺ¨λ“ˆ μˆ˜μ€€μ—μ„œ μž‘λ™ν•©λ‹ˆλ‹€. μ΄λŠ” λ©”λͺ¨λ¦¬ μ†ŒλΉ„λ₯Ό μ΅œμ†Œν™”ν•˜λŠ” κ°€μž₯ 쒋은 λ°©λ²•μ΄μ§€λ§Œ ν”„λ‘œμ„ΈμŠ€μ˜ 반볡적 νŠΉμ„±μœΌλ‘œ 인해 μΆ”λ‘  속도가 훨씬 λŠλ¦½λ‹ˆλ‹€. νŒŒμ΄ν”„λΌμΈμ˜ UNet ꡬ성 μš”μ†ŒλŠ” μ—¬λŸ¬ 번 μ‹€ν–‰λ©λ‹ˆλ‹€('num_inference_steps' 만큼). 맀번 UNet의 μ„œλ‘œ λ‹€λ₯Έ μ„œλΈŒλͺ¨λ“ˆμ΄ 순차적으둜 μ˜¨λ‘œλ“œλœ λ‹€μŒ ν•„μš”μ— 따라 μ˜€ν”„λ‘œλ“œλ˜λ―€λ‘œ λ©”λͺ¨λ¦¬ 이동 νšŸμˆ˜κ°€ λ§ŽμŠ΅λ‹ˆλ‹€.

또 λ‹€λ₯Έ μ΅œμ ν™” 방법인 λͺ¨λΈ μ˜€ν”„λ‘œλ”©μ„ μ‚¬μš©ν•˜λŠ” 것을 κ³ λ €ν•˜μ‹­μ‹œμ˜€. μ΄λŠ” 훨씬 λΉ λ₯΄μ§€λ§Œ λ©”λͺ¨λ¦¬ μ ˆμ•½μ΄ ν¬μ§€λŠ” μ•ŠμŠ΅λ‹ˆλ‹€.

λ˜ν•œ ttention slicingκ³Ό μ—°κ²°ν•΄μ„œ μ΅œμ†Œ λ©”λͺ¨λ¦¬(< 2GB)λ‘œλ„ λ™μž‘ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",

    torch_dtype=torch.float16,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing(1)

image = pipe(prompt).images[0]

Note: When using 'enable_sequential_cpu_offload()', it is important not to move the pipeline to CUDA beforehand; otherwise the gain in memory consumption will be minimal. See this issue for more information.

Model offloading for fast inference and memory savings

Sequential CPU offloading, as described in the previous section, saves a lot of memory but makes inference slower, because submodules are moved to the GPU as needed and immediately returned to the CPU when a new module runs.

Full-model offloading is an alternative that moves whole models to the GPU instead of handling each model's constituent _modules_. Its impact on inference time is negligible (compared with moving the pipeline to 'cuda'), while it still provides some memory savings.

In this scenario, only one of the main components of the pipeline (typically the text encoder, UNet, and VAE) is on the GPU at a time, while the others wait on the CPU. Components such as the UNet that run for multiple iterations stay on the GPU until they are no longer needed.

This feature can be enabled by calling enable_model_cpu_offload() on the pipeline, as shown below.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_model_cpu_offload()
image = pipe(prompt).images[0]
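
The behaviour described above can be mimicked with a small simulation: whichever component is called is moved to the GPU, and the previously resident one goes back to the CPU. This is only a toy model of the described behaviour, not the hook mechanism that accelerate actually uses; the `Component` and `OffloadManager` classes are hypothetical.

```python
class OffloadManager:
    """Keeps at most one component on the (simulated) GPU at a time."""

    def __init__(self):
        self.on_gpu = None
        self.transfers = 0

    def ensure_on_gpu(self, component):
        if self.on_gpu is component:
            return  # already resident, e.g. the UNet between denoising steps
        if self.on_gpu is not None:
            self.on_gpu.device = "cpu"  # offload the previous component
            self.transfers += 1
        component.device = "gpu"        # onload the requested component
        self.on_gpu = component
        self.transfers += 1


class Component:
    def __init__(self, name, manager):
        self.name = name
        self.device = "cpu"
        self.manager = manager

    def __call__(self):
        self.manager.ensure_on_gpu(self)
        return f"{self.name} ran on {self.device}"


manager = OffloadManager()
text_encoder = Component("text_encoder", manager)
unet = Component("unet", manager)
vae = Component("vae", manager)

text_encoder()
for _ in range(50):  # the UNet stays on the GPU across all denoising steps
    unet()
vae()

print("total transfers:", manager.transfers)
```

In this simulation the 50 UNet calls cost no extra transfers, which matches the claim that whole-model offloading has little effect on inference time.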

This is also compatible with attention slicing for additional memory savings.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing(1)

image = pipe(prompt).images[0]
이 κΈ°λŠ₯을 μ‚¬μš©ν•˜λ €λ©΄ 'accelerate' 버전 0.17.0 이상이 ν•„μš”ν•©λ‹ˆλ‹€.

Channels Last λ©”λͺ¨λ¦¬ ν˜•μ‹ μ‚¬μš©ν•˜κΈ°

Channels Last λ©”λͺ¨λ¦¬ ν˜•μ‹μ€ 차원 μˆœμ„œλ₯Ό λ³΄μ‘΄ν•˜λŠ” λ©”λͺ¨λ¦¬μ—μ„œ NCHW ν…μ„œ 배열을 λŒ€μ²΄ν•˜λŠ” λ°©λ²•μž…λ‹ˆλ‹€. Channels Last ν…μ„œλŠ” 채널이 κ°€μž₯ μ‘°λ°€ν•œ 차원이 λ˜λŠ” λ°©μ‹μœΌλ‘œ μ •λ ¬λ©λ‹ˆλ‹€(일λͺ… ν”½μ…€λ‹Ή 이미지λ₯Ό μ €μž₯). ν˜„μž¬ λͺ¨λ“  μ—°μ‚°μž Channels Last ν˜•μ‹μ„ μ§€μ›ν•˜λŠ” 것은 μ•„λ‹ˆλΌ μ„±λŠ₯이 μ €ν•˜λ  수 μžˆμœΌλ―€λ‘œ, μ‚¬μš©ν•΄λ³΄κ³  λͺ¨λΈμ— 잘 μž‘λ™ν•˜λŠ”μ§€ ν™•μΈν•˜λŠ” 것이 μ’‹μŠ΅λ‹ˆλ‹€.

예λ₯Ό λ“€μ–΄ νŒŒμ΄ν”„λΌμΈμ˜ UNet λͺ¨λΈμ΄ channels Last ν˜•μ‹μ„ μ‚¬μš©ν•˜λ„λ‘ μ„€μ •ν•˜λ €λ©΄ λ‹€μŒμ„ μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€:

print(pipe.unet.conv_out.state_dict()["weight"].stride())  # (2880, 9, 3, 1)
pipe.unet.to(memory_format=torch.channels_last)  # in-place operation
# Now (2880, 1, 960, 320): a stride of 1 for the 2nd dimension shows that the conversion worked.
print(pipe.unet.conv_out.state_dict()["weight"].stride())
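
The two stride tuples in the comments can be derived by hand. The sketch below computes the strides of a 4D tensor in the default contiguous layout and in the channels last layout. The weight shape (4, 320, 3, 3) is an assumption that is consistent with the strides printed above; the function names are illustrative.

```python
def contiguous_strides(shape):
    """Strides (in elements) of an NCHW tensor in the default layout."""
    n, c, h, w = shape
    return (c * h * w, h * w, w, 1)


def channels_last_strides(shape):
    """Strides of the same logical NCHW tensor stored as NHWC in memory."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


weight_shape = (4, 320, 3, 3)  # assumed shape, consistent with the strides above

print(contiguous_strides(weight_shape))
print(channels_last_strides(weight_shape))
```

In the channels last layout, stepping along the channel dimension moves by a single element, which is what "channels become the densest dimension" means.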

좔적(tracing)

좔적은 λͺ¨λΈμ„ 톡해 예제 μž…λ ₯ ν…μ„œλ₯Ό 톡해 μ‹€ν–‰λ˜λŠ”λ°, ν•΄λ‹Ή μž…λ ₯이 λͺ¨λΈμ˜ λ ˆμ΄μ–΄λ₯Ό 톡과할 λ•Œ ν˜ΈμΆœλ˜λŠ” μž‘μ—…μ„ μΊ‘μ²˜ν•˜μ—¬ μ‹€ν–‰ 파일 λ˜λŠ” 'ScriptFunction'이 λ°˜ν™˜λ˜λ„λ‘ ν•˜κ³ , μ΄λŠ” just-in-time 컴파일둜 μ΅œμ ν™”λ©λ‹ˆλ‹€.

To trace the UNet model, you can use the following:

import time
import torch
from diffusers import StableDiffusionPipeline
import functools

# disable torch gradients
torch.set_grad_enabled(False)

# settings
n_experiments = 2
unet_runs_per_experiment = 50


# generate example inputs for the UNet
def generate_inputs():
    sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16)
    timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999
    encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16)
    return sample, timestep, encoder_hidden_states


pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
unet = pipe.unet
unet.eval()
unet.to(memory_format=torch.channels_last)  # use the Channels Last memory format
unet.forward = functools.partial(unet.forward, return_dict=False)  # make return_dict=False the default

# warmup
for _ in range(3):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet(*inputs)

# trace
print("tracing..")
unet_traced = torch.jit.trace(unet, inputs)
unet_traced.eval()
print("done tracing")


# warmup and graph optimization
for _ in range(5):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet_traced(*inputs)


# benchmarking
with torch.inference_mode():
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet_traced(*inputs)
        torch.cuda.synchronize()
        print(f"unet traced inference took {time.time() - start_time:.2f} seconds")
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet(*inputs)
        torch.cuda.synchronize()
        print(f"unet inference took {time.time() - start_time:.2f} seconds")

# save the model
unet_traced.save("unet_traced.pt")

Then you can replace the pipeline's unet attribute with the traced model, as follows.

from diffusers import StableDiffusionPipeline
import torch
from dataclasses import dataclass


@dataclass
class UNet2DConditionOutput:
    sample: torch.Tensor


pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# use the jitted unet
unet_traced = torch.jit.load("unet_traced.pt")


# wrapper that stands in for pipe.unet and forwards calls to the traced model
class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.in_channels = pipe.unet.config.in_channels
        self.device = pipe.unet.device

    def forward(self, latent_model_input, t, encoder_hidden_states):
        sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)


pipe.unet = TracedUNet()

prompt = "a photo of an astronaut riding a horse on mars"

with torch.inference_mode():
    image = pipe([prompt] * 1, num_inference_steps=50).images[0]

Memory-efficient attention

Recent work on optimizing bandwidth in the attention block has produced large speedups and reductions in GPU memory usage. The most recent is Flash Attention from @tridao: code, paper.

Here are the speedups we obtained on a few Nvidia GPUs when running inference at 512x512 with a batch size of 1 (one prompt):

| GPU | Base Attention FP16 | Memory Efficient Attention FP16 |
|---|---|---|
| NVIDIA Tesla T4 | 3.5it/s | 5.5it/s |
| NVIDIA 3060 RTX | 4.6it/s | 7.8it/s |
| NVIDIA A10G | 8.88it/s | 15.6it/s |
| NVIDIA RTX A6000 | 11.7it/s | 21.09it/s |
| NVIDIA TITAN RTX | 12.51it/s | 18.22it/s |
| A100-SXM4-40GB | 18.6it/s | 29.it/s |
| A100-SXM-80GB | 18.7it/s | 29.5it/s |

To take advantage of it, make sure xFormers is installed and a CUDA device is available, then enable it on the pipeline:

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()

with torch.inference_mode():
    sample = pipe("a small cat")

# 선택: 이λ₯Ό λΉ„ν™œμ„±ν™” ν•˜κΈ° μœ„ν•΄ λ‹€μŒμ„ μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
# pipe.disable_xformers_memory_efficient_attention()
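
The core idea behind memory-efficient attention can be sketched without a GPU: process the keys and values in chunks while keeping a running maximum and a running softmax normalizer, so the full row of attention scores never has to be stored at once. The pure-Python sketch below handles a single query vector and omits the usual 1/sqrt(d) scaling for brevity. It is a simplified illustration of the chunked (online softmax) approach, not the xFormers or Flash Attention implementation, and the function names are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute all scores first, then softmax, then the weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def chunked_attention(q, keys, values, chunk_size=2):
    """Same result, but only chunk_size scores are held in memory at a time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.3, -0.4], [1.0, 1.0], [-0.5, 0.25], [0.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, 0.5]]

reference = naive_attention(q, keys, values)
chunked = chunked_attention(q, keys, values)
print(reference)
print(chunked)
```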