<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Memory and speed
This guide presents techniques and ideas for optimizing 🤗 Diffusers *inference* for memory or speed.
As a general rule, we recommend [xFormers](https://github.com/facebookresearch/xformers) for memory-efficient attention, so please follow the recommended [installation instructions](xformers).
The sections below explain how the following settings affect latency and memory.
| | Latency | Speed-up |
| ---------------- | ------- | ------- |
| original (no extra settings) | 9.50s | x1 |
| cuDNN auto-tuner | 9.37s | x1.01 |
| fp16 | 3.61s | x2.63 |
| Channels Last memory format | 3.30s | x2.88 |
| traced UNet | 3.21s | x2.96 |
| memory-efficient attention | 2.63s | x3.61 |
<em>
Measured on an NVIDIA TITAN RTX by generating a single 512x512 image from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps.
</em>
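The speed-up column is the baseline latency divided by the latency of each configuration. The short sketch below recomputes that column from the latencies in the table; the dictionary and the `speed_ups` helper are illustrative names, not part of Diffusers.

```python
# Latencies in seconds, copied from the table above.
latencies = {
    "original": 9.50,
    "cuDNN auto-tuner": 9.37,
    "fp16": 3.61,
    "Channels Last memory format": 3.30,
    "traced UNet": 3.21,
    "memory-efficient attention": 2.63,
}


def speed_ups(latencies, baseline="original"):
    """Return baseline latency / latency for every setting, rounded to two decimals."""
    base = latencies[baseline]
    return {name: round(base / seconds, 2) for name, seconds in latencies.items()}


for name, factor in speed_ups(latencies).items():
    print(f"{name}: x{factor}")
```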
## Enable the cuDNN auto-tuner
[NVIDIA cuDNN](https://developer.nvidia.com/cudnn) supports many algorithms for computing a convolution. The autotuner runs a short benchmark and selects the kernel with the best performance on the given hardware for a given input size.
Since we are using **convolutional networks** (other types are currently not supported), we can enable the cuDNN autotuner before launching inference by setting:
```python
import torch
torch.backends.cudnn.benchmark = True
```
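If you only want the autotuner active for one part of a program, one option is to wrap the flag in a context manager that restores the previous value afterwards. This is a minimal sketch; `cudnn_benchmark` is a hypothetical helper name, and only the `torch.backends.cudnn.benchmark` flag itself comes from PyTorch.

```python
import contextlib

import torch


@contextlib.contextmanager
def cudnn_benchmark(enabled=True):
    """Temporarily set torch.backends.cudnn.benchmark and restore the old value on exit."""
    previous = torch.backends.cudnn.benchmark
    torch.backends.cudnn.benchmark = enabled
    try:
        yield
    finally:
        torch.backends.cudnn.benchmark = previous


with cudnn_benchmark(True):
    # Run inference here. The tuning result is cached per input size,
    # so keeping input shapes fixed avoids repeated benchmarking.
    print(torch.backends.cudnn.benchmark)
```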
### Use tf32 instead of fp32 (on Ampere and later CUDA devices)
On Ampere and later CUDA devices, matrix multiplications and convolutions can use the TensorFloat32 (TF32) mode for faster but slightly less accurate computations.
By default, PyTorch enables TF32 mode for convolutions but not for matrix multiplications.
Unless a network requires full float32 precision, we recommend enabling this setting for matrix multiplications as well.
It can significantly speed up computations, typically with a negligible loss of numerical accuracy.
You can read more about it [here](https://huggingface.co/docs/transformers/v4.18.0/en/performance#tf32).
All you need to do is add the following before running inference:
```python
import torch
torch.backends.cuda.matmul.allow_tf32 = True
```
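Because TF32 only exists on Ampere and newer GPUs, a script that runs on mixed hardware may want to check the device first. The sketch below assumes that "Ampere and later" corresponds to CUDA compute capability 8.0 or higher; `enable_tf32_if_supported` is a hypothetical helper name.

```python
import torch


def enable_tf32_if_supported():
    """Enable TF32 on GPUs with compute capability >= 8.0; return whether it was enabled."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    if major < 8:
        return False
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True  # already the PyTorch default for convolutions
    return True


print("TF32 enabled:", enable_tf32_if_supported())
```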
## Half precision weights
To save more GPU memory and get more speed, you can load and run the model weights directly in half precision.
This involves loading the float16 version of the weights, which was saved to a branch named `fp16`, and telling PyTorch to use the `float16` type when loading them:
```Python
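import torch
from diffusers import StableDiffusionPipeline

# Load the weights directly as float16 instead of the default float32
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]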
```
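The memory saving comes from each parameter taking 2 bytes in float16 instead of 4 bytes in float32. The sketch below illustrates this on a small stand-in layer rather than a real pipeline; `parameter_bytes` is a hypothetical helper.

```python
import torch


def parameter_bytes(module):
    """Total number of bytes occupied by a module's parameters."""
    return sum(p.numel() * p.element_size() for p in module.parameters())


layer = torch.nn.Linear(1024, 1024)  # stand-in for a real model, float32 by default
fp32_bytes = parameter_bytes(layer)
fp16_bytes = parameter_bytes(layer.half())  # .half() converts the parameters in place

print(fp32_bytes, fp16_bytes)  # 4198400 2099200
```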
<Tip warning={true}>
We strongly discourage using [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) in any of the pipelines, as it can produce black images and is always slower than pure float16 precision.
</Tip>
## Sliced attention for additional memory savings
For additional memory savings, you can use a sliced version of attention that performs the computation in steps instead of all at once.
<Tip>
Attention slicing is useful even with a batch size of 1, as long as the model uses more than one attention head.
With more than one attention head, the *QK^T* attention matrix can be computed sequentially for each head, which can save a significant amount of memory.
</Tip>
To perform the attention computation sequentially over each head, call [`~StableDiffusionPipeline.enable_attention_slicing`] on your pipeline before inference:
```Python
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
image = pipe(prompt).images[0]
```
There is a small performance penalty of about 10% slower inference, but this method lets you run Stable Diffusion with as little as 3.2 GB of VRAM!
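To see why slicing saves memory, compare computing attention for all heads at once with computing it one head at a time. The sketch below is a simplified illustration in plain PyTorch, not the Diffusers implementation; `full_attention` and `sliced_attention` are hypothetical helpers, and the tensors are assumed to have shape `(heads, tokens, dim)`.

```python
import torch


def full_attention(q, k, v):
    """Materializes a (heads, tokens, tokens) score tensor in one go."""
    scores = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v


def sliced_attention(q, k, v, slice_size=1):
    """Processes `slice_size` heads at a time, so only a
    (slice_size, tokens, tokens) score tensor exists at any moment."""
    out = torch.empty_like(v)
    for start in range(0, q.shape[0], slice_size):
        end = start + slice_size
        scores = torch.softmax(
            q[start:end] @ k[start:end].transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1
        )
        out[start:end] = scores @ v[start:end]
    return out


torch.manual_seed(0)
q, k, v = (torch.randn(8, 64, 40) for _ in range(3))
matches = torch.allclose(full_attention(q, k, v), sliced_attention(q, k, v), atol=1e-5)
print(matches)
```

Both functions return the same result up to floating-point rounding; the sliced version trades a Python loop (and therefore some speed) for a smaller peak allocation.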
## Sliced VAE decode for larger batches
To decode large batches of images with limited VRAM, or to enable batches with 32 images or more, you can use sliced VAE decode, which decodes the batch latents one image at a time.
You will likely want to combine this with [`~StableDiffusionPipeline.enable_attention_slicing`] or [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further reduce memory use.
To perform the VAE decode one image at a time, call [`~StableDiffusionPipeline.enable_vae_slicing`] on your pipeline before inference. For example:
```Python
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_vae_slicing()
images = pipe([prompt] * 32).images
```
You may see a small performance boost in VAE decode on multi-image batches. There should be no performance impact on single-image batches.
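Conceptually, sliced decoding splits the latent batch along the batch dimension, decodes each latent separately, and concatenates the results. The sketch below shows that pattern with a toy decoder; `decode_in_slices` and `toy_decode` are hypothetical stand-ins, not the Diffusers VAE.

```python
import torch


def decode_in_slices(decode_fn, latents):
    """Decode a batch one latent at a time and concatenate along the batch axis."""
    return torch.cat([decode_fn(latent) for latent in latents.split(1)], dim=0)


def toy_decode(latents):
    # Hypothetical stand-in for a VAE decoder: keep 3 of the 4 channels and upsample 8x.
    return torch.nn.functional.interpolate(latents[:, :3], scale_factor=8, mode="nearest")


latents = torch.randn(4, 4, 8, 8)
images = decode_in_slices(toy_decode, latents)
print(images.shape)  # torch.Size([4, 3, 64, 64])
```

Only one decoded image's intermediate activations are alive at a time, which is what keeps the peak memory roughly independent of the batch size.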
<a name="sequential_offloading"></a>
## Offloading to CPU with accelerate for memory savings
For additional memory savings, you can offload the weights to CPU and load them to GPU only when performing the forward pass.
To perform CPU offloading, all you have to do is call [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]:
```Python
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
image = pipe(prompt).images[0]
```
This brings memory consumption below 3GB.
Note that this method works at the submodule level, not on whole models. It is the best way to minimize memory consumption, but inference is much slower because of the iterative nature of the process. The UNet component of the pipeline runs several times (as many as `num_inference_steps`). Each time, the different submodules of the UNet are sequentially onloaded and then offloaded as needed, so the number of memory transfers is large.
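The cost of submodule-level offloading can be made concrete by counting transfers. The sketch below imitates the idea with plain PyTorch forward hooks on a toy model; it is an illustration under simplifying assumptions, not how accelerate implements offloading, and `offload_submodules` and `TransferCounter` are hypothetical names.

```python
import torch


class TransferCounter:
    def __init__(self):
        self.onloads = 0


def offload_submodules(model, execution_device, counter):
    """Keep each direct child on CPU and move it to execution_device only while it runs."""

    def onload(module, args):
        module.to(execution_device)
        counter.onloads += 1
        return tuple(a.to(execution_device) for a in args)

    def offload(module, args, output):
        module.to("cpu")

    for child in model.children():
        child.to("cpu")
        child.register_forward_pre_hook(onload)
        child.register_forward_hook(offload)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 8))
counter = TransferCounter()
offload_submodules(model, device, counter)

num_inference_steps = 4
x = torch.randn(1, 8)
with torch.no_grad():
    for _ in range(num_inference_steps):
        x = model(x).cpu()

print(counter.onloads)  # 3 submodules x 4 steps = 12
```

Every step repeats every transfer, which is why the number of moves grows with both the number of submodules and `num_inference_steps`.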
<Tip>
Consider using <a href="#model_offloading">model offloading</a> as another optimization option. It is much faster, but the memory savings are not as large.
</Tip>
λ˜ν•œ ttention slicingκ³Ό μ—°κ²°ν•΄μ„œ μ΅œμ†Œ λ©”λͺ¨λ¦¬(< 2GB)λ‘œλ„ λ™μž‘ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
```Python
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing(1)
image = pipe(prompt).images[0]
```
**Note**: When using `enable_sequential_cpu_offload()`, it is important **not** to move the pipeline to CUDA beforehand. Otherwise the gain in memory consumption will be minimal. See [this issue](https://github.com/huggingface/diffusers/issues/1934) for more information.
<a name="model_offloading"></a>
## Model offloading for fast inference and memory savings
[Sequential CPU offloading](#sequential_offloading), as described in the previous section, preserves a lot of memory but makes inference slower, because submodules are moved to GPU as needed and immediately returned to CPU when a new module runs.
Full-model offloading is an alternative that moves whole models to the GPU instead of handling each model's constituent _modules_. The impact on inference time is negligible (compared with moving the pipeline to `cuda`), while it still provides some memory savings.
In this scenario, only one of the main components of the pipeline (typically the text encoder, unet and vae) is on the GPU at a time, while the others wait on the CPU.
Components like the UNet that run for multiple iterations stay on the GPU until they are no longer needed.
This feature can be enabled by calling `enable_model_cpu_offload()` on the pipeline, as shown below.
```Python
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_model_cpu_offload()
image = pipe(prompt).images[0]
```
μ΄λŠ” 좔가적인 λ©”λͺ¨λ¦¬ μ ˆμ•½μ„ μœ„ν•œ attention slicing과도 ν˜Έν™˜λ©λ‹ˆλ‹€.
```Python
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing(1)
image = pipe(prompt).images[0]
```
<Tip>
This feature requires `accelerate` version 0.17.0 or later.
</Tip>
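To check how much memory an offloading strategy actually saves on your hardware, you can record the peak GPU allocation around a call. This is a minimal sketch using PyTorch's CUDA memory statistics; `run_and_measure_peak_memory` is a hypothetical helper, and it only counts memory allocated through PyTorch.

```python
import torch


def run_and_measure_peak_memory(fn):
    """Call fn() and return (result, peak_bytes); peak_bytes is None without CUDA."""
    if not torch.cuda.is_available():
        return fn(), None
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    result = fn()
    torch.cuda.synchronize()
    return result, torch.cuda.max_memory_allocated()


# Placeholder workload; with a loaded pipeline you could pass
# `lambda: pipe(prompt).images[0]` instead.
result, peak = run_and_measure_peak_memory(lambda: sum(range(10)))
print(result, peak)
```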
## Using Channels Last memory format
Channels Last memory format is an alternative way of ordering NCHW tensors in memory while preserving the dimension ordering.
Channels Last tensors are laid out so that the channels become the densest dimension (that is, images are stored pixel by pixel).
Since not all operators currently support the Channels Last format, it may result in worse performance, so it is best to try it and check whether it works well for your model.
For example, to set the pipeline's UNet model to use the Channels Last format, you can use the following:
```python
# Assumes `torch` is imported and `pipe` is an already loaded StableDiffusionPipeline
print(pipe.unet.conv_out.state_dict()["weight"].stride())  # (2880, 9, 3, 1)
pipe.unet.to(memory_format=torch.channels_last)  # in-place operation
# (2880, 1, 960, 320): a stride of 1 for the 2nd dimension proves that the conversion worked
print(pipe.unet.conv_out.state_dict()["weight"].stride())
```
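The stride change can be reproduced without loading a pipeline. The sketch below assumes a weight of shape `(4, 320, 3, 3)`, which is consistent with the strides printed above. Note that `Tensor.to` returns a new tensor, unlike the in-place `Module.to` used on the UNet.

```python
import torch

weight = torch.empty(4, 320, 3, 3)  # assumed conv weight shape: (out, in, kH, kW)
print(weight.stride())  # (2880, 9, 3, 1)

channels_last = weight.to(memory_format=torch.channels_last)
print(channels_last.stride())  # (2880, 1, 960, 320)
print(channels_last.shape == weight.shape)  # True
```

The logical shape is unchanged; only the order in which elements are laid out in memory differs.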
## Tracing
Tracing runs an example input tensor through your model and captures the operations that are invoked as that input passes through the model's layers. The result is an executable or `ScriptFunction` that is optimized using just-in-time compilation.
To trace the UNet model, you can use the following:
```python
import time
import torch
from diffusers import StableDiffusionPipeline
import functools

# disable torch gradients
torch.set_grad_enabled(False)

# set variables
n_experiments = 2
unet_runs_per_experiment = 50


# generate example inputs
def generate_inputs():
    sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16)
    timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999
    encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16)
    return sample, timestep, encoder_hidden_states


pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
unet = pipe.unet
unet.eval()
unet.to(memory_format=torch.channels_last)  # use Channels Last memory format
unet.forward = functools.partial(unet.forward, return_dict=False)  # set return_dict=False as default

# warmup
for _ in range(3):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet(*inputs)

# trace
print("tracing..")
unet_traced = torch.jit.trace(unet, inputs)
unet_traced.eval()
print("done tracing")

# warmup and graph optimization
for _ in range(5):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet_traced(*inputs)

# benchmarking
with torch.inference_mode():
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet_traced(*inputs)
        torch.cuda.synchronize()
        print(f"unet traced inference took {time.time() - start_time:.2f} seconds")
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet(*inputs)
        torch.cuda.synchronize()
        print(f"unet inference took {time.time() - start_time:.2f} seconds")

# save the model
unet_traced.save("unet_traced.pt")
```
Then, you can replace the `unet` attribute of the pipeline with the traced model as follows.
```python
from diffusers import StableDiffusionPipeline
import torch
from dataclasses import dataclass


@dataclass
class UNet2DConditionOutput:
    sample: torch.Tensor


pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# use the jitted unet
unet_traced = torch.jit.load("unet_traced.pt")


# del pipe.unet
class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.in_channels = pipe.unet.config.in_channels
        self.device = pipe.unet.device

    def forward(self, latent_model_input, t, encoder_hidden_states):
        sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)


pipe.unet = TracedUNet()

prompt = "a photo of an astronaut riding a horse on mars"

with torch.inference_mode():
    image = pipe([prompt] * 1, num_inference_steps=50).images[0]
```
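The tracing workflow in this section needs a CUDA GPU and the full model weights. To try the mechanics of `torch.jit.trace` on its own, the same steps can be run on a toy module on CPU. `TinyNet` below is a hypothetical stand-in and has nothing to do with the real UNet architecture.

```python
import torch


class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, sample):
        return torch.relu(self.conv(sample))


torch.manual_seed(0)
model = TinyNet().eval()
example = torch.randn(1, 4, 16, 16)

with torch.no_grad():
    # Record the operations executed for this example input
    traced = torch.jit.trace(model, example)
    # The traced module should reproduce the eager module's output
    same = torch.allclose(traced(example), model(example))

print(same)
```

Tracing records only the operations executed for the example input, so data-dependent control flow is not captured. This is one reason to warm up and trace with inputs that match the shapes used at inference time.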
## Memory-efficient attention
Recent work on optimizing bandwidth in the attention block has produced large speed-ups and reductions in GPU memory usage.
The most recent is Flash Attention from @tridao: [code](https://github.com/HazyResearch/flash-attention), [paper](https://arxiv.org/pdf/2205.14135.pdf).
Here are the speed-ups obtained on a few NVIDIA GPUs when running inference at 512x512 with a batch size of 1 (one prompt):
| GPU | Base Attention FP16 | Memory Efficient Attention FP16 |
|------------------ |--------------------- |--------------------------------- |
| NVIDIA Tesla T4 | 3.5it/s | 5.5it/s |
| NVIDIA 3060 RTX | 4.6it/s | 7.8it/s |
| NVIDIA A10G | 8.88it/s | 15.6it/s |
| NVIDIA RTX A6000 | 11.7it/s | 21.09it/s |
| NVIDIA TITAN RTX | 12.51it/s | 18.22it/s |
| A100-SXM4-40GB | 18.6it/s | 29it/s |
| A100-SXM-80GB | 18.7it/s | 29.5it/s |
To take advantage of it, the following must be true:
- PyTorch > 1.12
- CUDA is available
- [The xformers library is installed](xformers)
```python
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()

with torch.inference_mode():
    sample = pipe("a small cat")

# optional: you can disable it with the following
# pipe.disable_xformers_memory_efficient_attention()
```
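The three requirements listed above can be checked programmatically before calling `enable_xformers_memory_efficient_attention()`. The sketch below uses hypothetical helpers and reads "PyTorch > 1.12" as version 1.13 or newer, which is an assumption about the intended meaning.

```python
import importlib.util


def parse_major_minor(version):
    """Turn a version string such as '2.1.0+cu118' into (2, 1)."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def can_use_xformers_attention(torch_version, cuda_available, xformers_installed):
    """Check the three requirements, reading 'PyTorch > 1.12' as 1.13 or newer."""
    return parse_major_minor(torch_version) > (1, 12) and cuda_available and xformers_installed


# find_spec returns None when the package cannot be imported
xformers_installed = importlib.util.find_spec("xformers") is not None
print(can_use_xformers_attention("2.1.0+cu118", cuda_available=True, xformers_installed=xformers_installed))
```

In a real script you would pass `torch.__version__` and `torch.cuda.is_available()` instead of the literal values used here.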