Huge memory consumption with SD3.5-medium
According to the chart in the announcement linked below, SD3.5-medium should run fine with 10 GB of VRAM:
https://stability.ai/news/introducing-stable-diffusion-3-5
However, my test program fails on an AWS g4dn.xlarge instance, which has 4 vCPUs, 16 GB of RAM plus 48 GB of swap, and a Tesla T4 GPU with 16 GB of VRAM. It runs out of memory because CUDA cannot allocate any more. According to nvidia-smi, the process had already taken ~15 GB, and it could not complete even one picture.
What could be wrong here?
The full source code is attached below.
import os
import json
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("./stable-diffusion-3.5-medium/")
if torch.cuda.is_available():
    print('use cuda')
    pipe = pipe.to("cuda")
elif torch.backends.mps.is_available():
    print('use mps')
    pipe = pipe.to('mps')
else:
    print('use cpu')

# data.json holds a list of rows with prompt, style, light, uuid and aspect_ratio
data = []
with open('data.json', 'r') as f:
    data = json.load(f)

os.makedirs('output', exist_ok=True)
for row in data:
    prompt = '%s, style is %s, light is %s' % (row['prompt'], row['style'], row['light'])
    filename = 'output/%s.png' % (row['uuid'])
    # default: square image
    height = 1280
    width = 1280
    if row['aspect_ratio'] == '16:9':
        # landscape
        width = 1280
        height = 720
    elif row['aspect_ratio'] == '9:16':
        # portrait
        width = 720
        height = 1280
    print('saving', filename)
    image = pipe(prompt, height=height, width=width).images[0]
    image.save(filename)
Did this get resolved for you?
@yue32000
@oddball516
The cause is the T5 text encoder. You can resolve it with:
pipe.enable_model_cpu_offload()
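A minimal sketch of how that call might fit into the loading step of the script above. Two things here are assumptions rather than anything confirmed in this thread: loading in `torch.float16` (chosen because the T4 is an older GPU), and the helper name `load_offloaded_pipeline` with its `pipeline_cls` and `dtype` parameters, which are hypothetical and only there to make the sketch easy to try out. `enable_model_cpu_offload()` also needs the `accelerate` package installed.

```python
def load_offloaded_pipeline(model_path, pipeline_cls=None, dtype=None):
    """Load a diffusers pipeline in half precision with model CPU offload.

    `pipeline_cls` and `dtype` are hypothetical hooks for this sketch;
    by default it uses DiffusionPipeline and torch.float16.
    """
    if pipeline_cls is None:
        # Imported lazily so the sketch can be imported without diffusers.
        from diffusers import DiffusionPipeline
        pipeline_cls = DiffusionPipeline
    if dtype is None:
        import torch
        dtype = torch.float16  # assumption: half precision to save memory

    pipe = pipeline_cls.from_pretrained(model_path, torch_dtype=dtype)
    # No pipe.to("cuda") here: with offloading enabled, each sub-model is
    # moved to the GPU only while it runs and back to the CPU afterwards.
    pipe.enable_model_cpu_offload()
    return pipe


if __name__ == "__main__":
    pipe = load_offloaded_pipeline("./stable-diffusion-3.5-medium/")
    image = pipe("a lighthouse at dusk", height=1024, width=1024).images[0]
    image.save("test.png")
```

The main difference from the original script is that the `pipe.to("cuda")` branch is replaced by the offload call, so the whole pipeline never has to sit in VRAM at once.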
@YaTharThShaRma999 Do you know how enable_model_cpu_offload() works? Are you saying the T5 model will be offloaded to non-gpu memory?
@oddball516 Yeah, kind of. When the T5 encoder is needed, it is moved to the GPU for faster computation. Once it has finished computing (1-2 s), it is moved back to the CPU.
It's very big, in fact bigger than the actual image generation model (4B vs. 2B parameters), but it is only used once per image and it is fast.
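To put rough numbers on this, here is a back-of-the-envelope estimate. It uses the round parameter counts quoted in this reply (4B and 2B, not measured values), counts weights only, and ignores activations, the VAE and the CLIP encoders. It also assumes the original script loaded everything in float32, which as far as I know is what `from_pretrained` does when no `torch_dtype` is passed.

```python
GIB = 1024 ** 3

# Round figures from this thread, not exact parameter counts.
PARAMS = {"t5_text_encoder": 4e9, "image_model": 2e9}
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_gib(n_params, dtype):
    """Approximate size of the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / GIB


for dtype in BYTES_PER_PARAM:
    sizes = {name: weights_gib(n, dtype) for name, n in PARAMS.items()}
    total = sum(sizes.values())
    parts = ", ".join(f"{name}={size:.1f}" for name, size in sizes.items())
    print(f"{dtype}: {parts}, total={total:.1f} GiB")
```

Under these assumptions the weights come to roughly 22 GiB in float32 and roughly 11 GiB in float16, which would explain why a 16 GB T4 fills up before the first image is done when everything is moved to the GPU at once.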
Weird, it failed on the T4 with another error.
Traceback (most recent call last):
  File "/home/diffusers/main.py", line 12, in <module>
    pipe = DiffusionPipeline.from_pretrained(
  File "/home/diffusers/venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/diffusers/venv/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 881, in from_pretrained
    loaded_sub_model = load_sub_model(
  File "/home/diffusers/venv/lib/python3.10/site-packages/diffusers/pipelines/pipeline_loading_utils.py", line 703, in load_sub_model
    loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
  File "/home/diffusers/venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/diffusers/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 757, in from_pretrained
    unexpected_keys = load_model_dict_into_meta(
  File "/home/diffusers/venv/lib/python3.10/site-packages/diffusers/models/model_loading_utils.py", line 154, in load_model_dict_into_meta
    raise ValueError(
ValueError: Cannot load /root/.cache/huggingface/hub/models--stabilityai--stable-diffusion-3.5-medium/snapshots/b940f670f0eda2d07fbb75229e779da1ad11eb80/transformer because transformer_blocks.0.norm1.linear.bias expected shape tensor(..., device='meta', size=(9216,)), but got torch.Size([13824]). If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example.
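One possible reading of this mismatch, offered as an assumption and not confirmed anywhere in this thread: the two sizes are both multiples of 1536, which I believe is the transformer hidden size of SD3.5-medium. The installed diffusers would then be building `norm1` with 6 modulation chunks while the checkpoint stores 9, which might mean the installed diffusers release predates SD3.5-medium support. The helper names below are hypothetical; the sketch only does the arithmetic and prints the installed version so it can be compared against the model card's requirements.

```python
from importlib import metadata


def modulation_chunks(expected, actual, hidden_size):
    """Return how many hidden_size-wide chunks each bias vector holds."""
    if expected % hidden_size or actual % hidden_size:
        raise ValueError("sizes are not multiples of hidden_size")
    return expected // hidden_size, actual // hidden_size


def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# 1536 is an assumed hidden size; 9216 and 13824 come from the error message.
print(modulation_chunks(9216, 13824, 1536))
print("diffusers:", installed_version("diffusers"))
```

If the version turns out to be older than what the model card asks for, upgrading with `pip install -U diffusers` would be the first thing to try.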