What minimal VRAM does it require?

#18
by DrNicefellow

As the title says: does a 3090 work?

As far as I know, it's around 6GB for Int8 mode, and 12GB for stock/default mode.

@Hashnimo Really? I got a 3090. I tried the following:
python inference.py --ckpt_dir './LTX-Video-model' --prompt "A woman with light skin, wearing a blue jacket and a black hat with a veil, looks down and to her right, then back up as she speaks; she has brown hair styled in an updo, light brown eyebrows, and is wearing a white collared shirt under her jacket; the camera remains stationary on her face as she speaks; the background is out of focus, but shows trees and people in period clothing; the scene is captured in real-life footage." --height 512 --width 512 --num_frames 128 --seed 0

Then all 24GB of VRAM is consumed, so it's definitely more than 24GB. Where did you get the idea of 12GB?
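For reference, nvidia-smi also counts PyTorch's memory cache, so "all 24GB consumed" doesn't necessarily mean the model needs all of it. Here is one way to see the real peak allocation (a minimal sketch, assuming a CUDA build of PyTorch and that you can add a few lines around the generation call in inference.py):

```python
# Minimal sketch: report peak VRAM actually allocated vs. reserved by PyTorch.
import torch

torch.cuda.reset_peak_memory_stats()

# ... run the generation call here ...

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 2**30:.1f} GiB")
```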

I was able to make the text-to-video workflow run on a 12.2GiB GPU using ComfyUI.

In the model loader node I load the bf16 model but cast it to float16 (I had to modify the code a little).
I also had to change bfloat16 to float16 in the VAE decoder.

You may not have to change the data types if your GPU supports bfloat16 directly.

P.S. I also used --lowvram --force-fp16 (not sure if that's necessary).
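If you're not sure whether your card supports bfloat16 natively, here's a quick check (a minimal sketch, assuming a CUDA build of PyTorch; Ampere cards such as the 3090 should report True, in which case the float16 edits shouldn't be needed):

```python
# Minimal check: does this GPU handle bfloat16 natively?
import torch

print(torch.cuda.is_available())       # must be True to run on the GPU at all
print(torch.cuda.is_bf16_supported())  # True on Ampere (RTX 30xx) and newer
```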

That's great! Could you share the ComfyUI workflow that takes 12GB of VRAM?

```bash
python main.py --listen 0.0.0.0 --port 8888 --lowvram --force-fp16 --fp16-unet  # CLI
```

I didn't change the workflow parameters, just the prompt and the files you see below.

model: ltx-2b-v0.9-bf16.safetensors
dtype: torch.float16

diff --git a/loader_node.py b/loader_node.py
index 496e7bb..5191689 100644
--- a/loader_node.py
+++ b/loader_node.py
@@ -29,7 +29,7 @@ class LTXVLoader:
                     folder_paths.get_filename_list("checkpoints"),
                     {"tooltip": "The name of the checkpoint (model) to load."},
                 ),
-                "dtype": (["bfloat16", "float32"], {"default": "float32"}),
+                "dtype": (["float16", "bfloat16", "float32"], {"default": "float32"}),
             }
         }

@@ -41,7 +41,7 @@ class LTXVLoader:
     OUTPUT_NODE = False

     def load(self, ckpt_name, dtype):
-        dtype_map = {"bfloat16": torch.bfloat16, "float32": torch.float32}
+        dtype_map = {"float16": torch.float16, "bfloat16": torch.bfloat16, "float32": torch.float32}
         load_device = comfy.model_management.get_torch_device()
         offload_device = comfy.model_management.unet_offload_device()

diff --git a/model.py b/model.py
index 555d26b..75c5d76 100644
--- a/model.py
+++ b/model.py
@@ -111,7 +111,7 @@ class LTXVTransformer3D(nn.Module):

         l = latent_patchified
         if mixed_precision:
-            context_manager = torch.autocast("cuda", dtype=torch.bfloat16)
+            context_manager = torch.autocast("cuda", dtype=torch.float16)
         else:
             context_manager = nullcontext()
         with context_manager:
diff --git a/vae.py b/vae.py
index 0a39c20..3b7b5f2 100644
--- a/vae.py
+++ b/vae.py
@@ -27,7 +27,7 @@ class LTXVVAE(comfy.sd.VAE):
         self.offload_device = comfy.model_management.vae_offload_device()

     @classmethod
-    def from_pretrained(cls, vae_class, model_path, dtype=torch.bfloat16):
+    def from_pretrained(cls, vae_class, model_path, dtype=torch.float16):
         instance = cls()
         model = vae_class.from_pretrained(
             pretrained_model_name_or_path=model_path,
@@ -40,7 +40,7 @@ class LTXVVAE(comfy.sd.VAE):

     @classmethod
     def from_config_and_state_dict(
-        cls, vae_class, config, state_dict, dtype=torch.bfloat16
+        cls, vae_class, config, state_dict, dtype=torch.float16
     ):
         instance = cls()
         model = vae_class.from_config(config)
@@ -79,7 +79,7 @@ class LTXVVAE(comfy.sd.VAE):
         preprocessed = self.image_processor.preprocess(
             pixel_samples.permute(3, 0, 1, 2)
         )
-        input = preprocessed.unsqueeze(0).to(torch.bfloat16).to(self.device)
+        input = preprocessed.unsqueeze(0).to(torch.float16).to(self.device)
         latents = vae_encode(
             input, self.first_stage_model, vae_per_channel_normalize=True
         ).to(comfy.model_management.get_torch_device())

  • Change the model in line number 311 from "PixArt-alpha/PixArt-XL-2-1024-MS" to "Lightricks/T5-XXL-8bit".
  • Download the 4 tokenizer files from "google/flan-t5-xxl" and use them in line 316.
  • Run the inference and comment out the lines of code that throw CUDA errors (follow the error instructions on the screen).

After making the above changes, the model runs on my 3090.
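For reference, this is roughly what that text-encoder swap looks like (a sketch only; the variable names are illustrative and the exact from_pretrained arguments may differ, since the 8-bit checkpoint likely needs bitsandbytes installed):

```python
# Illustrative sketch of the swap described above, not the repo's exact code.
from transformers import T5EncoderModel, T5Tokenizer

# Around line 311: load the 8-bit T5-XXL in place of the PixArt checkpoint.
# Assumption: the pre-quantized weights load via bitsandbytes under the hood.
text_encoder = T5EncoderModel.from_pretrained("Lightricks/T5-XXL-8bit")

# Around line 316: the 8-bit repo may not ship tokenizer files,
# so use the tokenizer from google/flan-t5-xxl instead.
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
```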

How fast does the model run on your 3090 at int8?

Lightricks org

Hi! Yesterday someone managed to run a quantized version on a laptop with a 4060, which has only 8 GB of VRAM.
Check it out here: https://www.reddit.com/r/StableDiffusion/comments/1h79ks2/fast_ltx_video_on_rtx_4060_and_other_ada_gpus/

It also seems like the minimum requirements go down every day :)

@Shecht-ltx That's awesome, thank you!

My graphics card is a 4060 Ti with 8GB of VRAM. Sometimes it works, but sometimes I run out of memory.
