Hunyuan video LoRA training study (Single image/style training)

Published January 27, 2025

This is a follow-up to my LTX-Video LoRA training study.

I finally got around to setting things up for Hunyuan inference and training and put my poor 3090 to work.

Generation in ComfyUI takes about 17 GB of VRAM, and training used around 18 GB.

These are all images from a pass of 400 steps, or 100 epochs, which took 1 hour 37 minutes to complete. That's about half the time of the 4000 steps of LTX LoRA training.

This is not as extensive as the LTX post due to the time it takes to generate (LTX clearly wins there), but on other counts I'm leaning towards Hunyuan.

Training was done using diffusers, with finetrainers as the backend and finetrainers-ui (my own project) as the GUI.

Inference was done with ComfyUI core nodes, applying this workaround to allow loading the LoRAs.

The LoRA is available here: https://huggingface.co/neph1/hunyuan_night_graveyard
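If you'd rather skip ComfyUI, a plain diffusers setup along these lines should also work. This is only a rough sketch, not what I used for the images in this post: the model id, adapter name, frame count and prompt are assumptions, and you may need to pass weight_name to load_lora_weights depending on what's in the repo. "afkx" is the id_token from the training config further down.

```python
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# Assumption: the diffusers-format weights from the community mirror.
model_id = "hunyuanvideo-community/HunyuanVideo"

transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)

# Memory savers to stay in ~17 GB territory on a 3090.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()
pipe.enable_model_cpu_offload()

# Load the trained LoRA (adapter name is arbitrary).
pipe.load_lora_weights("neph1/hunyuan_night_graveyard", adapter_name="graveyard")

video = pipe(
    prompt="afkx A night graveyard shrouded in fog",  # illustrative prompt
    height=512,                # matches the 512x512 training buckets
    width=512,
    num_frames=61,             # assumption; the LoRA was trained on single frames
    num_inference_steps=20,
    guidance_scale=4.0,
).frames[0]
export_to_video(video, "graveyard.mp4", fps=15)
```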

Dataset image:

Without the LoRA (base model):

Looks pretty nice, actually. But not like my dataset, which is good.

400 training steps

LoRA strength variation, 4.0 guidance, 20 inference steps (0.6, 0.8, 1.0)


Guidance variation, 20 inference steps (2.0, 4.0, 6.0)


Lowering the guidance gives it a more natural look, which might be expected since the style of the LoRA is not realistic.
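To reproduce these strength and guidance sweeps in diffusers rather than ComfyUI, something like the following sketch should do it, reusing `pipe` from the earlier snippet (the adapter name, prompt and single-frame output are assumptions on my part):

```python
prompt = "afkx A night graveyard shrouded in fog"  # illustrative prompt

# LoRA strength sweep at fixed guidance (mirrors the 0.6 / 0.8 / 1.0 grid).
for strength in (0.6, 0.8, 1.0):
    pipe.set_adapters("graveyard", strength)
    frames = pipe(prompt=prompt, height=512, width=512, num_frames=1,
                  num_inference_steps=20, guidance_scale=4.0).frames[0]
    # save or inspect `frames` here

# Guidance sweep at full strength (mirrors the 2.0 / 4.0 / 6.0 grid).
pipe.set_adapters("graveyard", 1.0)
for guidance in (2.0, 4.0, 6.0):
    frames = pipe(prompt=prompt, height=512, width=512, num_frames=1,
                  num_inference_steps=20, guidance_scale=guidance).frames[0]
```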

Steps variation, 6.0 guidance (15, 20, 25)


Hmm, not a lot of difference between 20 and 25 steps, though 25 is a bit crisper. Probably not worth the extra time for 25 steps? 15 steps: 164 s. 20 steps: 217 s. 25 steps: 262 s.

Flexibility, prompt variations, 4.0 guidance, 20 steps, strength 1.0

Here I'm trying to prompt for things not in the dataset. The results are underwhelming, maybe because of the number of training steps, or the LoRA strength.

"Shining a flashlight through the fog"

"Giant human silhouette seen in the distance"

Training steps, 4.0 guidance, 20 steps, 1.0 strength (200 steps, 400 steps)

My main focus was on the 400-step version, but I wanted to test this too (added in a later update) to see if 400 steps were necessary.


200 steps looks OK. It's similar to a lower strength on the 400-step version (which is expected, I guess).

Full config used. This can be loaded in finetrainers-ui:

accelerate_config: uncompiled_1.yaml
allow_tf32: true
batch_size: 1
beta1: 0.9
beta2: 0.95
caption_column: prompts.txt
caption_dropout_p: 0.05
caption_dropout_technique: empty
checkpointing_limit: 3
checkpointing_steps: 200
data_root: ''
dataloader_num_workers: 0
dataset_file: ''
diffusion_options: ''
enable_model_cpu_offload: ''
enable_slicing: true
enable_tiling: true
epsilon: 1e-8
gpu_ids: '0'
gradient_accumulation_steps: 8
gradient_checkpointing: true
id_token: afkx
image_resolution_buckets: 512x512
layerwise_upcasting_modules: transformer
layerwise_upcasting_skip_modules_pattern: patch_embed pos_embed x_embedder context_embedder ^proj_in$ ^proj_out$ norm
layerwise_upcasting_storage_dtype: float8_e5m2
lora_alpha: 64
lr: 0.0002
lr_num_cycles: 1
lr_scheduler: linear
lr_warmup_steps: 100
max_grad_norm: 1
model_name: hunyuan_video
nccl_timeout: 1800
num_validation_videos: 0
optimizer: adamw
output_dir: ''
pin_memory: true
precompute_conditions: true
pretrained_model_name_or_path: ''
rank: 64
report_to: none
resume_from_checkpoint: ''
seed: 425
target_modules: to_q to_k to_v to_out.0
text_encoder_2_dtype: bf16
text_encoder_3_dtype: bf16
text_encoder_dtype: bf16
tracker_name: finetrainers
train_steps: 3000
training_type: lora
transformer_dtype: bf16
use_8bit_bnb: ''
vae_dtype: bf16
validation_epochs: 0
validation_prompt_separator: ':::'
validation_prompts: ''
validation_steps: 10000
video_column: videos.txt
video_resolution_buckets: 1x512x512
weight_decay: 0.001
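For reference, the LoRA hyperparameters in that config (rank 64, alpha 64, attention projections only) correspond roughly to the PEFT config below. This is just an illustration of what gets adapted; finetrainers builds its own config internally.

```python
from peft import LoraConfig

# Roughly what rank/lora_alpha/target_modules above translate to:
# only the attention projections of the transformer are adapted.
# With alpha equal to rank, the effective LoRA scaling (alpha / r) is 1.0.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
```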
