Questions about how Stable Diffusion works and was trained.

#2127
by MrSusanovo - opened

Just found out I probably posted this question on the wrong page (it was on the CompVis/stable-diffusion-v1-4 page), so I've copied it here.

I know I'm dumb enough to be completely lost in papers. But I didn't expect to be too dumb to understand the brief intros to state-of-the-art computer vision concepts. There's something I find quite confusing after reading and playing around with the example Colab notebook.

To my understanding, the general structure of the current Stable Diffusion pipeline consists mainly of four parts: a VAE, a UNet that performs the reverse diffusion process in latent space, a text encoder trained on caption-image pairs, and a scheduler.
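
For concreteness, the data flow through those four parts can be sketched with mock arrays (shapes follow the SD v1 defaults for 512×512 output; zero arrays purely to illustrate shapes, not the real models):

```python
import numpy as np

# Mocked Stable Diffusion v1 shapes, for illustration only.
tokens   = np.zeros((1, 77), dtype=np.int64)   # tokenizer: prompt -> 77 token ids
text_emb = np.zeros((1, 77, 768))              # CLIP text encoder output
latent   = np.zeros((1, 4, 64, 64))            # latent the UNet denoises (1/8 resolution)
noise    = np.zeros((1, 4, 64, 64))            # UNet output: predicted noise, same shape
image    = np.zeros((1, 3, 512, 512))          # VAE decoder output

# The scheduler sits between the UNet and the latent: at each step it turns
# (latent, predicted noise, timestep) into the next, slightly cleaner latent.
print(text_emb.shape, latent.shape, image.shape)
```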

My questions are:

  1. What does the scheduler do? Why not directly subtract the predicted noise from the latent after each inference step?
  2. Among these four parts, which ones did you train? The notebook says the text encoder is off-the-shelf, and I guess the latent UNet is trained, but I'm not sure about the VAE and the scheduler (do they need to be trained, or are they fixed-step algorithms?).
  3. Probably not too relevant here if the VAE is not something that needs to be trained but is off-the-shelf. If it does need to be trained, how should we train it? The input and output have equal dimensions, so the best result would seem to be a VAE that does nothing and just reproduces its input. How do we make it "compress" the input while keeping the important information?
  4. About text encoders: can we swap in a model other than "openai/clip-vit-large-patch14" (say, to support multilingual prompts or a very specific category of text-image pairs)? What would the restrictions be? Or, what are the constraints on making the text encoder, or even the whole diffusion pipeline, more modular?
  5. How does the cross-attention layer work if the UNet takes a 4×64×64 latent but the text encoder produces a 2×77×768 output (I think it always encodes an empty string for the unconditional prompt, hence my guess of 2×77×768)?
  6. If we simply want to denoise a picture, which of the four parts (text encoder, VAE, UNet, scheduler) are still necessary and which aren't?
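
On question 1, my current understanding is that the scheduler can't simply subtract the predicted noise, because the latent at step t is a *scaled* mixture of signal and noise; the update has to rescale both parts according to the noise schedule. A minimal sketch of a deterministic DDIM-style step (illustrative ᾱ values, not the real schedule):

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update: rescale signal and noise, don't just subtract."""
    # Estimate the clean latent implied by the noise prediction.
    x0_pred = (x_t - np.sqrt(1 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Re-noise that estimate to the previous (less noisy) timestep.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1 - abar_prev) * eps_pred

rng = np.random.default_rng(0)
x0  = rng.standard_normal((4, 64, 64))   # "clean" latent
eps = rng.standard_normal((4, 64, 64))   # true noise
abar_t = 0.3
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps   # forward-noised latent

# With a perfect noise prediction, stepping all the way to abar=1
# recovers x0 exactly, while naive subtraction (x_t - eps) does not.
print(np.allclose(ddim_step(x_t, eps, abar_t, abar_prev=1.0), x0))  # True
print(np.allclose(x_t - eps, x0))                                   # False
```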
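
On question 3, the way I understand it a VAE can't collapse to the identity for two reasons: the latent bottleneck is much smaller than the input (4×64×64 vs. 3×512×512 in SD), and the training loss adds a KL penalty on the latent distribution on top of the reconstruction term. A toy sketch of that objective (`vae_loss` is a hypothetical helper, not the actual training code; the tiny `beta` mirrors the very light KL weighting described in the latent diffusion paper):

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar, beta=1e-6):
    """Toy VAE objective: reconstruction error plus a KL penalty that pushes
    the encoder's latent distribution toward a standard normal."""
    recon = np.mean((x - x_recon) ** 2)
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    return recon + beta * kl

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
mu, logvar = np.zeros(16), np.zeros(16)

# Perfect reconstruction with a unit-Gaussian latent gives zero loss;
# any deviation in either term raises it.
print(vae_loss(x, x, mu, logvar))            # 0.0
print(vae_loss(x, x * 0.9, mu, logvar) > 0)  # True
```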
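
On question 5, as far as I can tell the latent and the text embedding never need matching shapes (and the leading 2 in 2×77×768 is just a batch of unconditional + conditional embeddings for classifier-free guidance; per prompt it's 77×768). Inside a cross-attention layer, the latent's spatial positions become the queries while the 77 text tokens supply the keys and values. A minimal single-head sketch with hypothetical projection sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 320                                    # shared attention dim (illustrative)

latent = rng.standard_normal((4, 64, 64))  # UNet feature map: 4 channels, 64x64
text   = rng.standard_normal((77, 768))    # CLIP output for one prompt

# Flatten spatial positions into a sequence: 64*64 = 4096 query tokens.
q_in = latent.reshape(4, -1).T             # (4096, 4)

# Learned projections map both sides into the shared dimension d.
W_q = rng.standard_normal((4, d))
W_k = rng.standard_normal((768, d))
W_v = rng.standard_normal((768, d))

Q = q_in @ W_q                             # (4096, d)
K = text @ W_k                             # (77, d)
V = text @ W_v                             # (77, d)

# Attention: every latent position attends over all 77 text tokens.
A = Q @ K.T / np.sqrt(d)                   # (4096, 77)
A = np.exp(A - A.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)         # softmax over the 77 tokens
out = A @ V                                # (4096, d), reshaped back to 64x64
print(A.shape, out.shape)
```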

Apologies for dumping a ton of dumb questions, and for any duplicates; Hugging Face doesn't seem to provide a search function here. I'd appreciate any answers or links.

Last but not least, thank you for such an amazing project and for just about the best intro notebook I've ever seen.
