
Philosophy

🧨 Diffusers provides state-of-the-art pretrained diffusion models across multiple modalities. Its purpose is to serve as a modular toolbox for both inference and training.

We aim to build a library that stands the test of time and therefore take API design very seriously.

In a nutshell, Diffusers is built to be a natural extension of PyTorch. Therefore, most of our design choices are based on PyTorch's Design Principles. Let's go over the most important ones:

Usability over Performance

  • While Diffusers has many built-in performance-enhancing features (see Memory and Speed), models are always loaded with the highest precision and lowest optimization. Therefore, unless the user specifies otherwise, diffusion pipelines are always instantiated on CPU with float32 precision. This ensures usability across different platforms and accelerators and means that no complex installations are required to run the library.
  • Diffusers aims to be a lightweight package and therefore has very few required dependencies, but many soft dependencies that can improve performance (such as accelerate, safetensors, onnx, etc.). We strive to keep the library as lightweight as possible so that it can be added as a dependency of other packages without much concern.
  • Diffusers prefers simple, self-explanatory code over condensed, magic code. This means that shorthand syntax, such as lambda functions, and advanced PyTorch operators are often not desired.

Simple over easy

As PyTorch states, explicit is better than implicit and simple is better than complex. This design philosophy is reflected in multiple parts of the library:

  • We follow PyTorch's API with methods like DiffusionPipeline.to to let the user handle device management.
  • Raising concise error messages is preferred to silently correcting erroneous input. Diffusers aims at teaching the user, rather than making the library as easy to use as possible.
  • Complex model vs. scheduler logic is exposed instead of magically handled inside. Schedulers/Samplers are separated from diffusion models with minimal dependencies on each other. This forces the user to write the unrolled denoising loop. However, the separation allows for easier debugging and gives the user more control over adapting the denoising process or switching out diffusion models or schedulers.
  • Separately trained components of the diffusion pipeline, e.g. the text encoder, the UNet, and the variational autoencoder, each have their own model class. This forces the user to handle the interaction between the different model components, and the serialization format separates the model components into different files. However, this allows for easier debugging and customization. DreamBooth or Textual Inversion training is very simple thanks to Diffusers' ability to separate single components of the diffusion pipeline.
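
To make the separation of model and scheduler concrete, here is a minimal, dependency-free sketch of an unrolled denoising loop. ToyModel and ToyScheduler are hypothetical stand-ins written for this example, not Diffusers classes, and the update rule is deliberately trivial. The scheduler also raises a short error on invalid input instead of silently correcting it.

```python
# Toy, dependency-free stand-ins; these classes are not part of Diffusers.

class ToyModel:
    """Pretends to predict the noise contained in a sample."""

    def __call__(self, sample, timestep):
        # Hypothetical rule: claim that half of every value is noise.
        return [0.5 * value for value in sample]


class ToyScheduler:
    """Owns the timestep schedule and the update rule, nothing else."""

    def __init__(self):
        self.timesteps = []

    def set_timesteps(self, num_inference_steps):
        if not isinstance(num_inference_steps, int) or num_inference_steps < 1:
            # Raise a short, explicit error instead of silently fixing the input.
            raise ValueError(
                "`num_inference_steps` must be a positive integer, "
                f"got {num_inference_steps!r}."
            )
        # Count down from the noisiest timestep to timestep 0.
        self.timesteps = list(range(num_inference_steps - 1, -1, -1))

    def step(self, model_output, timestep, sample):
        # Remove the predicted noise from the current sample.
        return [value - noise for value, noise in zip(sample, model_output)]


model = ToyModel()
scheduler = ToyScheduler()
scheduler.set_timesteps(3)

sample = [8.0, -4.0]
# The "unrolled" loop: the user wires the model and the scheduler together.
for timestep in scheduler.timesteps:
    noise_prediction = model(sample, timestep)
    sample = scheduler.step(noise_prediction, timestep, sample)

print(sample)  # [1.0, -0.5]
```

Because the loop lives in user code, swapping the scheduler or inspecting an intermediate sample only requires changing these few lines.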

Tweakable, contributor-friendly over abstraction

For large parts of the library, Diffusers adopts an important design principle of the Transformers library, which is to prefer copy-pasted code over hasty abstractions. This design principle is very opinionated and stands in stark contrast to popular design principles such as Don't repeat yourself (DRY). In short, just like Transformers does for modeling files, Diffusers prefers to keep an extremely low level of abstraction and very self-contained code for pipelines and schedulers. Functions, long code blocks, and even classes can be copied across multiple files, which at first can look like a bad, sloppy design choice that makes the library unmaintainable. However, this design has proven to be extremely successful for Transformers and makes a lot of sense for community-driven, open-source machine learning libraries because:

  • Machine Learning is an extremely fast-moving field in which paradigms, model architectures, and algorithms are changing rapidly, which therefore makes it very difficult to define long-lasting code abstractions.
  • Machine Learning practitioners like to be able to quickly tweak existing code for ideation and research and therefore prefer self-contained code over code that contains many abstractions.
  • Open-source libraries rely on community contributions and therefore must be easy to contribute to. The more abstract the code, the more dependencies it has, the harder it is to read, and the harder it is to contribute to. Contributors simply stop contributing to very abstract libraries out of fear of breaking vital functionality. If contributing to a library cannot break other fundamental code, not only is it more inviting for potential new contributors, but it is also easier to review and contribute to multiple parts in parallel.

At Hugging Face, we call this design the single-file policy, which means that almost all of the code of a certain class should be written in a single, self-contained file. To read more about the philosophy, you can have a look at this blog post.

In Diffusers, we follow this philosophy for both pipelines and schedulers, but only partly for diffusion models. The reason we don't follow this design fully for diffusion models is that almost all diffusion pipelines, such as DDPM, Stable Diffusion, unCLIP (DALL·E 2), and Imagen, rely on the same diffusion model, the UNet.
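
Copying code between files only works if the copies stay in sync. In the Diffusers repository, duplicated code is marked with a # Copied from comment that names the original. The sketch below illustrates, in a much simplified form, how such a marker could be checked. MARKER, ORIGINAL_SOURCES, and find_stale_copies are hypothetical names for this example and do not reproduce the repository's actual tooling.

```python
# Hypothetical sketch of a "copies stay in sync" check.
# The lookup table and helper below are assumptions made for this example.

MARKER = "# Copied from "

# Source text of the originals, keyed by a made-up dotted name.
ORIGINAL_SOURCES = {
    "schedulers.toy.rescale": (
        "def rescale(values, factor):\n"
        "    return [value * factor for value in values]\n"
    ),
}

# A second file that carries a marked copy of the function above.
COPIED_FILE = (
    "# Copied from schedulers.toy.rescale\n"
    "def rescale(values, factor):\n"
    "    return [value * factor for value in values]\n"
)


def find_stale_copies(file_text, original_sources):
    """Return the names of originals whose marked copy has drifted."""
    stale = []
    lines = file_text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if not line.startswith(MARKER):
            continue
        name = line[len(MARKER):].strip()
        original = original_sources[name]
        # Compare as many lines after the marker as the original has.
        length = len(original.splitlines())
        copied = "".join(lines[index + 1 : index + 1 + length])
        if copied != original:
            stale.append(name)
    return stale


print(find_stale_copies(COPIED_FILE, ORIGINAL_SOURCES))  # []
```

A check of this kind keeps the cost of duplication low: each file stays readable on its own, while drift between copies is still detected automatically.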

Great, now you should have a general understanding of why 🧨 Diffusers is designed the way it is 🤗. We try to apply these design principles consistently across the library. Nevertheless, there are some minor exceptions to the philosophy and some unfortunate design choices. If you have feedback regarding the design, we would ❤️ to hear it directly on GitHub.

Design Philosophy in Detail

Now, let's look a bit into the nitty-gritty details of the design philosophy. Diffusers essentially consists of three major classes: pipelines, models, and schedulers. Let's walk through more detailed design decisions for each class.

Pipelines

Pipelines are designed to be easy to use (and therefore do not fully follow Simple over easy), are not feature complete, and should loosely be seen as examples of how to use models and schedulers for inference.

The following design principles are followed:

  • Pipelines follow the single-file policy. All pipelines can be found in individual directories under src/diffusers/pipelines. One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as is done for src/diffusers/pipelines/stable_diffusion. If pipelines share similar functionality, one can make use of the # Copied from mechanism.
  • Pipelines all inherit from DiffusionPipeline.
  • Every pipeline consists of different model and scheduler components that are documented in the model_index.json file, are accessible under the same name as attributes of the pipeline, and can be shared between pipelines with the DiffusionPipeline.components property.
  • Every pipeline should be loadable via the DiffusionPipeline.from_pretrained function.
  • Pipelines should be used only for inference.
  • Pipelines should be very readable, self-explanatory, and easy to tweak.
  • Pipelines should be designed to build on top of each other and be easy to integrate into higher-level APIs.
  • Pipelines are not intended to be feature-complete user interfaces. For feature-complete user interfaces, one should rather have a look at InvokeAI, Diffuzers, and lama-cleaner.
  • Every pipeline should have one and only one way to run it via a __call__ method. The naming of the __call__ arguments should be shared across all pipelines.
  • Pipelines should be named after the task they are intended to solve.
  • In almost all cases, novel diffusion pipelines shall be implemented in a new pipeline folder/file.
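
Several of the principles above (components exposed as attributes, a shared components dictionary, a single __call__ entry point, task-based names) can be illustrated with a toy sketch. Every class and function below is a hypothetical stand-in written for this example. No Diffusers class is used, and the "components" are plain functions.

```python
# Toy stand-ins only; every class and function name here is hypothetical.

def count_words(prompt):
    """Plays the role of an encoder component."""
    return len(prompt.split())


def draw_stars(count):
    """Plays the role of a decoder component."""
    return "*" * count


class ToyTextToStarsPipeline:
    """Named after its task; the only way to run it is __call__."""

    def __init__(self, encoder, decoder):
        # Components are stored under the names they were passed in with.
        self.encoder = encoder
        self.decoder = decoder

    @property
    def components(self):
        # One dictionary that can be unpacked into another pipeline.
        return {"encoder": self.encoder, "decoder": self.decoder}

    def __call__(self, prompt):
        return self.decoder(self.encoder(prompt))


class ToyTextToStarRowsPipeline:
    """A second task built from the same components."""

    def __init__(self, encoder, decoder):
        self.encoder = encoder
        self.decoder = decoder

    def __call__(self, prompt, num_rows=2):
        # The shared argument keeps the same name (`prompt`) in both pipelines.
        row = self.decoder(self.encoder(prompt))
        return [row] * num_rows


text_to_stars = ToyTextToStarsPipeline(encoder=count_words, decoder=draw_stars)
# Reuse the already constructed components instead of building them again.
text_to_rows = ToyTextToStarRowsPipeline(**text_to_stars.components)

print(text_to_stars("a toy prompt"))  # ***
print(text_to_rows("a toy prompt"))   # ['***', '***']
```

Unpacking the components dictionary into a second pipeline avoids constructing (or, for real models, loading) the same objects twice.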

Models

Models are designed as configurable toolboxes that are natural extensions of PyTorch's Module class. They only partly follow the single-file policy.

The following design principles are followed:

  • Models correspond to a type of model architecture. E.g. the UNet2DConditionModel class is used for all UNet variations that expect 2D image inputs and are conditioned on some context.
  • All models can be found in src/diffusers/models and every model architecture shall be defined in its own file, e.g. unet_2d_condition.py, transformer_2d.py, etc.
  • Models do not follow the single-file policy and should make use of smaller model building blocks, such as attention.py, resnet.py, embeddings.py, etc. Note: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy.
  • Models intend to expose complexity, just like PyTorch's Module class, and give clear error messages.
  • Models all inherit from ModelMixin and ConfigMixin.
  • Models can be optimized for performance when the optimization doesn't demand major code changes, keeps backward compatibility, and gives a significant memory or compute gain.
  • Models should by default have the highest precision and lowest performance setting.
  • To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different.
  • Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, e.g. it is usually better to add string "...type" arguments that can easily be extended to new future types instead of boolean is_..._type arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work.
  • The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and readable long-term, such as UNet blocks and Attention processors.
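
The advice to prefer string "...type" arguments over boolean is_..._type flags can be sketched as follows. Both helpers and the argument names (is_linear_attention, attention_type) are hypothetical and exist only for this example.

```python
# Hypothetical configuration helpers; all names are made up for this sketch.

def make_block_config_with_flag(is_linear_attention=False):
    """Boolean style: a third attention variant would need yet another flag."""
    return {"is_linear_attention": is_linear_attention}


SUPPORTED_ATTENTION_TYPES = ("softmax", "linear")


def make_block_config_with_type(attention_type="softmax"):
    """String style: new variants extend the tuple, the signature stays put."""
    if attention_type not in SUPPORTED_ATTENTION_TYPES:
        # Give a clear error instead of silently falling back to a default.
        raise ValueError(
            f"`attention_type` must be one of {SUPPORTED_ATTENTION_TYPES}, "
            f"got {attention_type!r}."
        )
    return {"attention_type": attention_type}


print(make_block_config_with_type("linear"))  # {'attention_type': 'linear'}
```

With the boolean flag, supporting a third variant means adding a second flag and then handling the case where both are set. With the string argument, only the tuple of supported values grows, and existing configurations stay valid.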

Schedulers

Schedulers are responsible for guiding the denoising process for inference as well as for defining a noise schedule for training. They are designed as individual classes with loadable configuration files and strongly follow the single-file policy.

The following design principles are followed:

  • All schedulers are found in src/diffusers/schedulers.
  • Schedulers are not allowed to import from large utils files and shall be kept very self-contained.
  • One scheduler Python file corresponds to one scheduler algorithm (as might be defined in a paper).
  • If schedulers share similar functionalities, we can make use of the # Copied from mechanism.
  • Schedulers all inherit from SchedulerMixin and ConfigMixin.
  • Schedulers can be easily swapped out with the ConfigMixin.from_config method as explained in detail here.
  • Every scheduler has to have a set_timesteps and a step function. set_timesteps(...) has to be called before every denoising process, i.e. before step(...) is called.
  • Every scheduler exposes the timesteps to be "looped over" via a timesteps attribute, which is an array of timesteps the model will be called upon.
  • The step(...) function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1).
  • Given the complexity of diffusion schedulers, the step function does not expose all the complexity and can be a bit of a "black box".
  • In almost all cases, novel schedulers shall be implemented in a new scheduling file.
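
The scheduler contract described above (a configuration, a timesteps attribute, a step function, and swapping via from_config) can be sketched with toy classes. All class names and update rules below are hypothetical and far simpler than any real scheduler. The method that sets the number of inference steps is named set_timesteps here, matching the schedulers shipped with Diffusers.

```python
# Toy stand-ins; class names and update rules are hypothetical.

class ToyConfigMixin:
    """Stores constructor arguments so a scheduler can be rebuilt from them."""

    def __init__(self, num_train_timesteps=1000):
        self.config = {"num_train_timesteps": num_train_timesteps}
        self.timesteps = []

    @classmethod
    def from_config(cls, config):
        return cls(**config)


class ToyHalfStepScheduler(ToyConfigMixin):
    def set_timesteps(self, num_inference_steps):
        # Evenly spaced timesteps, from most noisy to least noisy.
        stride = self.config["num_train_timesteps"] // num_inference_steps
        self.timesteps = [stride * i for i in reversed(range(num_inference_steps))]

    def step(self, model_output, timestep, sample):
        # Returns the "previous", slightly more denoised sample.
        return sample - 0.5 * model_output


class ToyFullStepScheduler(ToyConfigMixin):
    # The timestep logic is duplicated on purpose to keep the class self-contained.
    def set_timesteps(self, num_inference_steps):
        stride = self.config["num_train_timesteps"] // num_inference_steps
        self.timesteps = [stride * i for i in reversed(range(num_inference_steps))]

    def step(self, model_output, timestep, sample):
        return sample - model_output


def toy_model(sample, timestep):
    # Hypothetical noise prediction: treat the whole sample as noise.
    return sample


def denoise(scheduler, sample, num_inference_steps):
    scheduler.set_timesteps(num_inference_steps)  # must precede any step call
    for timestep in scheduler.timesteps:
        sample = scheduler.step(toy_model(sample, timestep), timestep, sample)
    return sample


half_step = ToyHalfStepScheduler(num_train_timesteps=1000)
# Swap the algorithm while keeping the configuration.
full_step = ToyFullStepScheduler.from_config(half_step.config)

print(denoise(half_step, 16.0, 4))  # 1.0
print(half_step.timesteps)          # [750, 500, 250, 0]
print(denoise(full_step, 16.0, 4))  # 0.0
```

Because both schedulers expose the same small surface, the denoise function does not need to know which algorithm it is driving.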