diffusers-internal-dev

Enterprise

Activity Feed

sayakpaul

updated a model 3 days ago

diffusers-internal-dev/nano-banana-modular

Updated 3 days ago

sayakpaul

published a model 3 days ago

diffusers-internal-dev/nano-banana-modular

Updated 3 days ago

sayakpaul

updated a model 15 days ago

diffusers-internal-dev/gemini-prompt-expander

Updated 15 days ago • 1

YiYiXu

updated a model 25 days ago

diffusers-internal-dev/ideogram-character-generator

Updated 25 days ago

YiYiXu

published a model 26 days ago

diffusers-internal-dev/ideogram-character-generator

Updated 25 days ago

sayakpaul

published a model 27 days ago

diffusers-internal-dev/gemini-prompt-expander

Updated 15 days ago • 1

sayakpaul

updated a model 27 days ago

diffusers-internal-dev/canny-filtering

Updated 27 days ago

sayakpaul

published a model 27 days ago

diffusers-internal-dev/canny-filtering

Updated 27 days ago

sayakpaul

updated a Space 28 days ago

Diffusers To GGUF

💻

Convert diffusers-format model checkpoints to GGUF.

sayakpaul

published a Space 28 days ago

Diffusers To GGUF

💻

Convert diffusers-format model checkpoints to GGUF.

a-r-r-o-w

posted an update about 1 month ago

Post

2149

As an ML practitioner, you've probably implemented the three-loop matrix multiplication many times, but that naive implementation performs terribly on GPUs. Modern GPUs reach peak performance only with careful memory access patterns and minimal scheduling overhead.

In the naive (non-persistent) matmul (MxK · KxN), the computation happens in tiles, both for the output matrix and for how you read chunks from the input matrices. Each thread-block processes one output tile: it loads the corresponding input tiles (for the sum-reduction across the K dimension), performs the computation, then terminates. The GPU launches many thread-blocks and schedules them across the available streaming multiprocessors (SMs). When an SM finishes one tile, it is assigned a new thread-block for the next uncomputed tile. This way, multiple output tiles are computed in parallel across the SMs, but we pay the thread-block launch cost for every tile.
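To make the one-thread-block-per-output-tile structure concrete, here is a minimal pure-Python sketch. It is a CPU simulation, not GPU code: the function names and tile sizes are made up for illustration, and the outer loops stand in for the thread-block launches that a real GPU would run in parallel.

```python
def three_loop_matmul(A, B):
    """Reference 3-loop matmul on nested lists (A is MxK, B is KxN)."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]
    return C


def compute_output_tile(A, B, C, m0, n0, tile_m, tile_n, tile_k):
    """Work of one simulated thread-block: fill the output tile at (m0, n0)."""
    M, K, N = len(A), len(B), len(B[0])
    # Walk the K dimension one input tile at a time (the sum-reduction).
    for k0 in range(0, K, tile_k):
        for i in range(m0, min(m0 + tile_m, M)):
            for j in range(n0, min(n0 + tile_n, N)):
                for k in range(k0, min(k0 + tile_k, K)):
                    C[i][j] += A[i][k] * B[k][j]


def tiled_matmul(A, B, tile_m=2, tile_n=2, tile_k=2):
    """One simulated thread-block launch per output tile."""
    M, N = len(A), len(B[0])
    C = [[0] * N for _ in range(M)]
    blocks_launched = 0
    for m0 in range(0, M, tile_m):
        for n0 in range(0, N, tile_n):
            compute_output_tile(A, B, C, m0, n0, tile_m, tile_n, tile_k)
            blocks_launched += 1  # each tile costs one launch
    return C, blocks_launched
```

The launch counter grows with the number of output tiles, which is the overhead the persistent variant targets.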

Persistent matmul changes this approach. Instead of launching a fresh thread-block for every output tile and repeating until all tiles are computed, you launch only as many thread-blocks as you have SMs available (typically 80-132 on modern GPUs). These thread-blocks stay alive until all output tiles are computed, each one looping through multiple tiles sequentially.
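The tile-to-block assignment can be sketched in a few lines of Python. This assumes a simple static strided schedule (block `b` takes tiles `b`, `b + num_blocks`, and so on), which is only one possible policy. The function names are hypothetical, and the linked gist may assign tiles differently.

```python
def persistent_schedule(num_tiles, num_sms):
    """Return, for each persistent block, the output tile ids it loops over."""
    # Never launch more blocks than there are tiles to compute.
    num_blocks = min(num_sms, num_tiles)
    return [
        list(range(block_id, num_tiles, num_blocks))  # strided tile loop
        for block_id in range(num_blocks)
    ]


def launch_counts(num_tiles, num_sms):
    """Compare thread-block launches for the two strategies."""
    return {
        "one_block_per_tile": num_tiles,
        "persistent": min(num_sms, num_tiles),
    }
```

For example, `persistent_schedule(10, 4)` gives block 0 the tiles 0, 4 and 8, and the whole problem needs 4 launches instead of 10.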

The key benefit is the reduced thread-block launch latency. This persistence strategy, combined with other optimizations like coalesced memory loads/stores, block-tiling, warp-tiling, warp-specialization, double-buffering, ping-pong scheduling and other tricks, helps achieve peak performance. More on this in the future!

Code snippet for testing: https://gist.github.com/a-r-r-o-w/28339b442d164084506c0967029968a8

(Bonus: Since I've wanted to learn Manim for a while, this was a great opportunity to make a visualization for Naive VS Persistent matmul. Enjoy ✨)

3 replies

sayakpaul

updated a model about 1 month ago

diffusers-internal-dev/modular-flux.1-dev

Updated Jul 29

sayakpaul

published a model about 1 month ago

diffusers-internal-dev/modular-flux.1-dev

Updated Jul 29

a-r-r-o-w

updated a model about 1 month ago

diffusers-internal-dev/Modular-Wan-I2V-14B-720P-Diffusers

Updated Jul 28

a-r-r-o-w

published a model about 1 month ago

diffusers-internal-dev/Modular-Wan-I2V-14B-720P-Diffusers

Updated Jul 28

a-r-r-o-w

updated a model about 1 month ago

diffusers-internal-dev/Modular-Wan-I2V-14B-480P-Diffusers

Updated Jul 28

a-r-r-o-w

published a model about 1 month ago

diffusers-internal-dev/Modular-Wan-I2V-14B-480P-Diffusers

Updated Jul 28

sayakpaul

posted an update about 1 month ago

Post

1165

Fast LoRA inference for Flux with Diffusers and PEFT 🚨

There are great materials that demonstrate how to optimize inference for popular image generation models, such as Flux. However, very few cover how to serve LoRAs fast, despite LoRAs being an inseparable part of their adoption.

In our latest post, @BenjaminB and I show different techniques to optimize LoRA inference for the Flux family of models for image generation. Our recipe includes the use of:

1. torch.compile
2. Flash Attention 3 (when compatible)
3. Dynamic FP8 weight quantization (when compatible)
4. Hotswapping to avoid recompilation when swapping in new LoRAs 🤯
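As a rough illustration of items 1 and 4, the sketch below combines `torch.compile` with LoRA hotswapping in Diffusers. It is not the code from the blog post. The helper names, the `max_rank` default, and the LoRA identifiers passed in are placeholders, and Flash Attention 3 and FP8 quantization are left out. It assumes a recent Diffusers release with PEFT installed and a CUDA GPU.

```python
def build_hotswap_pipeline(first_lora, max_rank=128):
    """Load Flux with a first LoRA, then compile the transformer once."""
    # Imports live inside the function so this file can be read and imported
    # without torch/diffusers installed (illustrative choice only).
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    ).to("cuda")
    # Must be called before the first LoRA is loaded; max_rank should cover
    # the largest rank among the LoRAs you plan to swap in (assumed value).
    pipe.enable_lora_hotswap(target_rank=max_rank)
    pipe.load_lora_weights(first_lora)
    # Compile after the first adapter is in place.
    pipe.transformer = torch.compile(pipe.transformer)
    return pipe


def swap_lora(pipe, next_lora):
    """Replace the loaded LoRA weights in place, aiming to avoid a recompile."""
    # "default_0" is the name Diffusers gives the first unnamed adapter.
    pipe.load_lora_weights(next_lora, hotswap=True, adapter_name="default_0")
```

A typical flow would call `build_hotswap_pipeline` once at startup and `swap_lora` for each subsequent request that needs a different LoRA.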

We have tested our recipe with Flux.1-Dev on both H100 and RTX 4090. We achieve at least a *2x speedup* on both GPUs. We believe our recipe is grounded in the reality of how LoRA-based use cases are generally served. So, we hope this will be beneficial to the community 🤗

Even though our recipe was tested primarily with NVIDIA GPUs, it should also work with AMD GPUs.

Learn the details and the full code here:
https://huggingface.co/blog/lora-fast

a-r-r-o-w

updated a model about 1 month ago

diffusers-internal-dev/modular-wan-t2v

Updated Jul 23

a-r-r-o-w

published a model about 1 month ago

diffusers-internal-dev/modular-wan-t2v

Updated Jul 23

Team members 7
