---
license: mit
---
![](https://user-images.githubusercontent.com/61938694/231021615-38df0a0a-d97e-4f7a-99d9-99952357b4b1.png)
## Paella
We are releasing a new Paella model which builds on top of our initial paper https://arxiv.org/abs/2211.07292.
Paella is a text-to-image model that works in a quantized latent space and learns similarly to MUSE and diffusion models.
Since the paper release, we have worked intensively to bring Paella to a level similar to other
state-of-the-art models, and this release brings us a step closer to that goal. However, our main intention is not
to build the greatest text-to-image model out there (at least for now); it is to make text-to-image models technically
accessible to people outside the field. For example, many models have codebases with many thousands of lines of
code, which makes it pretty hard for people to dive into the code and understand it. And that is the contribution
we are most proud of with Paella. The training and sampling code for Paella is minimalistic and can be understood in a few
minutes, making further extensions, quick tests, idea testing, etc. extremely fast. For instance, the entire
sampling code can be written in just **12 lines** of code.

### How does Paella work?
Paella works in a quantized latent space, just like StableDiffusion etc., to reduce the computational power needed.
Images are encoded into a smaller latent space and converted to visual tokens of shape *h x w*. During training,
these visual tokens are noised by replacing a random fraction of them with other randomly selected tokens
from the codebook of the VQGAN. The noised image is given to the model, along with a timestep and the conditioning
information, which is text in our case. The model is tasked with predicting the un-noised version of the tokens.
And that's it. The model is optimized with the CrossEntropy loss between the original tokens and the predicted tokens.
The amount of noise added during training follows a simple linear schedule: we uniformly sample a percentage
between 0 and 100% and noise that fraction of the tokens.<br><br>

<figure>
  <img src="https://user-images.githubusercontent.com/61938694/231248435-d21170c1-57b4-4a8f-90a6-62cf3e7effcd.png" width="400">
  <figcaption>Images are noised and then fed to the model during training.</figcaption>
</figure>
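
To make the noising step concrete, here is a minimal sketch of what it could look like. The names `add_noise` and `num_labels` mirror the sampling code further below, but this is an illustrative assumption, not the exact training code from the repository.

```python
import torch

def add_noise(tokens, t, num_labels, random_x=None):
    # tokens: (B, H, W) integer VQGAN token indices, t: (B,) noise level in [0, 1]
    if random_x is None:
        random_x = torch.randint(0, num_labels, tokens.shape, device=tokens.device)
    # Each position is replaced by a random codebook token with probability t
    mask = torch.rand(tokens.shape, device=tokens.device) < t.view(-1, 1, 1)
    noised = torch.where(mask, random_x, tokens)
    return noised, mask

# Hypothetical training step: sample t uniformly, noise the tokens, predict the originals
# tokens = vqgan.encode(images)                                # (B, H, W) token indices
# t = torch.rand(tokens.size(0), device=tokens.device)
# noised, _ = add_noise(tokens, t, num_labels=model.num_labels)
# logits = model(noised, t, **model_inputs)                    # (B, num_labels, H, W)
# loss = torch.nn.functional.cross_entropy(logits, tokens)
```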


Sampling is also extremely simple: we start with an image consisting entirely of random tokens. Then we feed the latent image,
the timestep and the condition into the model and let it predict the final image. The model outputs a distribution
over every token, which we sample from with standard multinomial sampling.<br>
Since there are infinitely many ways the result could look, a single step only produces very basic
shapes without any details. That is why we add noise to the image again and feed it back to the model. We repeat
this process a number of times, adding less noise each time, and slowly arrive at the final image.
You can see how images emerge [here](https://user-images.githubusercontent.com/61938694/231252449-d9ac4d15-15ef-4aed-a0de-91fa8746a415.png).<br>
The following is the entire sampling code needed to generate images:
```python
def sample(model_inputs, latent_shape, unconditional_inputs, steps=12, renoise_steps=11, temperature=(0.7, 0.3), cfg=8.0):
    with torch.inference_mode():
        # Start from an image made entirely of random tokens
        sampled = torch.randint(0, model.num_labels, size=latent_shape)
        initial_noise = sampled.clone()
        timesteps = torch.linspace(1.0, 0.0, steps + 1)
        temperatures = torch.linspace(temperature[0], temperature[1], steps)
        for i, t in enumerate(timesteps[:steps]):
            t = torch.ones(latent_shape[0]) * t

            # Predict token logits and apply classifier-free guidance
            logits = model(sampled, t, **model_inputs)
            if cfg:
                logits = logits * cfg + model(sampled, t, **unconditional_inputs) * (1 - cfg)
            # Sample new tokens from the temperature-scaled distribution
            sampled = logits.div(temperatures[i]).softmax(dim=1).permute(0, 2, 3, 1).reshape(-1, logits.size(1))
            sampled = torch.multinomial(sampled, 1)[:, 0].view(logits.size(0), *logits.shape[2:])

            # Re-noise the prediction and feed it back in for the next step
            if i < renoise_steps:
                t_next = torch.ones(latent_shape[0]) * timesteps[i + 1]
                sampled = model.add_noise(sampled, t_next, random_x=initial_noise)[0]
    return sampled
```
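
For orientation, a call to `sample` could look like the sketch below. The shapes, the conditioning keys, and the `vqgan.decode` step are assumptions for illustration only; the actual interface and checkpoints are in the repository linked further down.

```python
import torch

# Illustrative placeholders: in practice the embeddings come from ByT5-XL / CLIP-H,
# and `model` / `vqgan` are loaded from the Paella repository's checkpoints.
batch_size, seq_len, dim = 4, 77, 1024
text_embeddings = torch.randn(batch_size, seq_len, dim)

latent_shape = (batch_size, 64, 64)                     # f4 VQGAN on 256x256 images -> 64x64 tokens
model_inputs = {"byt5": text_embeddings}                # assumed conditioning key
unconditional_inputs = {"byt5": torch.zeros_like(text_embeddings)}

tokens = sample(model_inputs, latent_shape, unconditional_inputs, steps=12, cfg=8.0)
# images = vqgan.decode(tokens)                         # map tokens back to pixel space
```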

### Results
<img src="https://user-images.githubusercontent.com/61938694/231598512-2410c172-5a9d-43f4-947c-6ff7eaee77e7.png">
Since Paella is also conditioned on CLIP image embeddings, the following things are also possible:<br><br>
<img src="https://user-images.githubusercontent.com/61938694/231278319-16551a8d-bfd1-49c9-b604-c6da3955a6d4.png">
<img src="https://user-images.githubusercontent.com/61938694/231287637-acd0b9b2-90c7-4518-9b9e-d7edefc6c3af.png">
<img src="https://user-images.githubusercontent.com/61938694/231287119-42fe496b-e737-4dc5-8e53-613bdba149da.png">

### Technical Details
Model Architecture: U-Net (Mix of....) <br>
Dataset: Laion-A, Laion Aesthetic > 6.0 <br>
Training Steps: 1.3M <br>
Batch Size: 2048 <br>
Resolution: 256 <br>
VQGAN Compression: f4 <br>
Condition: ByT5-XL (95%), CLIP-H Image Embedding (10%), CLIP-H Text Embedding (10%) <br>
Optimizer: AdamW <br>
Hardware: 128 A100 @ 80GB <br>
Training Time: ~3 weeks <br>
Learning Rate: 1e-4 <br>
More details on the approach, training, and sampling can be found in the paper and on GitHub.
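
The conditioning percentages above suggest that each conditioning signal is only kept for a fraction of the training samples, which is also what enables classifier-free guidance at sampling time. A minimal sketch of how such conditioning dropout could be implemented follows; the function name and the keep probabilities being applied this way are assumptions, not the repository's exact code.

```python
import torch

def drop_condition(cond, keep_prob):
    # Zero out the conditioning for a random subset of the batch,
    # keeping it with probability `keep_prob` per sample (illustrative only).
    keep = (torch.rand(cond.size(0), device=cond.device) < keep_prob).float()
    return cond * keep.view(-1, *([1] * (cond.dim() - 1)))

# byt5_emb  = drop_condition(byt5_emb,  keep_prob=0.95)
# clip_img  = drop_condition(clip_img,  keep_prob=0.10)
# clip_text = drop_condition(clip_text, keep_prob=0.10)
```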

### Paper, Code Release
Paper: https://arxiv.org/abs/2211.07292 <br>
Code: https://github.com/dome272/Paella <br>

### Goal
So you see, there are no heavy math formulas or theorems needed to achieve good sampling quality. Moreover,
no constants such as alpha, beta, alpha_cum_prod etc. are necessary, as they are in diffusion models. This makes the
method really suitable for people new to the field of generative AI. We hope we can lay the foundation for further
research in this direction and contribute to a world where AI is accessible and can be understood by everyone.

### Limitations & Conclusion
There are still many things to improve for Paella to get on par with standard diffusion models or even outperform
them. One primary thing we noticed is that even though we only condition the model on CLIP image embeddings 10% of the
time, during inference the model relies heavily on the image embeddings generated by a prior model (mapping CLIP text
embeddings to image embeddings, as proposed in DALL-E 2). We counteract this by decreasing the importance of the image
embeddings, reweighing their attention scores as sketched below. There is probably a way to avoid this already during training.
Other limitations, such as lack of composition, poor text depiction, unawareness of concepts etc., could also be reduced by
continuing the training for longer. As a reference, Paella has only seen as many images as SD 1.4.
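
The exact reweighing scheme is not spelled out here; the following is only a rough sketch of one way such a down-weighting could look, where the scaling factor and the token layout are assumptions for illustration.

```python
import torch

def reweigh_image_embedding_attention(attn_weights, image_token_mask, factor=0.5):
    # attn_weights: (B, heads, queries, keys) post-softmax cross-attention weights
    # over the conditioning tokens.
    # image_token_mask: (keys,) boolean mask marking the CLIP image-embedding tokens.
    weights = attn_weights.clone()
    weights[..., image_token_mask] *= factor            # down-weight image-embedding tokens
    return weights / weights.sum(dim=-1, keepdim=True)  # renormalize so rows sum to 1
```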

To conclude, this is still a work in progress, but this model already works a million times better than the first
versions we trained months ago. We hope that more people become interested in this approach, since we believe it has
a lot of potential to become much better than this and to give newcomers an easy-to-understand introduction
to the field of generative AI.