File size: 2,217 Bytes
0b3f637
 
4e2f143
 
0b3f637
4e2f143
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
---
license: creativeml-openrail-m
library_name: diffusers
pipeline_tag: text-to-image
---
# SD 1.5 Big G (alpha)

This is a Stable Diffusion 1.5 model, but it uses the [CLIP Big G](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k) text encoder instead of the original [CLIP-L](https://huggingface.co/openai/clip-vit-large-patch14) text encoder. 
This is just a knowledge transfer pre-train with the goal of preserving the current knowledge of the model. 
It was only trained using student/teacher training from my [SD 1.5 fine tune, Objective Reality v2](https://huggingface.co/ostris/objective-reality). 
To fully realize the full potential of the much larger text encoder, it would need to be further fine tuned on a large dataset.

# Examples

Coming soon

# Usage

For diffusers, you can use it like any other stable diffusion model. 

```python
from diffusers import StableDiffusionPipeline
import torch

model_id = "ostris/sd15-big-g-alpha"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]  
    
image.save("astronaut_rides_horse.png")
```

It will not work out of the box with Comfy UI or Auto1111. There would need to be special code to load it. If there is any interest in this model, I may work on compatibility.
Overall, it won't be hard to add. The only architecture change is the text encoder the and cross attention weights.

# Alpha

This is just a pretrained alpha. There are some concepts that did not seem to transfer. It really needs proper training on a large dataset. Anyone is welcome to take this task on. I do not plan to at the time.

# Why make this?

In the words of George Mallory, "Because it's there"

# Training Method

As mentioned above, it was trained using student/teacher only. This was an iterative process over the corse of a few months, and I did not keep track of all of the exact numbers. The following are best estimates. 

The cross attention layers were trained for 1-2 million steps with a batch size of 8 on a single 4090 GPU. Then the full unet was trained for around 100k steps with the same settings.