Overview
These are latent diffusion transformer models trained from scratch on 100k 256x256 images. The checkpoint 278k-full_state_dict.pth was trained for about 500 epochs and is substantially overfit on the 100k training images.
The two checkpoints at 300k and 395k steps were trained further on a Midjourney dataset of 600k images, for 9.4 epochs (300k steps) and 50 epochs (395k steps) respectively, at a constant learning rate of 5e-5. This additional training on the Midjourney dataset took ~8 hours on an RTX 4090 with batch size 256.
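As a quick sanity check, the epoch counts above are consistent with the step counts, assuming the Midjourney fine-tuning resumed from the 278k-step checkpoint (an assumption based on the checkpoint names, not stated explicitly):

```python
# Epochs of Midjourney fine-tuning implied by the step counts,
# assuming fine-tuning resumed from the 278k-step checkpoint.
dataset_size = 600_000  # Midjourney images
batch_size = 256
start_step = 278_000    # assumed starting checkpoint

for total_step in (300_000, 395_000):
    extra_steps = total_step - start_step
    epochs = extra_steps * batch_size / dataset_size
    print(f"{total_step} steps -> {epochs:.1f} epochs")
# 300_000 steps -> 9.4 epochs; 395_000 steps -> 49.9 (~50) epochs
```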
The models are the same as those in the Google Colabs below: embed_dim=512, n_layers=8, and 30,507,328 total parameters (~30M).
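The ~30M figure is consistent with a back-of-the-envelope estimate for a standard transformer trunk at this width and depth; the exact layer breakdown is not given in this card, so the remainder is presumably embeddings, patch projections, and norms:

```python
# Rough parameter estimate for the transformer trunk, using the common
# ~12 * d^2 weights-per-layer approximation (4*d^2 for attention
# projections + 8*d^2 for an MLP with a 4x hidden expansion).
embed_dim = 512
n_layers = 8

per_layer = 12 * embed_dim ** 2   # attention + MLP weight matrices
trunk = n_layers * per_layer
print(f"trunk params ~ {trunk:,}")  # ~25.2M of the reported 30,507,328
```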
Run the Models in Colab
https://colab.research.google.com/drive/10yORcKXT40DLvZSceOJ1Hi5z_p5r-bOs?usp=sharing
Colab Training Notebook
https://colab.research.google.com/drive/1sKk0usxEF4bmdCDcNQJQNMt4l9qBOeAM?usp=sharing
Github Repo
This repo contains the original training code: https://github.com/apapiu/transformer_latent_diffusion
Datasets used
https://huggingface.co/apapiu/small_ldt/tree/main
Other
See this Reddit post by u/spring_m (huggingface.co/apapiu) for more details: https://www.reddit.com/r/MachineLearning/comments/198eiv1/p_small_latent_diffusion_transformer_from_scratch/