Increasing Speed
Integrate FlashAttention 2.0 CUDA kernels, significant speed up
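A minimal sketch of calling the FlashAttention 2 kernel directly, assuming the `flash-attn` package is installed and inputs are fp16/bf16 on a supported GPU; the tensor shapes and causal flag are illustrative assumptions, not the repo's actual attention module.

```python
# Sketch: fused attention via flash-attn 2.x (assumes `pip install flash-attn`).
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64

# FlashAttention requires fp16/bf16 tensors on a CUDA device.
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
k = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
v = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)

# causal=True for decoder-style (autoregressive) models.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)  # (batch, seqlen, nheads, headdim)
```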
Utilize the 8-bit optimizer from bitsandbytes (bnb), big speed up; weakness => bnb isn't compatible with all GPUs
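A sketch of swapping in the bitsandbytes 8-bit AdamW, which keeps optimizer states in 8-bit and frees memory that can go toward larger batches; assumes a bnb-supported CUDA GPU, and the learning rate and betas below are placeholders, not tuned values.

```python
# Sketch: 8-bit optimizer states via bitsandbytes (assumes `pip install bitsandbytes`).
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(512, 512).cuda()  # stand-in for the real model

optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-4,              # placeholder hyperparameters
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
```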
Use a better tokenizer, e.g. TokenMonster?
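A quick sketch of what tokenizing with TokenMonster looks like, assuming `pip install tokenmonster`; the vocabulary name below is an example prebuilt vocab, not necessarily the one we'd pick.

```python
# Sketch: TokenMonster tokenize/decode round trip.
import tokenmonster

vocab = tokenmonster.load("english-32000-balanced-v1")  # loads/downloads a prebuilt vocab

tokens = vocab.tokenize("The quick brown fox jumps over the lazy dog.")
print(len(tokens), tokens)   # typically fewer tokens per character than standard BPE vocabs
print(vocab.decode(tokens))  # round-trips back to the original text
```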
Parallelize the attention and MLP branches within each transformer block, similar to PaLM's parallel formulation
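A sketch of a PaLM-style "parallel" block: attention and MLP both read the same pre-norm input and their outputs are summed, so the two branches can run concurrently. Module names and sizes are illustrative, not the repo's actual layers.

```python
# Sketch: PaLM-style parallel transformer block.
import torch
import torch.nn as nn


class ParallelBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)  # single shared pre-norm
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # Serial:   x = x + attn(norm(x)); x = x + mlp(norm(x))
        # Parallel: x = x + attn(norm(x)) + mlp(norm(x))
        return x + attn_out + self.mlp(h)
```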
Look into MPT's config for the Lion optimizer for pretraining; did they use a high batch size?
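For reference, a sketch of plugging in Lion via the `lion-pytorch` package (one available implementation; MPT used its own decoupled LionW variant). The learning rate and weight decay below are placeholders following the usual rule of thumb of a smaller lr and larger weight decay than AdamW, not MPT's actual config.

```python
# Sketch: Lion optimizer via lion-pytorch (assumes `pip install lion-pytorch`).
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(512, 512)  # stand-in for the real model

optimizer = Lion(model.parameters(), lr=3e-5, weight_decay=0.3, betas=(0.9, 0.95))
```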