Thrilled to introduce Adam-mini, an optimizer that achieves on-par or better performance than AdamW with a 45% to 50% smaller memory footprint. Adam-mini can also achieve 49.5% higher throughput than AdamW on Llama2-7B pre-training.
The design of Adam-mini is inspired by certain Hessian structures we observed in Transformers.
Feel free to try it out! Switch to Adam-mini with the same hyperparameters as AdamW, and it should work with only half the optimizer memory. Hope Adam-mini can help save time, cost, and energy in your tasks!
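Below is a minimal sketch of the intended drop-in swap. It assumes the optimizer is exposed as `Adam_mini` from an `adam_mini` package with an AdamW-like constructor; the exact import path and any extra constructor arguments may differ in the actual release, so treat the names here as illustrative.

```python
# Sketch only: `adam_mini` / `Adam_mini` are assumed names, and the real
# constructor may require additional model-specific arguments.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # stand-in for your Transformer model

# Before: standard AdamW
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.95), weight_decay=0.1
)

# After: Adam-mini, reusing the exact same hyperparameters (assumed interface)
from adam_mini import Adam_mini  # hypothetical import path
optimizer = Adam_mini(
    model.parameters(), lr=1e-3, betas=(0.9, 0.95), weight_decay=0.1
)

# The training loop itself is unchanged.
for _ in range(10):
    loss = model(torch.randn(4, 128)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point of the sketch is that no retuning is expected: you keep the learning rate, betas, and weight decay you already use for AdamW and only swap the optimizer class.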