Post
1486
nanoGPT with Sigmoid Self-Attention
I couldn’t resist, so I had to give it a try :)
Some observations from training on an M2:
Compared to softmax attention, SSA trained ~5-10% faster and used less memory, with similar final loss values, but produced slightly less coherent text and marginally higher perplexity.
Code: https://github.com/Jaykef/ai-algorithms/blob/main/sigmoid_attn.ipynb
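For context, here is a minimal sketch of the core idea: replace the row-wise softmax in scaled dot-product attention with an elementwise sigmoid plus a -log(n) bias (as proposed in Apple's sigmoid-attention paper). This is an illustrative standalone function, not the exact code in the notebook; names and the causal flag are my own.

```python
import math
import torch


def sigmoid_attention(q, k, v, causal=True):
    # q, k, v: (batch, n_heads, seq_len, head_dim)
    n, d = q.size(-2), q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d)
    if causal:
        # mask out future positions; sigmoid(-inf) = 0
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    # elementwise sigmoid with a -log(n) bias instead of softmax;
    # attention rows no longer sum to 1
    attn = torch.sigmoid(scores - math.log(n))
    return attn @ v
```

Because sigmoid is applied per element, there is no row-wise normalization, which is where the speed and memory savings over softmax come from.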