m-ric posted an update Oct 14
⚡️ This month's most important breakthrough: Differential Transformer vastly improves attention ⇒ better retrieval and fewer hallucinations!

Thought that self-attention could not be improved anymore?

Microsoft researchers have dropped a novel "differential attention" mechanism that amplifies focus on relevant context while canceling out noise. It sounds like a free lunch, but it really does seem to vastly improve LLM performance!

Key insights:

🧠 Differential attention computes the difference between two separate softmax attention maps, canceling out noise and promoting sparse attention patterns (see the sketch after this list)

🔥 DIFF Transformer matches standard Transformer performance with 35-40% fewer parameters or training tokens, and outperforms it at equal scale

📏 Scales well to long contexts up to 64K tokens, leveraging increasing context length more effectively

🔎 Dramatically improves key information retrieval, enhances in-context learning, and may reduce the risk of hallucinations 🤯

🔢 Reduces activation outliers, potentially enabling lower-bit quantization without a performance drop!

⚙️ Can be directly implemented using existing FlashAttention kernels
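
To make the core idea concrete, here is a minimal PyTorch sketch of a single differential-attention head, assuming the formulation DiffAttn(X) = (softmax(Q1·K1ᵀ/√d) - λ·softmax(Q2·K2ᵀ/√d))·V described in the paper. Everything below (function name, shapes, the fixed `lam` value) is illustrative: the actual DIFF Transformer splits multi-head dimensions differently, learns λ through a re-parameterization, and normalizes each head's output, none of which is shown here.

```python
# Minimal sketch of one differential-attention head (no causal mask,
# no head-output normalization, fixed lambda). Illustrative only.
import torch
import torch.nn.functional as F

def diff_attention(x, Wq, Wk, Wv, lam):
    """
    x:      (batch, seq_len, d_model)
    Wq, Wk: (d_model, 2 * d) -- projections split into two query/key groups
    Wv:     (d_model, 2 * d)
    lam:    weight on the second attention map (a learnable, re-parameterized
            quantity in the paper; a fixed constant here for simplicity)
    """
    q1, q2 = (x @ Wq).chunk(2, dim=-1)  # two query groups, each of width d
    k1, k2 = (x @ Wk).chunk(2, dim=-1)  # two key groups, each of width d
    v = x @ Wv
    d = q1.shape[-1]

    # Two ordinary softmax attention maps over the same values.
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d**0.5, dim=-1)

    # Differential attention: the subtraction cancels attention mass that both
    # maps place on irrelevant tokens, sharpening focus on the relevant context.
    return (a1 - lam * a2) @ v

# Tiny smoke test with random weights (hypothetical sizes).
torch.manual_seed(0)
d_model, d = 64, 32
x = torch.randn(2, 10, d_model)
Wq, Wk, Wv = (torch.randn(d_model, 2 * d) / d_model**0.5 for _ in range(3))
print(diff_attention(x, Wq, Wk, Wv, lam=0.8).shape)  # torch.Size([2, 10, 64])
```

Since each of the two maps is just standard softmax attention, the subtraction can sit on top of two calls to an existing attention kernel, which is why the authors note it can be implemented directly with FlashAttention.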

This new architecture could lead to much more capable LLMs, with vastly improved long-context understanding and factual accuracy.

But they didn't release weights on the Hub: let's wait for the community to train the first open-weights DiffTransformer! 🚀

Read their paper 👉 Differential Transformer (2410.05258)