⚡️ This month's most important breakthrough: Differential Transformer vastly improves attention ⇒ better retrieval and fewer hallucinations!
Thought that self-attention could not be improved anymore?
Microsoft researchers have dropped a novel "differential attention" mechanism that amplifies focus on relevant context while canceling out noise. It sounds like a free lunch, but it really does seem to vastly improve LLM performance!
Key insights:
🧠 Differential attention computes the difference between two separate softmax attention maps, canceling out noise and promoting sparse attention patterns (a minimal sketch follows this list)
🔥 DIFF Transformer matches standard Transformer performance with roughly 35-40% fewer parameters or training tokens, and outperforms it at equal size
Scales well to long contexts up to 64K tokens, leveraging increasing context length more effectively
Dramatically improves key information retrieval, enhancing in-context learning and possibly reducing the risk of hallucinations 🤯
🔢 Reduces activation outliers, potentially enabling lower-bit quantization without performance drop!
⚙️ Can be directly implemented using existing FlashAttention kernels
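To make the first point concrete, here is a minimal PyTorch sketch of differential attention under simplifying assumptions: single head, no causal mask, no RoPE or per-head GroupNorm, and a plain learnable λ instead of the paper's reparameterization. The class and parameter names (and the 0.8 initialization) are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

class DiffAttention(torch.nn.Module):
    """Single-head sketch: out = (softmax(Q1·K1ᵀ/√d) − λ·softmax(Q2·K2ᵀ/√d)) · V."""

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        # Two query/key projections produce the two attention maps; values are shared.
        self.w_q = torch.nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_k = torch.nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_v = torch.nn.Linear(d_model, d_head, bias=False)
        self.w_o = torch.nn.Linear(d_head, d_model, bias=False)
        # The paper learns lambda via a reparameterization; a plain scalar is used here.
        self.lmbda = torch.nn.Parameter(torch.tensor(lambda_init))
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); causal masking omitted for brevity.
        q1, q2 = self.w_q(x).chunk(2, dim=-1)
        k1, k2 = self.w_k(x).chunk(2, dim=-1)
        v = self.w_v(x)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        # Subtracting the second map cancels attention mass that both maps put on
        # irrelevant context, sharpening focus on the relevant tokens.
        return self.w_o((a1 - self.lmbda * a2) @ v)

attn = DiffAttention(d_model=512, d_head=64)
y = attn(torch.randn(2, 128, 512))  # -> shape (2, 128, 512)
```

Since each of the two maps is an ordinary softmax attention, each can be computed with existing FlashAttention kernels (two calls instead of one), which is what the last bullet refers to.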
This new architecture could lead to much more capable LLMs, with vastly improved long-context understanding and factual accuracy.
But they didn't release weights on the Hub: let's wait for the community to train the first open-weights DiffTransformer!
Read their paper: Differential Transformer (2410.05258)