⚡️ This month's most important breakthrough: Differential Transformer vastly improves attention ⇒ better retrieval and fewer hallucinations!
Thought that self-attention could not be improved anymore?
Microsoft researchers have dropped a novel "differential attention" mechanism that amplifies focus on relevant context while canceling out noise. It sounds like a free lunch, but it really does seem to vastly improve LLM performance!
Key insights:
🧠 Differential attention computes the difference between two separate softmax attention maps, canceling out noise and promoting sparse attention patterns (a minimal sketch follows this list)
🔥 DIFF Transformer matches standard Transformer performance with roughly 35-40% fewer parameters or training tokens, and outperforms it at equal size
Scales well to long contexts up to 64K tokens, leveraging increasing context length more effectively
Dramatically improves key information retrieval, enhancing in-context learning and possibly reducing the risk of hallucinations 🤯
🔢 Reduces activation outliers, potentially enabling lower-bit quantization without performance drop!
⚙️ Can be directly implemented using existing FlashAttention kernels
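To make the first point concrete, here is a minimal PyTorch sketch of differential attention under simplifying assumptions: single head, no causal mask, no RoPE or per-head GroupNorm, and a plain learnable λ instead of the paper's reparameterization. The class and parameter names (and the 0.8 initialization) are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

class DiffAttention(torch.nn.Module):
    """Single-head sketch: out = (softmax(Q1·K1ᵀ/√d) − λ·softmax(Q2·K2ᵀ/√d)) · V."""

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        # Two query/key projections produce the two attention maps; values are shared.
        self.w_q = torch.nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_k = torch.nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_v = torch.nn.Linear(d_model, d_head, bias=False)
        self.w_o = torch.nn.Linear(d_head, d_model, bias=False)
        # The paper learns lambda via a reparameterization; a plain scalar is used here.
        self.lmbda = torch.nn.Parameter(torch.tensor(lambda_init))
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); causal masking omitted for brevity.
        q1, q2 = self.w_q(x).chunk(2, dim=-1)
        k1, k2 = self.w_k(x).chunk(2, dim=-1)
        v = self.w_v(x)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        # Subtracting the second map cancels attention mass that both maps put on
        # irrelevant context, sharpening focus on the relevant tokens.
        return self.w_o((a1 - self.lmbda * a2) @ v)

attn = DiffAttention(d_model=512, d_head=64)
y = attn(torch.randn(2, 128, 512))  # -> shape (2, 128, 512)
```

Since each of the two maps is an ordinary softmax attention, each can be computed with existing FlashAttention kernels (two calls instead of one), which is what the last bullet refers to.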
This new architecture could lead to much more capable LLMs, with vastly improved long-context understanding and factual accuracy.
But they didn't release weights on the Hub: let's wait for the community to train the first open-weights DiffTransformer!
Read their paper: Differential Transformer (2410.05258)