Abstract
Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.
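A minimal sketch of the mechanism described above, assuming PyTorch-style tensors (function and variable names are illustrative, not the authors' released code):

```python
import torch
import torch.nn.functional as F

def diff_attn(q1, k1, q2, k2, v, lam):
    """Differential attention: subtract two softmax attention maps before applying V.

    q1, q2, k1, k2: (batch, heads, seq, d); v: (batch, heads, seq, d_v); lam: scalar.
    """
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d ** 0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v  # the subtraction cancels common-mode "noise" scores

# Toy usage with made-up shapes
q1, k1 = torch.randn(1, 4, 16, 32), torch.randn(1, 4, 16, 32)
q2, k2 = torch.randn(1, 4, 16, 32), torch.randn(1, 4, 16, 32)
v = torch.randn(1, 4, 16, 64)
out = diff_attn(q1, k1, q2, k2, v, lam=0.8)   # (1, 4, 16, 64)
```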
Community
We observe that Diff Transformer allocates lower attention scores to attention sinks, i.e., the first few tokens in the sequence.
Specifically, in the language modeling task, Diff Transformer allocates less than 5% of the attention scores to the BOS token, while Transformer allocates about 25%. For the key information retrieval task, please refer to Figure 1 in the paper. We find that models attend to the BOS token more when there is less useful information in the context.
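As a rough illustration of how such a share could be measured (a hypothetical helper, not the authors' evaluation code):

```python
import torch

def bos_attention_share(attn: torch.Tensor) -> float:
    """Average fraction of attention mass that queries place on position 0 (the BOS token).

    attn: (batch, heads, seq, seq) attention map (post-softmax, or post-subtraction
    in the differential case).
    """
    return attn[..., 0].mean().item()
```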
Great stuff. I would love to see comparisons against MöbiusAttention, which learns to forget... but this seems way more computationally efficient.
Thanks for pointing out this paper. We will look into it.
It is a neat approach, but one that comes with a tradeoff, IIUC: doubling the key heads.
I wonder if a different approach without that issue exists. For instance, using max(0, exp(x)-1) instead of exp(x) in the softmax attention formula. That way when the query is orthogonal to the key (or worse), it does not contribute.
In Diff Transformer, we split heads instead of doubling them, so no extra QK projection parameters are introduced. The Q and K heads are split into two groups and computed in pairs, and each pair shares the same V with dimension 2d. With this design, we match the FLOPs and parameter count of Transformer.
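A rough sketch of that head-splitting layout, assuming PyTorch (dimensions are illustrative and this is not the released implementation); the paired halves would then feed the two softmax maps as in the sketch under the abstract:

```python
import torch
import torch.nn as nn

batch, seq, d_model, num_heads = 2, 16, 512, 4
d = d_model // num_heads // 2               # per-half Q/K head dimension (64 here)

x = torch.randn(batch, seq, d_model)
# Same-sized projections as a standard attention layer: no extra QK parameters.
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

# Split Q and K into 2 * num_heads halves of size d and pair them up;
# each pair shares one V head of dimension 2 * d.
q = W_q(x).view(batch, seq, 2 * num_heads, d).transpose(1, 2)   # (2, 8, 16, 64)
k = W_k(x).view(batch, seq, 2 * num_heads, d).transpose(1, 2)
v = W_v(x).view(batch, seq, num_heads, 2 * d).transpose(1, 2)   # (2, 4, 16, 128)
q1, q2 = q[:, 0::2], q[:, 1::2]
k1, k2 = k[:, 0::2], k[:, 1::2]
```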
Using max(0, exp(x)-1) might be an approach that solves the problem. We didn't try it because we believe the properties of exp() are important for learning.
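For reference, the commenter's suggested alternative could be prototyped roughly like this (a sketch only; not something evaluated in the paper):

```python
import torch

def relu_exp_attention(q, k, v, eps=1e-9):
    """Attention with max(0, exp(x) - 1) in place of exp(x) before normalization.

    A query that is orthogonal to (or anti-aligned with) a key contributes zero weight.
    q, k: (..., seq, d); v: (..., seq, d_v).
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-1, -2) / d ** 0.5
    w = torch.clamp(torch.exp(scores) - 1.0, min=0.0)
    w = w / (w.sum(dim=-1, keepdim=True) + eps)   # eps guards an all-zero row
    return w @ v
```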
Great work! Just wondering, do you have any idea why the two learned attention maps tend to cancel noise rather than signal? For instance, if attention 1 learns S + N_1 and attention 2 learns S + N_2 (where S is signal and N_1, N_2 are different noises), then subtracting the two cancels the signal S while the noise becomes N_1 - N_2, which could be even more complicated. Is there any reason why the model would not do this instead?
It's a good question. Our observation is that the model knows what is signal and what is noise. Since attention_1 and attention_2 are both calculated with learnable parameters, they can "perceive" each other during training and adjust to each other to achieve a lower loss. The result is that the model chooses to preserve signal and cancel out noise as long as we give it the chance to do so. A single softmax, by contrast, has difficulty learning the same solution due to its formulation and gradient properties.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FocusLLM: Scaling LLM's Context by Parallel Decoding (2024)
- Selective Attention Improves Transformer (2024)
- Masked Mixers for Language Generation and Retrieval (2024)
- PEAR: Position-Embedding-Agnostic Attention Re-weighting Enhances Retrieval-Augmented Generation with Zero Inference Overhead (2024)
- Mamba Retriever: Utilizing Mamba for Effective and Efficient Dense Retrieval (2024)
Sure. lambda is multiplied with the softmax output, where softmax = exp(qk) / Σ exp(qk). The parameters in lambda learn at the same rate as the other parameters in the model, so lambda should take a similar form to softmax. That's why lambda = exp(lambda_q * lambda_k) + lambda_init. Moreover, to enable lambda to learn values smaller than lambda_init, we add a second term, i.e., lambda = exp(lambda_q1 * lambda_k1) - exp(lambda_q2 * lambda_k2) + lambda_init.
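In code, that reparameterization might look roughly like the following (a sketch assuming PyTorch; initialization and shapes are illustrative, and the paper additionally schedules lambda_init by layer depth):

```python
import torch
import torch.nn as nn

class DiffLambda(nn.Module):
    """lambda = exp(lambda_q1 * lambda_k1) - exp(lambda_q2 * lambda_k2) + lambda_init."""

    def __init__(self, head_dim: int, lambda_init: float = 0.8):
        super().__init__()
        self.lambda_init = lambda_init
        self.lambda_q1 = nn.Parameter(torch.randn(head_dim) * 0.1)
        self.lambda_k1 = nn.Parameter(torch.randn(head_dim) * 0.1)
        self.lambda_q2 = nn.Parameter(torch.randn(head_dim) * 0.1)
        self.lambda_k2 = nn.Parameter(torch.randn(head_dim) * 0.1)

    def forward(self) -> torch.Tensor:
        lam1 = torch.exp(torch.dot(self.lambda_q1, self.lambda_k1))
        lam2 = torch.exp(torch.dot(self.lambda_q2, self.lambda_k2))
        return lam1 - lam2 + self.lambda_init
```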
What kind of hardware was required to train this, and how did the tokens per second output compare with transformers?
There are no particular hardware requirements if you use the naive implementation. If you use flashdiff, refer to the FlashAttention repo (https://github.com/Dao-AILab/flash-attention) for hardware and data-type requirements.
Our speed test was performed on NVIDIA H100 80GB GPUs, and we measured throughput (tokens per second). The same cards and environment were used for both Diff Transformer and Transformer.
The work looks exciting and I really like the motivation coming from noise cancellation!
I have a few questions -
Won't this model let the post-attention weight (softmax(...) - \lambda * softmax(...)) for some value vectors be negative? Is that a design choice? One explanation does come to mind, i.e., wanting to get opposing contributions from some tokens specifically, but I am unsure if this is desired.
This recent work (https://arxiv.org/pdf/2410.01104) shows that attention will disperse given a few conditions (see Lemma 2.1, Page 3). Do you think differential attention is any different? If I understand the proposal correctly, I think it still satisfies Lemma 2.1 with some minor modifications in the proof.
Thanks again for your wonderful work!
Yes, there are some negative values in the post-subtraction weight, and that's what we want. The design can expand the representation space of attention weights, which promotes modeling capability. The model is free to allocate positive or negative values to tokens.
If I understand correctly, Diff can break the property in Lemma 2.1. In the paper, Equation 4 points out that the values of a single softmax output have a positive lower bound, because the input logits can't reach negative infinity. However, by taking the difference of two softmax maps, the output range includes 0, which means the attention weights are not O(1/n) anymore. This breaks Lemma 2.1. Diff can generate 0 as an attention value and assign it to unwanted context if it wants, while leaving almost all of the attention for the key information.
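A tiny numeric illustration of that point (toy logits chosen by hand, not taken from a trained model):

```python
import torch
import torch.nn.functional as F

a1 = F.softmax(torch.tensor([2.0, 0.0, 0.0, 0.0]), dim=-1)
a2 = F.softmax(torch.tensor([0.0, 2.0, 2.0, 2.0]), dim=-1)

print(a1.min())        # > 0: a single softmax is always bounded away from zero
print(a1 - 0.5 * a2)   # the difference can reach zero or go negative, here ~[0.69, -0.06, -0.06, -0.06]
```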
Here's my summary of this paper:
⚡️ Most important breakthrough this month: Differential Transformer vastly improves attention → better retrieval and fewer hallucinations!
Thought that self-attention could not be improved anymore?
Microsoft researchers have dropped a novel "differential attention" mechanism that amplifies focus on relevant context while canceling out noise. It sounds like a free lunch, but it really does seem to vastly improve LLM performance!
Key insights:
- Differential attention computes the difference between two separate softmax attention maps, canceling out noise and promoting sparse attention patterns
- DIFF Transformer outperforms standard Transformers while using 35-40% fewer parameters or training tokens
- Scales well to long contexts up to 64K tokens, leveraging increasing context length more effectively
- Dramatically improves key information retrieval, enhances in-context learning, and possibly reduces the risk of hallucinations
- Reduces activation outliers, potentially enabling lower-bit quantization without a performance drop
- Can be directly implemented using existing FlashAttention kernels (see the sketch below)
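One plausible way to realize that with off-the-shelf fused kernels (a sketch; PyTorch's scaled_dot_product_attention stands in here for the FlashAttention kernels the authors use):

```python
import torch
import torch.nn.functional as F

def diff_attn_fused(q1, k1, q2, k2, v, lam, causal=True):
    # By linearity, (softmax(Q1 K1^T) - lam * softmax(Q2 K2^T)) V equals the
    # difference of two standard attention outputs that share the same V, so
    # two fused-attention calls followed by a subtraction suffice.
    o1 = F.scaled_dot_product_attention(q1, k1, v, is_causal=causal)
    o2 = F.scaled_dot_product_attention(q2, k2, v, is_causal=causal)
    return o1 - lam * o2
```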
This new architecture could lead to much more capable LLMs, with vastly improved strengths in long-context understanding and factual accuracy.
But they didn't release weights on the Hub: let's wait for the community to train the first open-weights DiffTransformer!
This paper was a great read. We wrote a summary blog about this paper and a few others, including:
- TPI LLM
- Differential Transformer
- ARIA
You can find it here. Please give it a read :)
I only have one burning question about this paper: is this architecture compatible with the attention mechanism described in "Selective Attention Improves Transformer"?
Hi, we haven't tried combining them. Diff Transformer and Selective Attention are proposed from different perspectives and solve different problems. I believe they are compatible.
The proposed approach sounds intriguing. Thanks for your work!
Can you provide any intuition and/or theoretical justification on why vanilla softmax attention fails to deal with noisy tokens in a proper way? Where are the weaknesses in its structure that prevent ignoring irrelevant tokens in a sequence and concentrating on the essential ones?
Hi, you can refer to a recent paper "softmax is not enough for sharp out-of-distribution" (https://arxiv.org/abs/2410.01104).
In simple terms: (1) softmax can't produce exactly zero scores, by definition; (2) producing near-zero scores requires a wide range of input logits, which harms backpropagation through the softmax. That's why the model fails to cancel out irrelevant tokens.
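To make point (1) concrete, here is a quick bound (assuming the attention logits x_i lie in [-B, B] for a length-n sequence):

```latex
\operatorname{softmax}(x)_i \;=\; \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}
\;\ge\; \frac{e^{-B}}{n\,e^{B}} \;=\; \frac{e^{-2B}}{n} \;>\; 0
```

So every token keeps at least e^{-2B}/n of the attention mass, and shrinking that floor requires spreading the logits over a wider range, which in turn flattens the softmax gradients.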