2.1.3.3 Transformer Network
A Transformer [473] is a sequence-to-sequence model that uses an attention mechanism to capture long-range dependencies in the input sequence, benefiting from increased parallelization. Traditionally, it consists of an encoder and a decoder, each composed of multiple self-attention and feed-forward layers.
Attention is a mechanism that allows for soft selection of relevant information from a set of candidates, e.g., tokens in a document, based on a query, e.g., a token in the document. The scaled dot-product attention is defined for a sequence of length $n$ as $\mathrm{Att}(Q, K, V) = \sum_{i=1}^{n} \alpha_i V_i$. It utilizes three learnable weight matrices, each multiplied with all token embeddings in a sequence to build queries $Q \in \mathbb{R}^{n \times d_q}$, keys $K \in \mathbb{R}^{n \times d_q}$, and values $V \in \mathbb{R}^{n \times d_v}$. The output of the attention mechanism is a weighted sum of the values, where the attention weight of the $i$-th key is computed by normalizing the dot product between the query and key vectors: $\alpha_i = \frac{\exp(Q_i^{\top} K_i)}{\sum_{j=1}^{n} \exp(Q_j^{\top} K_j)}$. For training stability, the dot product is typically scaled by the square root of the dimensionality of the query and key vectors. This is followed by a feed-forward layer to capture non-linear relationships between the tokens in the sequence.
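To make the computation concrete, the following is a minimal NumPy sketch of scaled dot-product self-attention. It uses the standard matrix form, in which each query position receives its own weight distribution over all keys; the projection matrices W_q, W_k, W_v and the random token embeddings are illustrative stand-ins for learned parameters, not part of the original text.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention for one sequence.

    Q: (n, d_q) queries, K: (n, d_q) keys, V: (n, d_v) values.
    Returns an (n, d_v) matrix of attended values.
    """
    d_q = Q.shape[-1]
    # Pairwise query-key dot products, scaled by sqrt(d_q) for stability.
    scores = Q @ K.T / np.sqrt(d_q)                      # (n, n)
    # Row-wise softmax turns scores into normalized attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the values.
    return weights @ V                                   # (n, d_v)

# Toy setup: n tokens with d-dimensional embeddings and random projections.
rng = np.random.default_rng(0)
n, d, d_q, d_v = 5, 16, 8, 8
X = rng.normal(size=(n, d))                              # token embeddings
W_q = rng.normal(size=(d, d_q))                          # learned in practice
W_k = rng.normal(size=(d, d_q))
W_v = rng.normal(size=(d, d_v))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (5, 8)
```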
There exist different forms of attention, depending on the type of relationship that is captured. Self-attention computes the attention of each token w.r.t. all other tokens in the same sequence, which changes the representation of each token based on the other tokens in the sequence. Multi-head attention runs a set of $h$ attention layers in parallel, which every Transformer uses to concurrently capture different types of relationships; the outputs of the heads are concatenated after the parallelized processing. Cross-attention computes the attention of each token in one sequence w.r.t. all tokens in another sequence, which is used in encoder-decoder Transformer architectures for tasks such as summarization and machine translation. Specific to decoder layers, masked attention prevents the decoder from attending to future tokens in the sequence by masking the upper triangle of the attention matrix, as sketched below.
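The sketch below extends the earlier NumPy function with a causal mask: the upper triangle of the score matrix is set to negative infinity before the softmax, so each position attends only to itself and to earlier tokens. The toy inputs are illustrative assumptions, not from the original text.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: position i attends only to positions j <= i."""
    n, d_q = Q.shape
    scores = Q @ K.T / np.sqrt(d_q)                      # (n, n)
    # Mask the upper triangle (future tokens) with -inf so their
    # softmax weights become exactly zero.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(causal_attention(Q, K, V).shape)  # (4, 8)
```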
A major downside of Transformers is the quadratic complexity of the attention mechanism (Figure 2.3), which makes them computationally inefficient for long sequences. This has been addressed by a wealth of techniques [120], such as sparsifying attention, introducing recurrence, downsampling, and random or low-rank approximations.
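As one hedged illustration of the sparsification idea, the snippet below builds a sliding-window mask that lets each token attend only to its $w$ nearest neighbours, shrinking the number of nonzero attention weights from $n^2$ to roughly $n(2w + 1)$. The window size and mask construction are generic, not a specific method from [120].

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask where token i may attend only to tokens j with |i - j| <= w.

    This dense version still allocates the (n, n) matrix; practical
    sparse-attention kernels avoid materializing it altogether.
    """
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w      # (n, n) bool

mask = sliding_window_mask(n=8, w=2)
print(mask.sum(), "of", mask.size, "entries attended")   # 34 of 64
```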
Position Embeddings are indispensable for Transformers to be able to process sequences, as they do not have any notion of order or position of tokens in a sequence. The most common type of position embedding is a sinusoidal