2.1.3.3 Transformer Network

A Transformer [473] is a sequence-to-sequence model that uses an attention
mechanism to capture long-range dependencies in the input sequence while
benefiting from increased parallelization. In its original form, it consists of
an encoder and a decoder, each composed of multiple layers that combine
self-attention with feed-forward sublayers.
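
To make this layout concrete, the following is a minimal sketch using PyTorch's built-in nn.Transformer module; the embedding size, head count, layer counts, and tensor shapes are illustrative choices (512-dimensional embeddings, 8 heads, 6 encoder and 6 decoder layers, as in the original base configuration) rather than values prescribed by the text above.

import torch
import torch.nn as nn

# Minimal encoder-decoder sketch; sizes are illustrative.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(2, 10, 512)  # (batch, source length, embedding dim)
tgt = torch.randn(2, 7, 512)   # (batch, target length, embedding dim)
out = model(src, tgt)          # (2, 7, 512): one output vector per target token

Note that nn.Transformer expects already-embedded inputs and does not add position embeddings, which are discussed at the end of this section.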
Attention is a mechanism that allows for soft selection of relevant information
from a set of candidates, e.g., tokens in a document, based on a query, e.g.,
a token in the document. The scaled dot-product attention is defined for a
sequence of length $n$ as $\mathrm{Att}(Q, K, V) = \sum_{i=1}^{n} \alpha_i V_i$.
It utilizes three learnable weight matrices, each multiplied with all token
embeddings in a sequence to build queries $Q \in \mathbb{R}^{n \times d_q}$,
keys $K \in \mathbb{R}^{n \times d_q}$, and values $V \in \mathbb{R}^{n \times d_v}$.
The output of the attention mechanism is a weighted sum of the value vectors,
where the attention weight of the $i$-th key is computed by normalizing the dot
product between the query and key vectors,
$\alpha_i = \frac{\exp(Q_i^{\top} K_i)}{\sum_{j=1}^{n} \exp(Q_j^{\top} K_j)}$. For
training stability, the dot product is typically scaled by the square root of the
dimensionality of the query and key vectors. This is followed by a feed-forward
layer to capture non-linear relationships between the tokens in the sequence.
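
As a concrete illustration, the following NumPy sketch implements scaled dot-product attention in its common matrix form, where one weighted sum of value vectors is produced per query position; the dimensions and random inputs are purely illustrative.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Matrix form: one weighted sum of value vectors per query position."""
    d_q = Q.shape[-1]
    # Dot products between every query and every key, scaled for stability.
    scores = Q @ K.T / np.sqrt(d_q)                   # (n, n)
    # Softmax over the keys turns the scores into attention weights alpha.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ V                                # (n, d_v)

# Illustrative example: a sequence of n = 4 tokens with d_q = d_v = 8.
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)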
There exist different forms of attention, depending on the type of relationship
that is captured. Self-attention computes the attention of each token w.r.t.
all other tokens in the same sequence, so that the representation of each token
is updated based on its surrounding tokens. Multi-head attention runs a set
of h attention layers in parallel, which every Transformer uses to concurrently
capture different types of relationships; their outputs are concatenated after
the parallelized processing. Cross-attention computes the attention of each
token in one sequence w.r.t. all tokens in another sequence, and is used in
encoder-decoder Transformer architectures for tasks such as summarization and
machine translation. Specific to decoder layers, masked attention prevents the
decoder from attending to future tokens in the sequence by masking the upper
triangle of the attention matrix.
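
The masked variant can be illustrated with a small NumPy sketch that sets the upper triangle of the score matrix to negative infinity before normalization, so position i cannot attend to positions j > i; the cross-attention case would differ only in that Q is built from one sequence while K and V come from another. All sizes here are illustrative assumptions.

import numpy as np

def masked_attention(Q, K, V):
    """Causal attention: each position attends only to itself and the past."""
    n, d_q = Q.shape
    scores = Q @ K.T / np.sqrt(d_q)                        # (n, n)
    # Mask the upper triangle (future positions) with -inf before the softmax.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
Q = K = V = rng.normal(size=(5, 8))   # self-attention: Q, K, V share one source
out = masked_attention(Q, K, V)       # row i ignores tokens i+1, ..., n-1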
A major downside of Transformers is the quadratic complexity of the attention
mechanism (Figure 2.3), which makes them computationally inefficient for long
sequences. This has been addressed by a wealth of techniques [120], such as
sparsifying attention, introducing recurrence, downsampling, and random or
low-rank approximations.
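
As one illustrative example of sparsifying attention (a sliding-window scheme chosen here purely for illustration, not one singled out by the text), restricting each query to a fixed local window reduces the number of computed scores from quadratic in n to roughly n times the window size:

import numpy as np

def local_window_attention(Q, K, V, window=2):
    """Each query attends only to keys within +/- `window` positions."""
    n, d_q = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d_q)  # at most 2*window + 1 scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ V[lo:hi]
    return out  # O(n * window) scores instead of O(n^2)

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 8))
print(local_window_attention(X, X, X).shape)  # (6, 8)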
Position Embeddings are indispensable for Transformers to process sequences,
as the attention mechanism itself has no notion of the order or position of
tokens in a sequence. The most common type of position embedding is a sinusoidal