Figure 2.3. Illustration of the main attention mechanisms in a Transformer (panel label: "Quadratic complexity").

embedding with a fixed frequency and phase, f(x) = sin(ωx + φ), where the frequency ω and the phase φ are learned as part of the training process and are typically shared across all tokens in the sequence. Position information can be integrated into Transformers in several different ways; [105, Table 1] gives an overview.
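As a concrete illustration of this formulation, the following minimal NumPy sketch computes such position embeddings; the per-dimension frequencies ω and phases φ are randomly initialised here purely for illustration (in the learned variant described above they would be trainable parameters), and the function name and shapes are our own, not taken from any cited work.

    import numpy as np

    def sinusoidal_position_embedding(num_positions, dim, seed=0):
        # Illustrative sketch: one frequency ω and one phase φ per embedding
        # dimension, shared across all token positions, as described above.
        rng = np.random.default_rng(seed)
        omega = rng.uniform(1e-4, 1.0, size=dim)       # frequencies ω
        phi = rng.uniform(0.0, 2 * np.pi, size=dim)    # phases φ
        positions = np.arange(num_positions)[:, None]  # token positions x
        return np.sin(omega[None, :] * positions + phi[None, :])

    pos_emb = sinusoidal_position_embedding(num_positions=128, dim=64)
    print(pos_emb.shape)  # (128, 64): one embedding vector per position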
Transformers have gradually taken over as an end-to-end architecture for both
NLP and CV tasks, although adoption in CV has been slower due to the lack of
spatial inductive biases (such as translation invariance) in the original
Transformer architecture. This has been addressed by recent works such as the
Vision Transformer (ViT) [101], which uses a patch-based input representation
combined with position embeddings.
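A minimal PyTorch sketch of such a patch-based input layer is shown below; the 224×224 input size, 16×16 patches, and 768-dimensional embeddings are common ViT choices assumed here for illustration rather than taken from [101], and the class name is ours.

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        # Split the image into fixed-size patches, project each patch linearly,
        # and add a learned position embedding per patch.
        def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
            super().__init__()
            num_patches = (image_size // patch_size) ** 2
            # A strided convolution extracts and projects the patches in one step.
            self.proj = nn.Conv2d(in_channels, dim,
                                  kernel_size=patch_size, stride=patch_size)
            self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, dim))

        def forward(self, images):            # images: (batch, 3, H, W)
            x = self.proj(images)             # (batch, dim, H/ps, W/ps)
            x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, dim)
            return x + self.pos_emb           # add position information

    tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
    print(tokens.shape)  # torch.Size([2, 196, 768])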
A large language model (LLM) consists of a stack of Transformer layers that is
pretrained on a large corpus of text, typically with a self-supervised learning
objective such as predicting the next token in a sequence. The goal of LLMs
is to learn a general-purpose language representation that can be fine-tuned
to perform well on a wide range of downstream tasks. LLMs have disrupted
NLP in recent years, achieving state-of-the-art (SOTA) performance across
many tasks thanks to pretraining on large amounts of data. Among the most
popular LLMs are BERT [95], RoBERTa [287], ELECTRA [73], T5 [383],
GPT-3 [52], Llama-2 [452], and Mistral [199]. Besides the challenges specific to
modeling document inputs, explained in Section 2.3.4, open challenges for
LLMs include: (i) structured output generation, (ii) domain-specific knowledge
injection (e.g., does retrieval-augmented generation (RAG) suffice? [253, 347]),
and (iii) multimodality.
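The next-token prediction objective mentioned above can be sketched in a few lines of PyTorch; the function below is a simplified illustration with toy tensor shapes and random inputs, not the training code of any of the cited models.

    import torch
    import torch.nn.functional as F

    def next_token_loss(logits, token_ids):
        # At every position the model predicts the token that follows,
        # so the targets are the input ids shifted left by one position.
        pred = logits[:, :-1, :]    # predictions for positions 0 .. n-2
        target = token_ids[:, 1:]   # observed tokens at positions 1 .. n-1
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                               target.reshape(-1))

    # Toy usage with random "model" outputs instead of a real Transformer.
    batch, seq_len, vocab = 2, 8, 100
    loss = next_token_loss(torch.randn(batch, seq_len, vocab),
                           torch.randint(0, vocab, (batch, seq_len)))
    print(loss.item())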
Vision-language models (VLMs) are a recent development in multimodal
learning that combine the power of LLMs with vision encoders to perform
tasks requiring an understanding of both visual and textual information. Among
the most popular VLMs are CLIP [381], UNITER [70], FLAVA [423], and GPT-4 [344].
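As an example of how a contrastive VLM such as CLIP [381] relates the two modalities, the sketch below scores image and text embeddings with a temperature-scaled cosine similarity; the encoders that would produce these embeddings are omitted, and the temperature value and function name are illustrative assumptions rather than details from the cited papers.

    import torch
    import torch.nn.functional as F

    def image_text_similarity(image_emb, text_emb, temperature=0.07):
        # L2-normalise both modalities and compare them with a scaled dot
        # product, giving one logit per (image, text) pair.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        return image_emb @ text_emb.T / temperature

    # Toy usage: match 4 images against 4 captions using random embeddings.
    logits = image_text_similarity(torch.randn(4, 512), torch.randn(4, 512))
    print(logits.argmax(dim=1))  # index of the best-matching caption per image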
In every chapter of this dissertation we have used Transformers, either as part