[Figure 2.3: Illustration of the main attention mechanisms in a Transformer; one panel is labelled "Quadratic complexity".]

Position information can, for instance, be encoded as a sinusoidal embedding, f(x) = sin(ωx + φ), where ω is the frequency and φ is the phase; these parameters can be fixed or learned as part of the training process, and they are typically shared across all tokens in the sequence. Integrating position information into Transformers can be achieved in different ways, for which [105, Table 1] gives an overview.

Transformers have gradually taken over as an end-to-end architecture for both NLP and CV tasks, although adoption in CV has been slower due to the lack of spatial invariance in the original Transformer architecture. This has been addressed by recent works such as the Vision Transformer (ViT) [101], which uses a patch-based input representation combined with position embeddings.

A large language model (LLM) consists of a stack of Transformers that is pretrained on a large corpus of text, typically using a self-supervised learning objective such as predicting the next token in a sequence. The goal of LLMs is to learn a general-purpose language representation that can be fine-tuned to perform well on a wide range of downstream tasks. LLMs have disrupted NLP in recent years, achieving SOTA performance on a wide range of tasks thanks to pretraining on large amounts of data. The most popular LLMs are BERT [95], RoBERTa [287], ELECTRA [73], T5 [383], GPT-3 [52], Llama-2 [452], and Mistral [199]. Besides the challenges specific to modeling document inputs, explained in Section 2.3.4, open challenges for LLMs include: (i) structured output generation, (ii) domain-specific knowledge injection (e.g., does retrieval-augmented generation (RAG) suffice? [253, 347]), and (iii) multimodality.

Vision-language models (VLMs) are a recent development in multimodal learning that combine the power of LLMs with vision encoders to perform tasks requiring understanding of both visual and textual information. The most popular VLMs are CLIP [381], UNITER [70], FLAVA [423], and GPT-4 [344].

In every chapter of this dissertation we have used Transformers, either as part
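
To make the sinusoidal position-embedding formula above concrete, the following is a minimal NumPy sketch of the fixed interleaved sine/cosine construction from the original Transformer, where dimension pair 2i uses the frequency ω_i = 1/10000^(2i/d); the function name and the base of 10000 follow that convention and are not taken from this dissertation.

```python
import numpy as np

def sinusoidal_position_embeddings(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal position embeddings, shape (seq_len, d_model).

    Each dimension pair (2i, 2i+1) uses a sinusoid of frequency
    omega_i = 1 / 10000**(2i / d_model): f(x) = sin(omega_i * x) for the
    even dimension and its phase-shifted counterpart cos(omega_i * x)
    for the odd dimension.
    """
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angular_freq = 1.0 / (10000 ** (dims / d_model))   # omega_i per dimension pair
    angles = positions * angular_freq                  # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine (phase shift of pi/2)
    return pe

# Example: embeddings for 128 positions at model width 512, to be added
# to the token embeddings before the first Transformer layer.
pe = sinusoidal_position_embeddings(seq_len=128, d_model=512)
print(pe.shape)  # (128, 512)
```

Because these embeddings depend only on the position index, they can be precomputed once and reused for any sequence up to the chosen length.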
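
Similarly, the next-token-prediction objective mentioned above for LLM pretraining amounts to a cross-entropy loss between the model's output at position t and the token observed at position t+1. The PyTorch sketch below assumes only a hypothetical model output of logits with shape (batch, seq_len, vocab_size); it illustrates the objective itself, not the training code of any specific LLM.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Causal language-modeling loss: predict token t+1 from tokens up to t.

    logits:    (batch, seq_len, vocab_size) scores produced by the model
    input_ids: (batch, seq_len) integer token ids of the training text
    """
    # The prediction at position t is scored against the token at t+1,
    # so drop the last logit and the first target token.
    shift_logits = logits[:, :-1, :]    # (batch, seq_len-1, vocab_size)
    shift_labels = input_ids[:, 1:]     # (batch, seq_len-1)
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),  # flatten batch and positions
        shift_labels.reshape(-1),
    )

# Toy usage with random logits standing in for a Transformer's output.
batch, seq_len, vocab = 2, 16, 100
input_ids = torch.randint(0, vocab, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab)
print(next_token_loss(logits, input_ids).item())
```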