2.1.3.2 Language Neural Networks
The first step in representing language input in a format compatible with NNs is to convert units of language, i.e., words, characters, or “tokens”, depending on the tokenizer, into numerical vectors. This is done by means of embeddings, which are typically learned as part of the training process and represent the meaning of words in a continuous vector space. There have been multiple generations of word embeddings, starting with one-hot vectors, which represent each word by a vector of zeros with a single one at its vocabulary index; this representation depends heavily on the tokenizer used and does not capture semantic relationships between words. Alternatives are frequency-based embeddings, such as TF-IDF vectors, which represent each word by its frequency in a document, weighted by its inverse document frequency across the corpus, capturing some lexical semantics but not the context in which the word appears.
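To make the contrast concrete, the following is a minimal sketch of these two static representations on a toy corpus; the documents, tokens, and weighting details are purely illustrative.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative only).
corpus = [
    ["the", "bank", "of", "the", "river"],
    ["the", "bank", "approved", "the", "loan"],
]

vocab = sorted({tok for doc in corpus for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    """A vector of zeros with a single one at the token's vocabulary index."""
    vec = [0.0] * len(vocab)
    vec[index[token]] = 1.0
    return vec

def tf_idf(doc):
    """Term frequency in the document, weighted by inverse document frequency."""
    counts = Counter(doc)
    vec = [0.0] * len(vocab)
    for tok, tf in counts.items():
        df = sum(tok in d for d in corpus)   # number of documents containing tok
        idf = math.log(len(corpus) / df)     # rarer terms receive a higher weight
        vec[index[tok]] = (tf / len(doc)) * idf
    return vec

print(one_hot("bank"))
print(tf_idf(corpus[0]))
```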
The next generation are Word2Vec embeddings, which are trained to predict the context of a word, i.e., the words that appear before and after it in a sentence. FastText embeddings improve on this by additionally representing each word through its character n-grams, i.e., sequences of n characters, so that subword information contributes to the word vector. The current generation are contextual word embeddings, which compute the representation of a word from its surrounding context and thereby learn the sense of a word in context, e.g., ‘bank’ as a river bank vs. a financial institution in ‘Feliz sits at the bank of the river Nete’.
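As a rough illustration of the two ideas above, the sketch below generates the (target, context) pairs a skip-gram model such as Word2Vec is trained on, and FastText-style character n-grams of a word; the window size, n-gram length, and example tokens are arbitrary choices.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs that a skip-gram model is trained to predict."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def char_ngrams(word, n=3):
    """FastText-style subword units: character n-grams of the padded word."""
    padded = f"<{word}>"                 # boundary markers around the word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(skipgram_pairs(["felix", "sits", "at", "the", "bank"]))
print(char_ngrams("riverbank"))
```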
Another important innovation is subword tokenization to deal with the out-of-vocabulary (OOV) problem, which is particularly relevant for morphologically rich languages, such as Dutch, where the meaning of a word can often be inferred from its subwords. A clever adaptation is byte pair encoding (BPE) [412], originally a data compression algorithm that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte; for tokenization, the most frequent pair of adjacent symbols is iteratively merged until a predefined vocabulary size is reached. This is particularly useful for multilingual models, where the vocabulary size would otherwise be too large to fit in memory.
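A minimal sketch of the BPE merge loop, in the spirit of [412], operating on characters of a toy word-frequency table rather than raw bytes; the corpus and the number of merges are illustrative.

```python
import re
from collections import Counter

# Toy vocabulary: whitespace-separated symbols with word frequencies (illustrative).
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge(pair, vocab):
    """Replace every occurrence of the given symbol pair with a merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

num_merges = 10                          # in practice: until a target vocabulary size
for _ in range(num_merges):
    counts = pair_counts(vocab)
    if not counts:
        break
    best = counts.most_common(1)[0][0]   # most frequent adjacent pair
    vocab = merge(best, vocab)
    print("merged", best)
```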
The first embedding layer is typically a lookup table, which maps each word to a unique index in a vocabulary and each index to a vector of real numbers. The embedding layer is usually followed by a recurrent, convolutional, or attention layer, which is used to capture the sequential nature of language.
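The lookup behaviour can be sketched as follows with a randomly initialised table; the vocabulary, embedding dimension, and unknown-token handling are illustrative, and in practice the table entries are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"<unk>": 0, "felix": 1, "sits": 2, "at": 3, "the": 4, "bank": 5}
emb_dim = 8                                    # illustrative embedding dimension

# The embedding layer is just a |V| x d matrix of trainable parameters.
embedding_table = rng.normal(size=(len(vocab), emb_dim)).astype(np.float32)

def embed(tokens):
    """Map each token to its vocabulary index, then look up the corresponding row."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return embedding_table[ids]                # shape: (sequence length, emb_dim)

print(embed(["felix", "sits", "at", "the", "bank"]).shape)   # (5, 8)
```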
Recurrent Neural Networks (RNNs) and recurrent architectures extended to model long-range dependencies, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, were long the dominant architectures for sequence modeling in NLP, but they have been superseded by Transformers in recent years.
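A minimal sketch of this embedding-plus-recurrent pattern, assuming PyTorch and purely illustrative sizes; the model is untrained and serves only to show the flow from token indices to contextual hidden states.

```python
import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """Embedding lookup followed by an LSTM over the token sequence."""
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)       # (batch, seq_len, emb_dim)
        outputs, (h_n, c_n) = self.lstm(embedded)  # outputs: (batch, seq_len, hidden_dim)
        return outputs, h_n

encoder = RecurrentEncoder()
token_ids = torch.randint(0, 1000, (2, 7))         # a batch of 2 sequences of 7 token ids
outputs, last_hidden = encoder(token_ids)
print(outputs.shape, last_hidden.shape)            # (2, 7, 128) and (1, 2, 128)
```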