2.1.3.2 Language Neural Networks
The first step in representing language input in a format compatible with NNs is to convert units of language, whether words, characters, or “tokens” as produced by a tokenizer, into numerical vectors. This is done by means of embeddings, which are typically learned as part of the training process and represent the meaning of words in a continuous vector space. There have been multiple generations of word embeddings. The earliest are one-hot vectors, which represent each word by a vector of zeros with a single one at its vocabulary index; this representation depends strongly on the tokenizer used and does not capture semantic relationships between words.
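A minimal sketch of such one-hot vectors, using a toy vocabulary made up for the example:

    # One-hot word vectors over a toy (made-up) vocabulary.
    vocab = ["the", "bank", "river", "money"]                 # index = position in the vocabulary
    word_to_index = {word: i for i, word in enumerate(vocab)}

    def one_hot(word):
        """A vector of zeros with a single 1 at the word's vocabulary index."""
        vector = [0] * len(vocab)
        vector[word_to_index[word]] = 1
        return vector

    print(one_hot("bank"))   # [0, 1, 0, 0] -- carries no notion of similarity between words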
Alternatives are frequency-based embeddings, such as TF-IDF vectors, which represent each word by its frequency in a document, weighted by its inverse document frequency in the corpus, capturing some lexical semantics, but not the context in which the word appears.
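A minimal sketch of this weighting scheme, over a toy corpus made up for the example:

    # TF-IDF: term frequency in a document, weighted by inverse document frequency.
    import math
    from collections import Counter

    corpus = [                                     # toy corpus of tokenized documents
        ["the", "bank", "of", "the", "river"],
        ["the", "bank", "lends", "money"],
        ["money", "flows", "like", "a", "river"],
    ]

    def tf_idf(term, document, corpus):
        tf = Counter(document)[term] / len(document)
        df = sum(1 for doc in corpus if term in doc)          # documents containing the term
        idf = math.log(len(corpus) / df) if df else 0.0
        return tf * idf

    print(tf_idf("bank", corpus[0], corpus))    # common across documents -> lower weight
    print(tf_idf("lends", corpus[1], corpus))   # rare across documents -> higher weight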
The next generation are Word2Vec embeddings, which are trained to predict the context of a word, i.e., the words that appear before and after it in a sentence. FastText embeddings improve on this by additionally considering character n-grams, i.e., sequences of n characters, so that the vector of a word also reflects its subwords.
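A minimal sketch of the training signal behind both ideas, using a toy sentence; the window size and n-gram length are illustrative choices:

    # Skip-gram training pairs (Word2Vec) and character n-grams (FastText), sketched.
    sentence = ["feliz", "sits", "at", "the", "bank", "of", "the", "river"]

    def skipgram_pairs(tokens, window=2):
        """(target, context) pairs: each word is trained to predict the words around it."""
        pairs = []
        for i, target in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pairs.append((target, tokens[j]))
        return pairs

    def char_ngrams(word, n=3):
        """Character n-grams with boundary markers, the subword units used by FastText."""
        padded = f"<{word}>"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(skipgram_pairs(sentence)[:4])   # [('feliz', 'sits'), ('feliz', 'at'), ('sits', 'feliz'), ...]
    print(char_ngrams("bank"))            # ['<ba', 'ban', 'ank', 'nk>']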
The current generation are contextual word embeddings, which take the surrounding context into account and thus learn the sense of a word as it is used, e.g., ‘bank’ as a river bank vs. a financial institution in ‘Feliz sits at the bank of the river Nete’.
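A minimal sketch of this behavior, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both illustrative choices, not prescribed by the text):

    # The same surface word receives different vectors depending on its sentence.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # illustrative checkpoint
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embed_word(sentence, word):
        """Contextual vector of `word`, taken from the model's last hidden layer."""
        inputs = tokenizer(sentence, return_tensors="pt")
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]   # (sequence_length, hidden_size)
        return hidden[tokens.index(word)]

    river = embed_word("Feliz sits at the bank of the river Nete.", "bank")
    money = embed_word("She deposited money at the bank.", "bank")
    print(torch.cosine_similarity(river, money, dim=0))     # below 1: the two senses differ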
Another important innovation is subword tokenization, which deals with the out-of-vocabulary (OOV) problem and is particularly relevant for morphologically rich languages, such as Dutch, where the meaning of a word can be inferred from its subwords. A clever extension is byte pair encoding (BPE) [412], originally a data compression algorithm that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte; applied to tokenization, it iteratively merges the most frequent pair of symbols into a new vocabulary entry until a predefined vocabulary size is reached. This is particularly useful for multilingual models, whose vocabularies would otherwise be too large to fit in memory.
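A minimal sketch of the BPE merge loop; the toy word frequencies are made up for the example:

    # Learn BPE merges: repeatedly fuse the most frequent adjacent pair of symbols.
    from collections import Counter

    def merge_pair(symbols, pair):
        """Merge every adjacent occurrence of `pair` in a list of symbols."""
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        return out

    def learn_bpe(word_counts, num_merges):
        """Words start out as character sequences; each merge adds one vocabulary entry."""
        vocab = {tuple(word): count for word, count in word_counts.items()}
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for symbols, count in vocab.items():
                for pair in zip(symbols, symbols[1:]):
                    pairs[pair] += count
            if not pairs:
                break
            best = max(pairs, key=pairs.get)                  # most frequent adjacent pair
            merges.append(best)
            vocab = {tuple(merge_pair(list(s), best)): c for s, c in vocab.items()}
        return merges

    corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}   # made-up frequencies
    print(learn_bpe(corpus, 4))   # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]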
The embedding layer at the input of such a network is typically implemented as a lookup table, which maps each word to a unique index in the vocabulary, and each index to a vector of real numbers.
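A minimal sketch of such a lookup table, assuming PyTorch (an illustrative choice) and made-up dimensions:

    # An embedding lookup table: token indices map to trainable real-valued vectors.
    import torch
    import torch.nn as nn

    vocab = {"<pad>": 0, "feliz": 1, "sits": 2, "at": 3, "the": 4, "bank": 5}
    embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

    token_ids = torch.tensor([[vocab[w] for w in ["feliz", "sits", "at", "the", "bank"]]])
    vectors = embedding(token_ids)      # shape: (batch=1, sequence_length=5, embedding_dim=8)
    print(vectors.shape)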
This embedding layer is then followed by a recurrent, convolutional, or attention layer, which captures the sequential nature of language.
Recurrent Neural Networks (RNNs), together with recurrent architectures extended to model long-range dependencies such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, were long the dominant architectures for sequence modeling in NLP, yet they have been superseded by Transformers in recent years.
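A minimal sketch of such an embedding-plus-recurrent pipeline, assuming PyTorch; the layer sizes are illustrative:

    # An LSTM encoder over embedded token sequences.
    import torch
    import torch.nn as nn

    class LSTMEncoder(nn.Module):
        def __init__(self, vocab_size, embedding_dim=64, hidden_dim=128):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dim)
            self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

        def forward(self, token_ids):
            embedded = self.embedding(token_ids)     # (batch, seq_len, embedding_dim)
            outputs, (hidden, cell) = self.lstm(embedded)
            return hidden[-1]                        # final hidden state summarizes the sequence

    encoder = LSTMEncoder(vocab_size=1000)
    token_ids = torch.randint(0, 1000, (2, 7))       # a made-up batch of 2 sequences of length 7
    print(encoder(token_ids).shape)                  # torch.Size([2, 128])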