2.1.3.2 Language Neural Networks
The first step in representing language input in a format compatible with NNs is to convert units of language, i.e., words, characters, or “tokens”, depending on the tokenizer, into numerical vectors. This is done by means of embeddings, which are typically learned as part of the training process and represent the meaning of words in a continuous vector space. There have been multiple generations of word embeddings, starting with one-hot vectors, which represent each word by a vector of zeros with a single one at its vocabulary index; this representation depends heavily on the tokenizer used and does not capture semantic relationships between words. Alternatives are frequency-based embeddings, such as TF-IDF vectors, which represent each word by its frequency in a document, weighted by its inverse document frequency across the corpus, capturing some lexical semantics but not the context in which the word appears.
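To make the contrast concrete, the following is a minimal sketch of these two static representations on a toy corpus; the documents, tokens, and weighting details are purely illustrative.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative only).
corpus = [
    ["the", "bank", "of", "the", "river"],
    ["the", "bank", "approved", "the", "loan"],
]

vocab = sorted({tok for doc in corpus for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    """A vector of zeros with a single one at the token's vocabulary index."""
    vec = [0.0] * len(vocab)
    vec[index[token]] = 1.0
    return vec

def tf_idf(doc):
    """Term frequency in the document, weighted by inverse document frequency."""
    counts = Counter(doc)
    vec = [0.0] * len(vocab)
    for tok, tf in counts.items():
        df = sum(tok in d for d in corpus)   # number of documents containing tok
        idf = math.log(len(corpus) / df)     # rarer terms receive a higher weight
        vec[index[tok]] = (tf / len(doc)) * idf
    return vec

print(one_hot("bank"))
print(tf_idf(corpus[0]))
```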
The next generation are Word2Vec embeddings, which are trained to predict the context of a word, i.e., the words that appear before and after it in a sentence. FastText embeddings improve on this by additionally representing each word through its character n-grams, i.e., sequences of n characters, so that subword information contributes to the word vector. The current generation are contextual word embeddings, which compute the representation of a word from its surrounding context and thereby learn the sense of a word in context, e.g., ‘bank’ as a river bank vs. a financial institution in ‘Feliz sits at the bank of the river Nete’.
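As a rough illustration of the two ideas above, the sketch below generates the (target, context) pairs a skip-gram model such as Word2Vec is trained on, and FastText-style character n-grams of a word; the window size, n-gram length, and example tokens are arbitrary choices.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs that a skip-gram model is trained to predict."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def char_ngrams(word, n=3):
    """FastText-style subword units: character n-grams of the padded word."""
    padded = f"<{word}>"                 # boundary markers around the word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(skipgram_pairs(["felix", "sits", "at", "the", "bank"]))
print(char_ngrams("riverbank"))
```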
Another important innovation is subword tokenization to deal with the out-of-vocabulary (OOV) problem, which is particularly relevant for morphologically rich languages, such as Dutch, where the meaning of a word can often be inferred from its subwords. A clever adaptation is byte pair encoding (BPE) [412], originally a data compression algorithm that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte; for tokenization, the most frequent pair of adjacent symbols is iteratively merged until a predefined vocabulary size is reached. This is particularly useful for multilingual models, where the vocabulary size would otherwise be too large to fit in memory.
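A minimal sketch of the BPE merge loop, in the spirit of [412], operating on characters of a toy word-frequency table rather than raw bytes; the corpus and the number of merges are illustrative.

```python
import re
from collections import Counter

# Toy vocabulary: whitespace-separated symbols with word frequencies (illustrative).
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge(pair, vocab):
    """Replace every occurrence of the given symbol pair with a merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

num_merges = 10                          # in practice: until a target vocabulary size
for _ in range(num_merges):
    counts = pair_counts(vocab)
    if not counts:
        break
    best = counts.most_common(1)[0][0]   # most frequent adjacent pair
    vocab = merge(best, vocab)
    print("merged", best)
```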
The first embedding layer is typically a lookup table, which maps each word to a unique index in a vocabulary and each index to a vector of real numbers. The embedding layer is usually followed by a recurrent, convolutional, or attention layer, which is used to capture the sequential nature of language.
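The lookup behaviour can be sketched as follows with a randomly initialised table; the vocabulary, embedding dimension, and unknown-token handling are illustrative, and in practice the table entries are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"<unk>": 0, "felix": 1, "sits": 2, "at": 3, "the": 4, "bank": 5}
emb_dim = 8                                    # illustrative embedding dimension

# The embedding layer is just a |V| x d matrix of trainable parameters.
embedding_table = rng.normal(size=(len(vocab), emb_dim)).astype(np.float32)

def embed(tokens):
    """Map each token to its vocabulary index, then look up the corresponding row."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return embedding_table[ids]                # shape: (sequence length, emb_dim)

print(embed(["felix", "sits", "at", "the", "bank"]).shape)   # (5, 8)
```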
Recurrent Neural Networks (RNNs) and recurrent architectures extended to model long-range dependencies, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, were long the dominant architectures for sequence modeling in NLP, but they have been superseded by Transformers in recent years.
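A minimal sketch of this embedding-plus-recurrent pattern, assuming PyTorch and purely illustrative sizes; the model is untrained and serves only to show the flow from token indices to contextual hidden states.

```python
import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """Embedding lookup followed by an LSTM over the token sequence."""
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)       # (batch, seq_len, emb_dim)
        outputs, (h_n, c_n) = self.lstm(embedded)  # outputs: (batch, seq_len, hidden_dim)
        return outputs, h_n

encoder = RecurrentEncoder()
token_ids = torch.randint(0, 1000, (2, 7))         # a batch of 2 sequences of 7 token ids
outputs, last_hidden = encoder(token_ids)
print(outputs.shape, last_hidden.shape)            # (2, 7, 128) and (1, 2, 128)
```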