# Documentation

## DatasetBuilder
`DatasetBuilder` provides a convenient way to build datasets for training the Andromeda model.
### Constructor

```python
def __init__(
    self,
    dataset_name,
    seq_len=8192,
    num_cpu=None,
    hf_account_repo=None,
    tokenizer="EleutherAI/gpt-neox-20b",
)
```
Initialize the `DatasetBuilder`.

Args:

- `dataset_name` (str): Name of the dataset to process.
- `seq_len` (int): Maximum sequence length. Defaults to 8192.
- `num_cpu` (int, optional): Number of CPU cores to use for multiprocessing. Defaults to None.
- `hf_account_repo` (str, optional): Hugging Face account name and repository to push the processed dataset to. Defaults to None.
- `tokenizer` (str, optional): Tokenizer model to use. Defaults to "EleutherAI/gpt-neox-20b".
### Methods

#### build_dataset

```python
def build_dataset(self) -> torch.utils.data.Dataset
```
Build and process the dataset.
Returns:

- `torch.utils.data.Dataset`: The processed dataset, ready for training.
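A minimal usage sketch, assuming only the constructor and `build_dataset` signatures documented above; the dataset name and CPU count are illustrative placeholders:

```python
from Andromeda import DatasetBuilder

# Build a tokenized dataset; "openwebtext" is an illustrative dataset name.
builder = DatasetBuilder(
    dataset_name="openwebtext",
    seq_len=8192,
    num_cpu=4,             # use 4 worker processes; None lets the class decide
    hf_account_repo=None,  # e.g. "user/repo" to push the result to the Hugging Face Hub
)

dataset = builder.build_dataset()  # torch.utils.data.Dataset, ready for training
```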
## AndromedaTokenizer

### Purpose

The `AndromedaTokenizer` class provides tokenization functionality using the Hugging Face tokenizer. It allows you to tokenize texts using the specified tokenizer model.
### Systems Understanding

The `AndromedaTokenizer` class initializes a tokenizer model from the Hugging Face library. It uses the `AutoTokenizer.from_pretrained` method to load the tokenizer with specific parameters such as the EOS token, pad token, extra IDs, and model maximum length. The `tokenize_texts` method tokenizes input texts using this tokenizer and returns the tokenized input IDs.
### Usage Example

```python
from Andromeda import AndromedaTokenizer

# Initialize the tokenizer
tokenizer = AndromedaTokenizer()

# Tokenize texts
texts = ["This is an example sentence.", "Another example sentence."]
tokenized_ids = tokenizer.tokenize_texts(texts)
print(tokenized_ids)
```
## Andromeda

### Purpose

The `Andromeda` class is a transformer-based model architecture. It consists of a `Transformer` and an `AutoregressiveWrapper` with default or user-specified parameters.
### Systems Understanding

The `Andromeda` class initializes with a `Transformer` and an `AutoregressiveWrapper`. The `Transformer` encapsulates the main transformer model, and the `AutoregressiveWrapper` enables autoregressive generation with it.

The constructor of the `Andromeda` class takes various parameters that define the model's architecture, such as the number of tokens, maximum sequence length, model dimension, depth, and number of heads. These parameters are used to initialize the `Transformer` and `AutoregressiveWrapper` with the specified configuration.

The `forward` method performs a forward pass through the model: it takes the input `text_tokens`, passes them through the decoder inside the `Andromeda` model, and returns the decoder's output.
### Usage Example

```python
import torch

from Andromeda import Andromeda

# Create an instance of the Andromeda model
model = Andromeda()

# Define the input text tokens as a (batch, seq_len) tensor of token IDs
text_tokens = torch.tensor([[1, 2, 3, 4, 5]])  # Example input tokens

# Perform a forward pass through the model
output = model.forward(text_tokens)
print(output)
```
### Constructor

```python
def __init__(
    self,
    num_tokens=50304,
    max_seq_len=8192,
    dim=2560,
    depth=32,
    dim_head=128,
    heads=24,
    use_abs_pos_emb=False,
    alibi_pos_bias=True,
    alibi_num_heads=12,
    rotary_xpos=True,
    attn_flash=True,
    deepnorm=True,
    shift_tokens=1,
    attn_one_kv_head=True,
    qk_norm=True,
    attn_qk_norm=True,
    attn_qk_norm_dim_scale=True,
    embedding_provider=AndromedaEmbedding(),
)
```
- `num_tokens` (optional): Number of tokens in the vocabulary.
- `max_seq_len` (optional): Maximum sequence length.
- `dim` (optional): Dimension of the model.
- `depth` (optional): Depth (number of layers) of the model.
- `dim_head` (optional): Dimension of each attention head.
- `heads` (optional): Number of attention heads.
- `use_abs_pos_emb` (optional): Whether to use absolute position embeddings.
- `alibi_pos_bias` (optional): Whether to use ALiBi positional bias.
- `alibi_num_heads` (optional): Number of heads that receive the ALiBi bias.
- `rotary_xpos` (optional): Whether to use rotary (xPos) position encoding.
- `attn_flash` (optional): Whether to use Flash Attention.
- `deepnorm` (optional): Whether to use DeepNorm normalization.
- `shift_tokens` (optional): Number of tokens to shift.
- `attn_one_kv_head` (optional): Whether to share a single key/value head across attention heads.
- `qk_norm` (optional): Query-key normalization.
- `attn_qk_norm` (optional): Attention query-key normalization.
- `attn_qk_norm_dim_scale` (optional): Attention query-key normalization dimension scale.
- `embedding_provider` (optional): Embedding provider module.
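As a sketch of how these parameters compose, here is a hypothetical smaller configuration; the values below are illustrative choices for experimentation, not recommended settings:

```python
from Andromeda import Andromeda

# Illustrative small configuration; all values here are assumptions made for
# the sake of the example, not tuned or recommended hyperparameters.
small_model = Andromeda(
    num_tokens=50304,  # vocabulary size
    max_seq_len=2048,  # shorter context than the 8192 default
    dim=512,           # model width
    depth=6,           # number of layers
    dim_head=64,       # per-head dimension
    heads=8,           # number of attention heads
)
```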
### Methods

#### forward(text_tokens, **kwargs)

Performs a forward pass through the model.

Args:

- `text_tokens` (required): Input token IDs.
- `**kwargs` (optional): Additional arguments passed through to the underlying modules.

Returns:

- Output from the decoder module.
## Conclusion

The Andromeda module provides a transformer-based model architecture for text generation. The `AndromedaTokenizer` class lets you tokenize texts with the specified tokenizer model, and the `Andromeda` class combines a transformer with an autoregressive wrapper to provide the functionality for text generation. Using these classes and methods together, you can generate text with the Andromeda model.
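Putting the pieces together, a minimal end-to-end sketch, assuming `tokenize_texts` returns token IDs in a form the model's `forward` accepts:

```python
from Andromeda import Andromeda, AndromedaTokenizer

# Tokenize raw text and run it through the model in one pass.
tokenizer = AndromedaTokenizer()
model = Andromeda()

texts = ["Andromeda is a transformer-based model."]
text_tokens = tokenizer.tokenize_texts(texts)

output = model.forward(text_tokens)
print(output)
```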