
# Documentation

## DatasetBuilder

DatasetBuilder provides a convenient way to build datasets for training the Andromeda model.

### Constructor

```python
def __init__(
    self,
    dataset_name,
    seq_len=8192,
    num_cpu=None,
    hf_account_repo=None,
    tokenizer="EleutherAI/gpt-neox-20b",
)
```

Initialize the DatasetBuilder.

Args:

- `dataset_name` (str): Name of the dataset to process.
- `seq_len` (int): Maximum sequence length. Defaults to 8192.
- `num_cpu` (int, optional): Number of CPU cores to use for multiprocessing. Defaults to None.
- `hf_account_repo` (str, optional): Hugging Face account and repository name to push the processed dataset to. Defaults to None.
- `tokenizer` (str, optional): Tokenizer model to use. Defaults to "EleutherAI/gpt-neox-20b".

### Methods

#### build_dataset

```python
def build_dataset(self) -> torch.utils.data.Dataset
```

Build and process the dataset.

Returns:

- `torch.utils.data.Dataset`: The processed dataset, ready for training.
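
### Usage Example

A minimal usage sketch, assuming `DatasetBuilder` is importable from the `Andromeda` package like the other classes documented here; the dataset name and CPU count below are illustrative, not confirmed from the source:

```python
from Andromeda import DatasetBuilder

# Build a tokenized dataset. The dataset name is a hypothetical placeholder.
builder = DatasetBuilder(
    dataset_name="the_pile_deduplicated",  # illustrative dataset name
    seq_len=8192,
    num_cpu=4,
    hf_account_repo=None,  # set to "user/repo" to push the result to the Hub
)

dataset = builder.build_dataset()
print(len(dataset))
```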

## AndromedaTokenizer

### Purpose

The AndromedaTokenizer class provides tokenization functionality using the Hugging Face tokenizer. It allows you to tokenize texts using the specified tokenizer model.

### Systems Understanding

The AndromedaTokenizer class initializes a tokenizer model from the Hugging Face library. It uses the `AutoTokenizer.from_pretrained` method to load the tokenizer model with specific parameters such as the EOS token, pad token, extra IDs, and model maximum length. The `tokenize_texts` method tokenizes input texts using the tokenizer model and returns the tokenized input IDs.
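
A minimal sketch of what this initialization might look like; the keyword values (EOS/pad tokens, extra IDs, maximum length) are illustrative assumptions, not confirmed from the source:

```python
from transformers import AutoTokenizer

# Roughly how AndromedaTokenizer loads its tokenizer; the keyword values
# below are illustrative assumptions.
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    eos_token="<eos>",
    pad_token="<pad>",
    extra_ids=0,
    model_max_length=8192,
)

# Tokenize a batch of texts into input IDs, as tokenize_texts does.
input_ids = tokenizer(
    ["This is an example sentence."],
    return_tensors="pt",
    padding=True,
    truncation=True,
).input_ids
```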

### Usage Example

```python
from Andromeda import AndromedaTokenizer

# Initialize the tokenizer
tokenizer = AndromedaTokenizer()

# Tokenize texts
texts = ["This is an example sentence.", "Another example sentence."]
tokenized_ids = tokenizer.tokenize_texts(texts)

print(tokenized_ids)
```

## Andromeda

### Purpose

The Andromeda class is a transformer-based model architecture. It consists of a Transformer and AutoregressiveWrapper with default or user-specified parameters.

### Systems Understanding

The Andromeda class initializes with a Transformer and AutoregressiveWrapper. The Transformer encapsulates the main transformer model, and the AutoregressiveWrapper enables autoregressive generation using the transformer model.

The constructor of the Andromeda class takes various parameters that define the architecture of the model, such as the number of tokens, maximum sequence length, model dimension, depth, number of heads, etc. These parameters are used to initialize the Transformer and AutoregressiveWrapper with the specified configuration.

The `forward` method performs a forward pass through the model. It takes the input `text_tokens` and passes them through the decoder module (the `AutoregressiveWrapper`) inside the Andromeda model. The decoder's output is returned as the result.

### Usage Example

```python
import torch

from Andromeda import Andromeda

# Create an instance of the Andromeda model
model = Andromeda()

# Define the input token IDs as a (batch, seq_len) tensor
text_tokens = torch.tensor([[1, 2, 3, 4, 5]])  # example input tokens

# Perform a forward pass through the model
output = model(text_tokens)

print(output)
```

### Constructor

```python
def __init__(
    self,
    num_tokens=50304,
    max_seq_len=8192,
    dim=2560,
    depth=32,
    dim_head=128,
    heads=24,
    use_abs_pos_emb=False,
    alibi_pos_bias=True,
    alibi_num_heads=12,
    rotary_xpos=True,
    attn_flash=True,
    deepnorm=True,
    shift_tokens=1,
    attn_one_kv_head=True,
    qk_norm=True,
    attn_qk_norm=True,
    attn_qk_norm_dim_scale=True,
    embedding_provider=AndromedaEmbedding(),
)
```
- `num_tokens` (optional): Number of tokens in the vocabulary. Defaults to 50304.
- `max_seq_len` (optional): Maximum sequence length. Defaults to 8192.
- `dim` (optional): Model (embedding) dimension. Defaults to 2560.
- `depth` (optional): Number of transformer layers. Defaults to 32.
- `dim_head` (optional): Dimension of each attention head. Defaults to 128.
- `heads` (optional): Number of attention heads. Defaults to 24.
- `use_abs_pos_emb` (optional): Whether to use absolute positional embeddings. Defaults to False.
- `alibi_pos_bias` (optional): Whether to use ALiBi positional bias. Defaults to True.
- `alibi_num_heads` (optional): Number of heads that receive the ALiBi bias. Defaults to 12.
- `rotary_xpos` (optional): Whether to use rotary (xPos) positional embeddings. Defaults to True.
- `attn_flash` (optional): Whether to use flash attention. Defaults to True.
- `deepnorm` (optional): Whether to use DeepNorm normalization. Defaults to True.
- `shift_tokens` (optional): Number of positions to shift tokens. Defaults to 1.
- `attn_one_kv_head` (optional): Whether to share a single key/value head across all query heads (multi-query attention). Defaults to True.
- `qk_norm` (optional): Whether to apply query-key normalization. Defaults to True.
- `attn_qk_norm` (optional): Whether to apply query-key normalization in attention. Defaults to True.
- `attn_qk_norm_dim_scale` (optional): Whether to scale the query-key normalization by head dimension. Defaults to True.
- `embedding_provider` (optional): Module that provides token embeddings. Defaults to `AndromedaEmbedding()`.
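
For reference, a hedged sketch of overriding a few of these defaults for a smaller model; the reduced values are illustrative, not a recommended configuration:

```python
from Andromeda import Andromeda

# A smaller illustrative configuration; these values are assumptions,
# not taken from the source.
small_model = Andromeda(
    num_tokens=50304,
    max_seq_len=2048,
    dim=512,
    depth=6,
    dim_head=64,
    heads=8,
)
```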

### Methods

- `forward(text_tokens, **kwargs)`: Performs a forward pass through the model.
  - `text_tokens` (required): Input token IDs.
  - `kwargs` (optional): Additional arguments passed through to the decoder.

Args:

- `text_tokens` (torch.Tensor): Input token IDs of shape (batch, seq_len).

Returns:

- The output of the decoder module.
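
Because the model wraps its Transformer in an `AutoregressiveWrapper`, it can also sample continuations. A minimal sketch, assuming the wrapper is exposed as a `decoder` attribute with a `generate(start_tokens, seq_len)` method as in the x-transformers library; both the attribute name and the prompt tokens are assumptions:

```python
import torch

from Andromeda import Andromeda

model = Andromeda()

# Prompt of token IDs with shape (batch, seq_len); values are illustrative.
prompt = torch.randint(0, 50304, (1, 8))

# Assumes AutoregressiveWrapper exposes generate(start_tokens, seq_len),
# as in x-transformers; "decoder" is a hypothetical attribute name.
generated = model.decoder.generate(prompt, 64)
print(generated.shape)
```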

## Conclusion

The Andromeda module provides a transformer-based architecture for text generation. `AndromedaTokenizer` tokenizes text with the specified Hugging Face tokenizer model, `DatasetBuilder` prepares datasets for training, and `Andromeda` combines a Transformer with an AutoregressiveWrapper to support autoregressive generation. Together, these classes and methods let you build datasets, run forward passes, and generate text with the Andromeda model.