# Documentation

## `DatasetBuilder`

`DatasetBuilder` provides a convenient way to build datasets for training the Andromeda model.

#### Constructor

```python
def __init__(
    self,
    dataset_name,
    seq_len=8192,
    num_cpu=None,
    hf_account_repo=None,
    tokenizer="EleutherAI/gpt-neox-20b",
)
```

Initialize the DatasetBuilder.

**Args:**

- `dataset_name` (str): Name of the dataset to process.
- `seq_len` (int): Maximum sequence length.
- `num_cpu` (int, optional): Number of CPU cores to use for multiprocessing. Defaults to None.
- `hf_account_repo` (str, optional): Hugging Face account/repository to push the processed dataset to. Defaults to None.
- `tokenizer` (str, optional): Tokenizer model to use. Defaults to "EleutherAI/gpt-neox-20b".

#### Methods

##### build_dataset

```python
def build_dataset(self) -> torch.utils.data.Dataset
```

Build and process the dataset.

**Returns:**

- `torch.utils.data.Dataset`: The processed dataset, ready for training.

A minimal end-to-end sketch that consumes the processed dataset appears at the end of this document.

## AndromedaTokenizer

### Purpose

The `AndromedaTokenizer` class provides tokenization functionality using the Hugging Face tokenizer. It allows you to tokenize texts using the specified tokenizer model.

### Systems Understanding

The `AndromedaTokenizer` class initializes a tokenizer model from the Hugging Face library. It uses the `AutoTokenizer.from_pretrained` method to load the tokenizer model with specific parameters such as the EOS token, pad token, extra IDs, and model maximum length. The `tokenize_texts` method tokenizes input texts with the tokenizer model and returns the tokenized input IDs.

### Usage Example

```python
from Andromeda import AndromedaTokenizer

# Initialize the tokenizer
tokenizer = AndromedaTokenizer()

# Tokenize texts
texts = ["This is an example sentence.", "Another example sentence."]
tokenized_ids = tokenizer.tokenize_texts(texts)

print(tokenized_ids)
```

## Andromeda

### Purpose

The `Andromeda` class is a transformer-based model architecture. It consists of a `Transformer` and an `AutoregressiveWrapper` with default or user-specified parameters.

### Systems Understanding

The `Andromeda` class initializes with a `Transformer` and an `AutoregressiveWrapper`. The `Transformer` encapsulates the main transformer model, and the `AutoregressiveWrapper` enables autoregressive generation with it.

The constructor of the `Andromeda` class takes various parameters that define the architecture of the model, such as the number of tokens, maximum sequence length, model dimension, depth, number of heads, and so on. These parameters are used to initialize the `Transformer` and `AutoregressiveWrapper` with the specified configuration.

The `forward` method performs a forward pass through the model. It takes `text_tokens` as input and passes it through the `Decoder` module inside the `Andromeda` model. The output from the decoder is returned as the result.
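In practice, the token IDs passed to `forward` usually come from `AndromedaTokenizer`. The following is a minimal, illustrative sketch of that pairing; it assumes that `tokenize_texts` returns a padded `(batch_size, seq_len)` tensor of token IDs, which is what the underlying Hugging Face tokenizer typically produces.

```python
from Andromeda import Andromeda, AndromedaTokenizer

# Assumption: tokenize_texts returns a (batch_size, seq_len) LongTensor of token IDs
tokenizer = AndromedaTokenizer()
model = Andromeda()

token_ids = tokenizer.tokenize_texts(["Andromeda is a transformer-based model."])

# Forward pass over the batch of token IDs
output = model(token_ids)

print(output)
```

The usage example below shows the same forward pass with hand-written token IDs.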
### Usage Example

```python
import torch

from Andromeda import Andromeda

# Create an instance of the Andromeda model
model = Andromeda()

# Define the input token IDs as a (batch_size, seq_len) LongTensor
text_tokens = torch.tensor([[1, 2, 3, 4, 5]])  # Example input tokens

# Perform a forward pass through the model
output = model(text_tokens)

print(output)
```

### Constructor

```python
def __init__(
    self,
    num_tokens=50304,
    max_seq_len=8192,
    dim=2560,
    depth=32,
    dim_head=128,
    heads=24,
    use_abs_pos_emb=False,
    alibi_pos_bias=True,
    alibi_num_heads=12,
    rotary_xpos=True,
    attn_flash=True,
    deepnorm=True,
    shift_tokens=1,
    attn_one_kv_head=True,
    qk_norm=True,
    attn_qk_norm=True,
    attn_qk_norm_dim_scale=True,
    embedding_provider=AndromedaEmbedding(),
)
```

- `num_tokens` (optional): Number of tokens in the vocabulary.
- `max_seq_len` (optional): Maximum sequence length.
- `dim` (optional): Dimension of the model.
- `depth` (optional): Depth (number of layers) of the model.
- `dim_head` (optional): Dimension of each attention head.
- `heads` (optional): Number of attention heads.
- `use_abs_pos_emb` (optional): Whether to use absolute positional embeddings.
- `alibi_pos_bias` (optional): Whether to use ALiBi positional bias.
- `alibi_num_heads` (optional): Number of attention heads that use ALiBi bias.
- `rotary_xpos` (optional): Whether to use rotary (xPos) positional embeddings.
- `attn_flash` (optional): Whether to use flash attention.
- `deepnorm` (optional): Whether to use DeepNorm residual scaling.
- `shift_tokens` (optional): Number of positions by which tokens are shifted.
- `attn_one_kv_head` (optional): Whether to share a single key/value head across attention heads.
- `qk_norm` (optional): Whether to apply query-key normalization.
- `attn_qk_norm` (optional): Whether to apply query-key normalization in attention.
- `attn_qk_norm_dim_scale` (optional): Whether to scale the query-key normalization by the head dimension.
- `embedding_provider` (optional): Module that provides the token embeddings.

### Methods

- `forward(text_tokens, **kwargs)`: Performs a forward pass through the model.
  - `text_tokens` (required): Input token IDs.
  - `kwargs` (optional): Other arguments.

### Args

- `text_tokens` (torch.Tensor): Input token IDs of shape `(batch_size, seq_len)`.

### Returns

- Output from the decoder module.

## Conclusion

The Andromeda module provides a transformer-based model architecture for text generation. The `AndromedaTokenizer` class lets you tokenize texts with the specified tokenizer model, and the `Andromeda` class combines a transformer with an autoregressive wrapper to provide the functionality for text generation. By using the provided classes and methods together, you can generate text with the Andromeda model.
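For completeness, the sketch below ties the three classes together, from building a dataset to running a forward pass. It is a minimal, hedged example rather than the project's canonical training loop: it assumes that `DatasetBuilder` is importable from the `Andromeda` package like the other classes, that each processed example exposes an `input_ids` field of token IDs, and the dataset name is only a placeholder.

```python
import torch
from torch.utils.data import DataLoader

from Andromeda import Andromeda, DatasetBuilder  # assumption: DatasetBuilder is exported by the package

# Build and tokenize the dataset (the dataset name is a placeholder)
builder = DatasetBuilder(dataset_name="your_dataset_name", seq_len=8192)
dataset = builder.build_dataset()

# Assumption: each processed example exposes an "input_ids" field of token IDs
def collate(batch):
    return torch.tensor([example["input_ids"] for example in batch], dtype=torch.long)

loader = DataLoader(dataset, batch_size=2, collate_fn=collate)

model = Andromeda()

# Run a single forward pass over one batch of token IDs
for token_ids in loader:
    output = model(token_ids)
    print(output)
    break
```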