# Documentation

## `DatasetBuilder`

`DatasetBuilder` provides a convenient way to build datasets for training the Andromeda model.

#### Constructor

```python
def __init__(
    self,
    dataset_name,
    seq_len=8192,
    num_cpu=None,
    hf_account_repo=None,
    tokenizer="EleutherAI/gpt-neox-20b",
)
```

Initialize the DatasetBuilder.

**Args:**

- `dataset_name` (str): Name of the dataset to process.
- `seq_len` (int): Maximum sequence length.
- `num_cpu` (int, optional): Number of CPU cores to use for multiprocessing. Defaults to None.
- `hf_account_repo` (str, optional): Hugging Face account/repository to push the processed dataset to. Defaults to None.
- `tokenizer` (str, optional): Tokenizer model to use. Defaults to "EleutherAI/gpt-neox-20b".

#### Methods

##### build_dataset

```python
def build_dataset(self) -> torch.utils.data.Dataset
```

Build and process the dataset.

**Returns:**

- `torch.utils.data.Dataset`: The processed dataset, ready for training.

A minimal end-to-end sketch that consumes the processed dataset appears at the end of this document.

## AndromedaTokenizer

### Purpose

The `AndromedaTokenizer` class provides tokenization functionality using the Hugging Face tokenizer. It allows you to tokenize texts using the specified tokenizer model.

### Systems Understanding

The `AndromedaTokenizer` class initializes a tokenizer model from the Hugging Face library. It uses the `AutoTokenizer.from_pretrained` method to load the tokenizer model with specific parameters such as the EOS token, pad token, extra IDs, and model maximum length. The `tokenize_texts` method tokenizes input texts with the tokenizer model and returns the tokenized input IDs.

### Usage Example

```python
from Andromeda import AndromedaTokenizer

# Initialize the tokenizer
tokenizer = AndromedaTokenizer()

# Tokenize texts
texts = ["This is an example sentence.", "Another example sentence."]
tokenized_ids = tokenizer.tokenize_texts(texts)

print(tokenized_ids)
```

## Andromeda

### Purpose

The `Andromeda` class is a transformer-based model architecture. It consists of a `Transformer` and an `AutoregressiveWrapper` with default or user-specified parameters.

### Systems Understanding

The `Andromeda` class initializes with a `Transformer` and an `AutoregressiveWrapper`. The `Transformer` encapsulates the main transformer model, and the `AutoregressiveWrapper` enables autoregressive generation with it.

The constructor of the `Andromeda` class takes various parameters that define the architecture of the model, such as the number of tokens, maximum sequence length, model dimension, depth, number of heads, and so on. These parameters are used to initialize the `Transformer` and `AutoregressiveWrapper` with the specified configuration.

The `forward` method performs a forward pass through the model. It takes `text_tokens` as input and passes it through the `Decoder` module inside the `Andromeda` model. The output from the decoder is returned as the result.
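In practice, the token IDs passed to `forward` usually come from `AndromedaTokenizer`. The following is a minimal, illustrative sketch of that pairing; it assumes that `tokenize_texts` returns a padded `(batch_size, seq_len)` tensor of token IDs, which is what the underlying Hugging Face tokenizer typically produces.

```python
from Andromeda import Andromeda, AndromedaTokenizer

# Assumption: tokenize_texts returns a (batch_size, seq_len) LongTensor of token IDs
tokenizer = AndromedaTokenizer()
model = Andromeda()

token_ids = tokenizer.tokenize_texts(["Andromeda is a transformer-based model."])

# Forward pass over the batch of token IDs
output = model(token_ids)

print(output)
```

The usage example below shows the same forward pass with hand-written token IDs.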
### Usage Example

```python
import torch

from Andromeda import Andromeda

# Create an instance of the Andromeda model
model = Andromeda()

# Define the input token IDs as a (batch_size, seq_len) LongTensor
text_tokens = torch.tensor([[1, 2, 3, 4, 5]])  # Example input tokens

# Perform a forward pass through the model
output = model(text_tokens)

print(output)
```

### Constructor

```python
def __init__(
    self,
    num_tokens=50304,
    max_seq_len=8192,
    dim=2560,
    depth=32,
    dim_head=128,
    heads=24,
    use_abs_pos_emb=False,
    alibi_pos_bias=True,
    alibi_num_heads=12,
    rotary_xpos=True,
    attn_flash=True,
    deepnorm=True,
    shift_tokens=1,
    attn_one_kv_head=True,
    qk_norm=True,
    attn_qk_norm=True,
    attn_qk_norm_dim_scale=True,
    embedding_provider=AndromedaEmbedding(),
)
```

- `num_tokens` (optional): Number of tokens in the vocabulary.
- `max_seq_len` (optional): Maximum sequence length.
- `dim` (optional): Dimension of the model.
- `depth` (optional): Depth (number of layers) of the model.
- `dim_head` (optional): Dimension of each attention head.
- `heads` (optional): Number of attention heads.
- `use_abs_pos_emb` (optional): Whether to use absolute positional embeddings.
- `alibi_pos_bias` (optional): Whether to use ALiBi positional bias.
- `alibi_num_heads` (optional): Number of attention heads that use ALiBi bias.
- `rotary_xpos` (optional): Whether to use rotary (xPos) positional embeddings.
- `attn_flash` (optional): Whether to use flash attention.
- `deepnorm` (optional): Whether to use DeepNorm residual scaling.
- `shift_tokens` (optional): Number of positions by which tokens are shifted.
- `attn_one_kv_head` (optional): Whether to share a single key/value head across attention heads.
- `qk_norm` (optional): Whether to apply query-key normalization.
- `attn_qk_norm` (optional): Whether to apply query-key normalization in attention.
- `attn_qk_norm_dim_scale` (optional): Whether to scale the query-key normalization by the head dimension.
- `embedding_provider` (optional): Module that provides the token embeddings.

### Methods

- `forward(text_tokens, **kwargs)`: Performs a forward pass through the model.
  - `text_tokens` (required): Input token IDs.
  - `kwargs` (optional): Other arguments.

### Args

- `text_tokens` (torch.Tensor): Input token IDs of shape `(batch_size, seq_len)`.

### Returns

- Output from the decoder module.

## Conclusion

The Andromeda module provides a transformer-based model architecture for text generation. The `AndromedaTokenizer` class lets you tokenize texts with the specified tokenizer model, and the `Andromeda` class combines a transformer with an autoregressive wrapper to provide the functionality for text generation. By using the provided classes and methods together, you can generate text with the Andromeda model.
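For completeness, the sketch below ties the three classes together, from building a dataset to running a forward pass. It is a minimal, hedged example rather than the project's canonical training loop: it assumes that `DatasetBuilder` is importable from the `Andromeda` package like the other classes, that each processed example exposes an `input_ids` field of token IDs, and the dataset name is only a placeholder.

```python
import torch
from torch.utils.data import DataLoader

from Andromeda import Andromeda, DatasetBuilder  # assumption: DatasetBuilder is exported by the package

# Build and tokenize the dataset (the dataset name is a placeholder)
builder = DatasetBuilder(dataset_name="your_dataset_name", seq_len=8192)
dataset = builder.build_dataset()

# Assumption: each processed example exposes an "input_ids" field of token IDs
def collate(batch):
    return torch.tensor([example["input_ids"] for example in batch], dtype=torch.long)

loader = DataLoader(dataset, batch_size=2, collate_fn=collate)

model = Andromeda()

# Run a single forward pass over one batch of token IDs
for token_ids in loader:
    output = model(token_ids)
    print(output)
    break
```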