metadata

license: mit
datasets:
  - HuggingFaceTB/cosmopedia
  - bigcode/starcoderdata
  - shivendrra/consolidated-datasets
language:
  - en
tags:
  - transformers
  - bert
  - decoder-only
  - encoder-decoder
  - mixture of experts
  - moe
  - MoE
  - aiva-500m
  - transformer model
  - llm
  - small scale model

aiva-4x500m

Model Details

This is a transformer-based model trained on the [cosmopedia] and [starcoder] datasets. It can generate new sequences and classify the emotions and sentiments in speech. It uses a mixture-of-experts (MoE) design like Mistral's 8x7B model, but with four 500-million-parameter models.
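To illustrate the mixture-of-experts idea, here is a minimal sketch of a sparse MoE layer with four experts. This card does not state how the four models are routed, so the top-2 token routing below is an assumption borrowed from the 8x7B design, and `ToyMoE`, `router` and the small MLP experts are hypothetical stand-ins rather than the code in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMoE(nn.Module):
    """Sparse mixture-of-experts layer with top-k routing (illustrative only)."""

    def __init__(self, d_model: int, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # The router scores every token against every expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Assumption: each expert is a small MLP standing in for a full model.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        scores = self.router(x)                                   # (B, T, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)     # (B, T, top_k)
        weights = F.softmax(top_scores, dim=-1)                   # normalise over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            chosen = top_idx == e                                 # (B, T, top_k)
            token_mask = chosen.any(dim=-1)                       # tokens routed to expert e
            if token_mask.any():
                # Routing weight of expert e for each selected token.
                w = (weights * chosen).sum(dim=-1)[token_mask].unsqueeze(-1)
                out[token_mask] += w * expert(x[token_mask])
        return out


moe = ToyMoE(d_model=16)
x = torch.randn(2, 5, 16)
y = moe(x)  # same shape as x; each token is a weighted mix of 2 of the 4 experts
```

Only the selected experts run for a given token, which is what keeps the compute per token well below that of running all four experts.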

For now it only includes the language models, but I'm working on vision and audio models, which will be uploaded soon.

Model Description

Developed by: Shivendra Singh
License: [MIT]
Train loss: 0.2035
Accuracy: Not yet determined (for next-token prediction)

Model Sources

Repository: github/aiva-4x500m
Papers: None

Uses

For now, the language model can be used for token generation, masked-token prediction and sentiment analysis. In the future, it will be paired with the audio and vision models to make it work like Ava from Ex Machina: it could listen to humans, talk to them, and understand sentiments, emotions and actions using its vision and audio capabilities.

Training Details

Training Data

Datasets used: cosmopedia, shivendrra/consolidated-datasets, starcoderdata

Training Procedure

The transformer-based model was trained for 35k iterations on 3.5 billion tokens, for around 25 hours on a Google Colab T4 GPU. I had access to a lot more data, but I didn't train it further because of budget issues and technical limitations.

Functions:

The training procedure is basic. get_batch() generates batches of data, estimate_loss() estimates the train and validation losses, and the train() function acts as the master function, calling the others at every iteration or at set intervals.

import torch

def get_batch(split):
    # Sample a random batch of inputs x and next-token targets y
    data = train_data if split == 'train' else val_data
    # Random start offsets, leaving room for block_size + 1 tokens
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    # Targets are the same windows shifted forward by one token
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

for step in range(max_iters):
    # Periodically report the mean train and validation loss
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # One optimisation step on a fresh training batch
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
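To see what the batches look like, here is a small self-contained sketch of the same sampling logic on toy data. The values of `block_size` and `batch_size`, the `torch.arange` corpus and the `get_batch_demo` name are assumptions chosen for illustration, not the settings used in training.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for the real corpus and config (assumed values).
train_data = torch.arange(100)      # pretend token ids 0..99
block_size, batch_size = 8, 4
device = "cpu"


def get_batch_demo(data):
    # Same sampling logic as get_batch(), but taking the data as an argument.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)


xb, yb = get_batch_demo(train_data)
print(xb.shape, yb.shape)           # both are (batch_size, block_size)

# Each target row is its input row shifted forward by one token,
# so position t of y holds the token that follows position t of x.
assert torch.equal(xb[:, 1:], yb[:, :-1])
```

Because every position in a window has its own target, a single batch provides batch_size * block_size next-token predictions to learn from.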

Training Hyperparameters

Configurations are saved in the base/config.json file and are suitable for the 500-million-parameter encoder-decoder model.

{
  "batch_size": 10,
  "block_size": 256,
  "max_iters": 5000,
  "eval_interval": 50,
  "learning_rate": 3e-5,
  "eval_iters": 100,
  "d_model": 512,
  "n_head": 18,
  "n_layer": 12,
  "dropout": 0.2,
  "norm_eps": 1e-5
}
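As a quick reading aid, the sketch below parses the values listed above and derives a few quantities from them with simple arithmetic. The JSON is inlined so the snippet runs on its own; in the repository it would presumably be read from base/config.json. The derived names (`tokens_per_step`, `n_evals` and so on) are mine, not part of the config.

```python
import json

# Inline copy of the configuration shown above.
config_text = """
{
  "batch_size": 10,
  "block_size": 256,
  "max_iters": 5000,
  "eval_interval": 50,
  "learning_rate": 3e-5,
  "eval_iters": 100,
  "d_model": 512,
  "n_head": 18,
  "n_layer": 12,
  "dropout": 0.2,
  "norm_eps": 1e-5
}
"""
cfg = json.loads(config_text)

# Token positions seen in one optimizer step, and over one run of max_iters steps.
tokens_per_step = cfg["batch_size"] * cfg["block_size"]
tokens_per_run = tokens_per_step * cfg["max_iters"]

# Number of times the loop calls estimate_loss(): every eval_interval steps
# plus the final step.
n_evals = sum(
    1
    for step in range(cfg["max_iters"])
    if step % cfg["eval_interval"] == 0 or step == cfg["max_iters"] - 1
)
# Each evaluation runs eval_iters batches for both the train and val split.
eval_forward_passes = n_evals * 2 * cfg["eval_iters"]

print(tokens_per_step, tokens_per_run, n_evals, eval_forward_passes)
```

With these values, evaluation accounts for more forward passes than training does, so raising eval_interval or lowering eval_iters is one way to shorten a run on limited hardware.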

Model Architecture and Objective

There is one trained model uploaded for now: a 536-million-parameter transformer model trained for over 35k iterations. It uses RMS norm and has a context size of only 256 tokens. tiktoken is used for tokenization, and a tokenizer file configured for the trained model is also included. The decoder-based model isn't uploaded yet; it's a little hard to train due to its complexity, but it will be uploaded soon.
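For readers unfamiliar with RMS norm, here is a minimal sketch of the layer in PyTorch. It follows the commonly used formulation (scale by the root mean square of the features, then apply a learned gain); the class below is an illustration under that assumption and may differ in detail from the implementation in this repository. The `eps` default mirrors the `norm_eps` value in the config.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalisation over the last dimension."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Learned per-feature gain, initialised to 1.
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unlike LayerNorm, there is no mean subtraction and no bias term.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)


norm = RMSNorm(512)
x = torch.randn(2, 256, 512) * 3.0   # (batch, context, d_model)
y = norm(x)                          # same shape, unit RMS per token
```

In a pre-normalization block this layer is applied to the input of each sub-layer, as in `x = x + sublayer(norm(x))`, rather than to its output.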

Highlights

RMS Normalization & Pre-normalization: Both models use RMS normalization, as implemented in LLaMA-2, together with pre-normalization for stability during training.
Self-Attention Layer: The encoder and final attention layers have no masking, and their key, query and value projections include bias terms. The decoder attention layer applies a triangular mask and has no biases. The encoder attention also adds relative positional embeddings to the attention matrix before the softmax.
FeedForward: A basic feed-forward network with two linear layers and an expansion factor of 5. GELU is used as the activation function instead of ReLU.
Generation: The token generation function uses top_k, top_p and beam search along with temperature scaling, but it has a bug and does not yet work as intended. I'll try to fix it and upload it again.
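For reference, the sketch below shows one common way to combine temperature scaling with top-k and top-p (nucleus) filtering when sampling a single next token. It is not the generation function from this repository and not a fix for its bug; `sample_next_token` and its arguments are hypothetical, and beam search is left out.

```python
import torch
import torch.nn.functional as F


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id per row from logits of shape (batch, vocab)."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    logits = logits / max(temperature, 1e-8)

    if top_k is not None:
        k = min(top_k, logits.size(-1))
        # Value of the k-th largest logit in each row, shape (batch, 1).
        kth = torch.topk(logits, k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        # Probability mass that comes *before* each token in sorted order.
        mass_before = sorted_probs.cumsum(dim=-1) - sorted_probs
        # Drop a token once the mass before it already exceeds top_p,
        # so the most likely token is always kept.
        sorted_logits = sorted_logits.masked_fill(mass_before > top_p, float("-inf"))
        # Put the filtered logits back in their original vocabulary order.
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # (batch, 1)


torch.manual_seed(0)
logits = torch.randn(3, 50)
next_ids = sample_next_token(logits, temperature=0.8, top_k=10, top_p=0.9)
```

In a generation loop, the logits passed in would be those of the last position, and the sampled id would be appended to the context, which is cropped to the last 256 tokens before the next forward pass.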