fix readme

- .ipynb_checkpoints/README-checkpoint.md +28 -4
- README.md +28 -4
.ipynb_checkpoints/README-checkpoint.md CHANGED

````diff
@@ -1,12 +1,21 @@
 ---
+library_name: lucidrains/gated-state-spaces-pytorch
 license: mit
+datasets:
+- c4
+pipeline_tag: text-generation
+tags:
+- text generation
+- pytorch
+- causal-lm
+- gated-state-space
 ---
 
 # [Gated State Space](https://arxiv.org/abs/2206.13947)
 
-This repo contains pretrain model for the gated state space paper. The model has been trained on [C4 dataset](https://huggingface.co/datasets/c4). I have used [Lucidrains' implementation](https://github.com/lucidrains/gated-state-spaces-pytorch) ([commit](https://github.com/lucidrains/gated-state-spaces-pytorch/tree/32cd036e775112cc469e94fa1165fe111393708b)) for the model. I think the main benefit of this model is the ability to scale beyond the training context length. As authors noted in the paper, they trained the model on 4k sequence length but it generalized beyond that length. I have written a blog post on how I started the training [here](https://naxalpha.substack.com/p/devlog-experiment-a2a468-gated-state)
+This repo contains a pretrained model for the gated state space paper. The model has been trained on the [C4 dataset](https://huggingface.co/datasets/c4). I have used [Lucidrains' implementation](https://github.com/lucidrains/gated-state-spaces-pytorch) ([commit](https://github.com/lucidrains/gated-state-spaces-pytorch/tree/32cd036e775112cc469e94fa1165fe111393708b)) for the model. I think the main benefit of this model is the ability to scale beyond the training context length. As the authors noted in the paper, they trained the model on a 4k sequence length, but it generalized beyond that length. I have written a **blog post on how I started the training [here](https://naxalpha.substack.com/p/devlog-experiment-a2a468-gated-state)**.
 
-[
+**[Wandb Report is available at this link](https://wandb.ai/naxalpha/gated-state-space/reports/Gated-State-Space-Training-v1--VmlldzozMTYzMzY3?accessToken=zy10rrpofi9k7l52aqwiej8bk0ub302rdswfkxmf8y94dt2j6z4kxbca6ar3sc52)**
 
 ## How to use this.
 
@@ -38,6 +47,21 @@ Since it is not based on [transformers](https://github.com/huggingface/transform
 ```
 
 
-## Training
+## Training Information
 
-
+Here are the details of the training:
+
+- Objective: `Alternate between simple cross entropy and GPT-2 XL distillation`
+- Gradient Accumulation: `4`
+- Batch Size: `8`
+- Sequence Length: `128`
+- Learning Rate: `2e-5`
+- Optimizer: `AdamW`
+- Gradient Norm Clipping: `1.0`
+- Hardware: `RTX 3090` on [vast.ai](https://vast.ai)
+- Training Cost: `~$20`
+- Training Time: `~2 days`
+- Number of steps: `434,000`
+- Tokens seen: `444 million`
+
+Training code is available in this repo. [Link to the training script](https://huggingface.co/naxalpha/gated-state-space/blob/main/app.py).
````
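The objective line added above ("Alternate between simple cross entropy and GPT-2 XL distillation") is terse, so here is a minimal, hypothetical sketch of what such an alternating loop could look like with the listed hyperparameters (batch size 8, sequence length 128, AdamW at 2e-5, gradient accumulation 4, gradient norm clipping 1.0). The real training code is `app.py` in this repo; the `GatedStateSpacesLM` constructor arguments, the `get_batch()` helper, and the KL-based distillation loss below are assumptions made for illustration only.

```python
# Hypothetical sketch of the alternating objective; the actual loop is in app.py.
# Class/argument names for the gated-state-spaces-pytorch model are recalled from
# that library's README and should be checked against the pinned commit.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel
from gated_state_spaces_pytorch import GatedStateSpacesLM

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher = GPT2LMHeadModel.from_pretrained("gpt2-xl").to(device).eval()        # distillation teacher
student = GatedStateSpacesLM(num_tokens=50257, dim=512, depth=12).to(device)  # placeholder dims

optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
accum, clip_norm = 4, 1.0

def get_batch():
    # stand-in for a real C4 dataloader yielding (batch=8, seq_len=128) GPT-2 token ids
    return torch.randint(0, 50257, (8, 128), device=device)

for step in range(1000):
    tokens = get_batch()
    inp, target = tokens[:, :-1], tokens[:, 1:]
    logits = student(inp)                                   # (8, 127, 50257)

    if step % 2 == 0:
        # plain next-token cross entropy on the C4 tokens
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1))
    else:
        # distillation step: match GPT-2 XL's predicted distribution
        with torch.no_grad():
            teacher_logits = teacher(inp).logits
        loss = F.kl_div(
            F.log_softmax(logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )

    (loss / accum).backward()
    if (step + 1) % accum == 0:
        torch.nn.utils.clip_grad_norm_(student.parameters(), clip_norm)
        optimizer.step()
        optimizer.zero_grad()
```

Alternating the two losses presumably keeps the model anchored to the C4 data on half the steps while borrowing GPT-2 XL's output distribution on the others.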
README.md CHANGED

The diff for README.md is identical to the diff for .ipynb_checkpoints/README-checkpoint.md shown above.
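The README's "How to use this." section is not itself part of this diff (only its heading appears above), so the following is only a rough, hypothetical sketch of loading the checkpoint with the pinned lucidrains implementation and sampling from it. The checkpoint filename, model dimensions, and the use of the GPT-2 tokenizer are assumptions; the README's own instructions are authoritative.

```python
# Hypothetical loading/generation sketch; defer to the repo's actual
# "How to use this" instructions. The checkpoint filename ("model.pt"),
# model dimensions, and GPT-2 tokenizer choice are assumptions.
import torch
from huggingface_hub import hf_hub_download
from transformers import GPT2TokenizerFast
from gated_state_spaces_pytorch import GatedStateSpacesLM

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GatedStateSpacesLM(num_tokens=50257, dim=512, depth=12)         # placeholder dims

ckpt_path = hf_hub_download("naxalpha/gated-state-space", "model.pt")   # assumed filename
model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
model.eval()

# simple greedy decoding; the README's main selling point is that the state-space
# layers can keep generating past the sequence length used during training
ids = tokenizer("The gated state space model", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(64):
        logits = model(ids)                          # (1, seq_len, 50257)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

print(tokenizer.decode(ids[0]))
```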