update info
README.md CHANGED
````diff
@@ -46,6 +46,7 @@ Since it is not based on [transformers](https://github.com/huggingface/transformers)
 model.net.to_logits[1].weight.copy_(emb)
 ```
 
+Training code is available in this repo; see the [training script](https://huggingface.co/naxalpha/gated-state-space/blob/main/app.py).
 
 ## Training Information
 
@@ -65,4 +66,16 @@ Here are the details of the training:
 - Tokens seen: `570 million`
 - Final loss: `~3.9`
 
-
+## Fine-Tuning Information
+
+[model2.pt](https://huggingface.co/naxalpha/gated-state-space/blob/main/) is available as a fine-tuned version of the model with a longer context length.
+
+- Objective: `Simple Cross Entropy`
+- Gradient Accumulation: `4`
+- Batch Size: `1`
+- Sequence Length: `2048`
+- Learning Rate: `5e-6`
+- Embeddings: `unfrozen for fine-tuning`
+- Gradient Norm Clipping: `1.0`
+- Hardware: `2x3090` on vast.ai
+- Extra Tricks: `Used HuggingFace Accelerate with Full Sharding without CPU offload`
````