Update README.md
README.md
CHANGED
@@ -50,6 +50,20 @@ Here is the table summarizing the architecture used for training, along with the
 | [bloomz-3b-sft-chat](https://huggingface.co/cmarkea/bloomz-3b-sft-chat) | 1 x A100 40GB | 140 | 13 |
 | [bloomz-7b1-mt-sft-chat](https://huggingface.co/cmarkea/bloomz-7b1-mt-sft-chat) | 4 x A100 40GB | 268 | 8 |
 
+| Hyperparameter        | Value      |
+|:---------------------:|:----------:|
+| label smoothing       | 0.05       |
+| optimizer             | AdamW      |
+| betas                 | 0.9, 0.999 |
+| AMSGrad               | True       |
+| learning rate         | 5e-6       |
+| anneal strategy       | cos        |
+| div factor            | 100        |
+| final div factor      | 0.1        |
+| batch size            | 16         |
+| gradient accumulation | 25         |
+| max length            | 1500       |
+
 Experimentations
 ----------------
 Because the model was trained only on English and French corpora, its performance cannot be guaranteed in other languages. The degradation in other languages is also due to the change in the model's data type from float16 to bfloat16. The conversation example below illustrates this point:
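
Review note: the added table lists values only, so the sketch below shows how they might fit together. The README does not name the scheduler, so this rests on assumptions. The row names "anneal strategy", "div factor" and "final div factor" match the `anneal_strategy`, `div_factor` and `final_div_factor` arguments of PyTorch's `OneCycleLR`, so the sketch follows that convention and treats "learning rate" as the peak rate. `PCT_START` and `TOTAL_STEPS` are hypothetical values that do not appear in the table, and the batch size is assumed to be counted per device.

```python
import math

# Values copied from the hyperparameter table in the diff.
MAX_LR = 5e-6            # "learning rate", assumed to be the peak rate
DIV_FACTOR = 100         # initial_lr = MAX_LR / DIV_FACTOR
FINAL_DIV_FACTOR = 0.1   # final_lr = initial_lr / FINAL_DIV_FACTOR
BATCH_SIZE = 16
GRAD_ACCUMULATION = 25

# Assumptions: neither value is given in the README.
PCT_START = 0.3          # fraction of steps spent increasing the rate
TOTAL_STEPS = 1000       # hypothetical number of optimizer steps


def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)


def one_cycle_lr(step, total_steps=TOTAL_STEPS):
    """Learning rate at a given optimizer step of a two-phase one-cycle schedule."""
    initial_lr = MAX_LR / DIV_FACTOR
    final_lr = initial_lr / FINAL_DIV_FACTOR
    warmup_end = PCT_START * total_steps - 1
    last_step = total_steps - 1
    if step <= warmup_end:
        # Phase 1: rise from the initial rate to the peak rate.
        return cos_anneal(initial_lr, MAX_LR, step / warmup_end)
    # Phase 2: decay from the peak rate to the final rate.
    pct = (step - warmup_end) / (last_step - warmup_end)
    return cos_anneal(MAX_LR, final_lr, pct)


# Sequences contributing to one optimizer step on a single device.
effective_batch = BATCH_SIZE * GRAD_ACCUMULATION

if __name__ == "__main__":
    print("effective batch per device:", effective_batch)
    for step in (0, 299, 999):
        print(step, one_cycle_lr(step))
```

Under these assumptions the rate starts near 5e-8, peaks at 5e-6 and ends near 5e-7. Because the final div factor is below 1, the final rate is higher than the starting rate. Each optimizer step accumulates 16 x 25 = 400 sequences per device.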