Update README.md
README.md
CHANGED
@@ -52,16 +52,16 @@ Here is the table summarizing the architecture used for training, along with the
 
 | Hyperparameter        | Value      |
 |:---------------------:|:----------:|
-
-
-
-
-
-
-
-
+| label smoothing       | 0.05       |
+| optimize              | AdamW      |
+| betas                 | 0.9, 0.999 |
+| learning rate         | 5e-6       |
+| anneal strategy       | cos        |
+| div factor            | 100        |
+| final div factor      | 0.1        |
+| batch size            | 2          |
 | gradient accumulation | 200        |
-
+| max length            | 2048       |
 
 Experimentations
 ----------------
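A rough sketch of what the values in the table imply, in case it helps when reading the diff. It assumes, without confirmation from the README, that "learning rate" is the peak rate of a one-cycle schedule and that "div factor" / "final div factor" follow the convention used by PyTorch's `OneCycleLR` (initial rate = peak / div factor, final rate = initial / final div factor). The helper names and the `pct_start=0.3` warm-up fraction are hypothetical, chosen only for illustration.

```python
import math

# Values copied from the hyperparameter table in the diff.
LEARNING_RATE = 5e-6       # ASSUMPTION: treated as the peak (max) learning rate
DIV_FACTOR = 100
FINAL_DIV_FACTOR = 0.1
BATCH_SIZE = 2
GRAD_ACCUMULATION = 200
MAX_LENGTH = 2048


def effective_batch_size(batch_size, grad_accumulation):
    """Sequences that contribute to a single optimizer step."""
    return batch_size * grad_accumulation


def one_cycle_bounds(max_lr, div_factor, final_div_factor):
    """Initial and final learning rates under the assumed one-cycle convention."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    return initial_lr, min_lr


def cosine_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)


def one_cycle_lr(step, total_steps, max_lr, div_factor, final_div_factor,
                 pct_start=0.3):
    """Hypothetical two-phase schedule: cosine warm-up to max_lr, then cosine decay.

    pct_start is an illustrative warm-up fraction, not a value from the README.
    """
    if total_steps < 2 or not 0.0 < pct_start < 1.0:
        raise ValueError("need total_steps >= 2 and 0 < pct_start < 1")
    initial_lr, min_lr = one_cycle_bounds(max_lr, div_factor, final_div_factor)
    last_step = total_steps - 1
    peak_step = pct_start * last_step
    if step <= peak_step:
        return cosine_anneal(initial_lr, max_lr, step / peak_step)
    return cosine_anneal(max_lr, min_lr,
                         (step - peak_step) / (last_step - peak_step))


if __name__ == "__main__":
    print("sequences per optimizer step:",
          effective_batch_size(BATCH_SIZE, GRAD_ACCUMULATION))
    print("max tokens per optimizer step:",
          effective_batch_size(BATCH_SIZE, GRAD_ACCUMULATION) * MAX_LENGTH)
    print("initial / final lr:",
          one_cycle_bounds(LEARNING_RATE, DIV_FACTOR, FINAL_DIV_FACTOR))
```

Under these assumptions, a batch size of 2 with 200 accumulation steps gives 400 sequences per optimizer step. A final div factor below 1 means the schedule would end at a rate higher than the one it started from (5e-7 versus 5e-8), which is worth checking against the actual training script.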