sberbank-ai committed
Commit 6777adb • 1 Parent(s): 856eccb

Update README.md
README.md CHANGED
```diff
@@ -10,12 +10,18 @@ Architecture based on T5.
 
 It has 24 layers and a hidden size of 1536.
 
-Model
+The model was trained on a mixture of 7 denoisers, like UL2, with several differences.
 
 It was trained on a Russian-language corpus (300 GB). The dataset is the same as for the ruT5 models.
 
-Bbpe tokenizer.
+BBPE tokenizer.
 
+For the first half of training, the model was trained on a small part (1%, 3 GB) of all datasets, without prefixes in each task.
+
+For RSG, we trained as described in the T5 paper: first we trained multitask on all tasks, then took the best checkpoint for each task and trained it further.
+
+Training loss:
+![Screenshot 2023-01-21 at 11.36.52.png](https://s3.amazonaws.com/moonup/production/uploads/1674290304538-5f91b1208a61a359f44e1851.png)
 
 We continue to experiment...
 
```
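For readers landing on this commit, a minimal usage sketch may help. It assumes the checkpoint loads through Hugging Face `transformers` with `T5ForConditionalGeneration` and a byte-level BPE tokenizer (`GPT2Tokenizer`), and that tasks are selected with a denoiser prefix such as `<LM>`; the repo id `sberbank-ai/FRED-T5-1.7B` and the prefix token are assumptions, not stated in this commit.

```python
# Minimal usage sketch; the repo id, tokenizer class, and "<LM>" prefix are assumptions.
from transformers import GPT2Tokenizer, T5ForConditionalGeneration

model_name = "sberbank-ai/FRED-T5-1.7B"  # assumed checkpoint name
tokenizer = GPT2Tokenizer.from_pretrained(model_name)  # byte-level BPE (BBPE)
model = T5ForConditionalGeneration.from_pretrained(model_name)  # T5, 24 layers, d_model=1536

# Prefixes were added in the second half of training, so a prefix selects the
# denoiser; "<LM>" (plain language modeling) is one plausible choice.
inputs = tokenizer("<LM>Мороз и солнце, день чудесный.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```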
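The line about "a mixture of 7 denoisers, like UL2" refers to span-corruption-style objectives. Below is a hedged, self-contained sketch of one such denoiser; the corruption rate, mean span length, sentinel tokens, and the `<SC1>` prefix are illustrative values, not the model's actual configuration.

```python
import random

def span_corrupt(tokens, noise_density=0.15, mean_span_len=3.0, prefix="<SC1>"):
    """One UL2/T5-style denoiser: replace random spans with sentinels in the
    input; the target lists each sentinel followed by the span it hid."""
    inp, tgt, i, sent = [prefix], [], 0, 0
    while i < len(tokens):
        # Start a span with probability chosen so ~noise_density of tokens get masked.
        if random.random() < noise_density / mean_span_len:
            span = max(1, round(random.expovariate(1.0 / mean_span_len)))
            tgt += [f"<extra_id_{sent}>"] + tokens[i:i + span]
            inp.append(f"<extra_id_{sent}>")
            sent += 1
            i += span
        else:
            inp.append(tokens[i])
            i += 1
    return inp, tgt

random.seed(0)
words = "мороз и солнце день чудесный".split()  # "frost and sun, a wonderful day"
print(span_corrupt(words))
```

A mixture of 7 such denoisers would vary `noise_density`, `mean_span_len`, and the prefix per denoiser, UL2-style.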
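The RSG recipe, multitask first and then per-task continuation from the best checkpoint, is easy to misread, so here is a schematic sketch of the control flow only; `train` and the task data are toy stand-ins, not anything from the authors' pipeline.

```python
# Schematic only: `train` and the task data are stand-ins for real fine-tuning.
def train(checkpoint, examples):
    """Stand-in fine-tuning step: 'absorbs' the examples into the checkpoint."""
    return checkpoint + [src for src, _ in examples]

tasks = {  # toy stand-ins for RSG tasks
    "rcb": [("premise ... hypothesis ...", "entailment")],
    "danetqa": [("question ... passage ...", "true")],
}

# Stage 1 (as in the T5 paper): one multitask run over all tasks,
# with a task prefix on every example.
pool = [(f"{name}: {src}", tgt) for name, pairs in tasks.items() for src, tgt in pairs]
best_multitask_ckpt = train([], pool)

# Stage 2: for each task, continue training from the best multitask checkpoint.
finetuned = {name: train(list(best_multitask_ckpt), pairs) for name, pairs in tasks.items()}
print(sorted(finetuned))
```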