inputs = tokenizer(text, return_tensors='pt')
predictions = model(**inputs)
```
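
As a short continuation of the snippet above: if `model` was loaded with a masked-LM head (e.g. `AutoModelForMaskedLM`) and the input `text` contains a `<mask>` token, the top prediction can be decoded as sketched below. Both of those are assumptions about the snippet, not statements from this card.

```python
# Continuation of the snippet above; assumes `tokenizer`, `inputs`, and
# `predictions` as defined there, a masked-LM head on `model`, and a
# single <mask> token in the input text (assumptions, see above).
mask_positions = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = predictions.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(top_ids))
```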

## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

A mix of the following data sources:

* Wikipedia
* Books
* Twitter comments
* Pikabu
* Proza.ru
* Film subtitles
* News websites
* Social corpus

~500 GB of raw text in total.

### Training Procedure

#### Training Hyperparameters

- **Training regime:** fp16 mixed precision
- **Training framework:** Fairseq
- **Optimizer:** Adam
- **Adam betas:** 0.9, 0.98
- **Adam eps:** 1e-6
- **Num training steps:** 500k
- **Train batch size:** 4096

The model was trained on 8×A100 GPUs for ~22 days.
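
For illustration, here is a minimal PyTorch sketch of the Adam configuration listed above. The betas and eps come from this card; the learning rate is not stated here, so the value shown is a placeholder, and the actual run used Fairseq's Adam implementation rather than this snippet.

```python
import torch
from torch import nn

# Sketch of the Adam settings listed above. Betas and eps are from the
# card; the learning rate is NOT specified there, so 1e-4 is a placeholder.
# The actual training used Fairseq's Adam implementation.
model = nn.Linear(768, 768)  # stand-in module for illustration only
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,            # placeholder; LR not given in this card
    betas=(0.9, 0.98),
    eps=1e-6,
)
```

For scale: at 4096 sequences of up to 512 tokens per step, 500k steps correspond to roughly 1 trillion tokens processed (assuming full-length sequences).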

#### Architecture details

Standard RoBERTa-base parameters:

- **Activation function:** gelu
- **Attention dropout:** 0.1
- **Dropout:** 0.1
- **Encoder attention heads:** 12
- **Encoder embed dim:** 768
- **Encoder ffn embed dim:** 3,072
- **Encoder layers:** 12
- **Max positions:** 512
- **Vocab size:** 50,266
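
For readers using the model through Hugging Face `transformers`, here is one way the Fairseq parameter names above might map onto a `RobertaConfig`. This is an illustrative sketch, not the repository's shipped `config.json`; note in particular that HF RoBERTa conventionally stores `max_position_embeddings = 514` for a 512-token model because of its position-id offset.

```python
from transformers import RobertaConfig

# Sketch: the architecture above expressed in transformers naming.
# Not the repository's actual config.json.
config = RobertaConfig(
    hidden_act="gelu",                 # activation function
    attention_probs_dropout_prob=0.1,  # attention dropout
    hidden_dropout_prob=0.1,           # dropout
    num_attention_heads=12,            # encoder attention heads
    hidden_size=768,                   # encoder embed dim
    intermediate_size=3072,            # encoder ffn embed dim
    num_hidden_layers=12,              # encoder layers
    max_position_embeddings=514,       # 512 positions + RoBERTa's 2-id offset
    vocab_size=50266,
)
```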

## Evaluation

Results on the Russian SuperGLUE dev set:

| Model              | RCB   | PARus | MuSeRC | TERRa | RUSSE | RWSD  | DaNetQA | Score |
|--------------------|-------|-------|--------|-------|-------|-------|---------|-------|
| vk-roberta-base    | 0.46  | 0.56  | 0.679  | 0.769 | 0.960 | 0.569 | 0.658   | 0.665 |
| vk-deberta-distill | 0.433 | 0.56  | 0.625  | 0.59  | 0.943 | 0.569 | 0.726   | 0.635 |
| vk-deberta-base    | 0.450 | 0.61  | 0.722  | 0.704 | 0.948 | 0.578 | 0.76    | 0.682 |
| vk-bert-base       | 0.467 | 0.57  | 0.587  | 0.704 | 0.953 | 0.583 | 0.737   | 0.657 |
| sber-roberta-large | 0.463 | 0.61  | 0.775  | 0.886 | 0.946 | 0.564 | 0.761   | 0.715 |
| sber-bert-base     | 0.491 | 0.61  | 0.663  | 0.769 | 0.962 | 0.574 | 0.678   | 0.678 |