Teja-Gollapudi committed on
Commit
289b600
1 Parent(s): c51d22f

Update README.md

  1. README.md +57 -16
README.md CHANGED
@@ -1,14 +1,28 @@
1
 
2
  # tinyroberta-mrqa
3
 
4
  This is the *distilled* version of the [VMware/roberta-large-mrqa](https://huggingface.co/VMware/roberta-large-mrqa) model. This model has a comparable prediction quality to the base model and runs twice as fast.
5
 
6
  ## Overview
7
- **Language model:** tinyroberta-mrqa
8
- **Language:** English
9
- **Downstream-task:** Extractive QA
10
- **Training data:** MRQA
11
- **Eval data:** MRQA
12
 
13
  ## Hyperparameters
14
 
@@ -35,11 +49,11 @@ We have finetuned on the MRQA training set.
35
  num_train_epochs=3,
36
  weight_decay=0.01,
37
  per_device_train_batch_size=16,
38
- n_gpus = 3
39
  ```
40
 
41
  ## Distillation
42
- This model is inspired by deepset/tinyroberta-squad2.
43
  We start with a base checkpoint of [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) and perform further task prediction layer distillation on [VMware/roberta-large-mrqa](https://huggingface.co/VMware/roberta-large-mrqa).
44
  We then fine-tune it on MRQA.
45
 
@@ -54,8 +68,8 @@ model_name = "VMware/tinyroberta-mrqa"
54
  # a) Get predictions
55
  nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
56
  QA_input = {
57
- 'question': '',
58
- 'context': ''
59
  }
60
  res = nlp(QA_input)
61
 
@@ -64,15 +78,42 @@ model = AutoModelForQuestionAnswering.from_pretrained(model_name)
64
  tokenizer = AutoTokenizer.from_pretrained(model_name)
65
  ```
66
 
67
- ## Performance
68
 
69
- We have evaluated the model on the MRQA dev set and test set using SQuAD metrics.
70
 
71
- ```
72
- eval exact match: 69.2
73
- eval f1 score: 79.6
74
 
75
- test exact match: 52.8
76
- test f1 score: 63.4
77
 
78
  ```
 
1
 
2
+ ---
3
+ license: apache-2.0
4
+ datasets:
5
+ - mrqa
6
+ language:
7
+ - en
8
+ metrics:
9
+ - exact_match
10
+ - f1
11
+ ---
12
+
13
  # tinyroberta-mrqa
14
 
15
  This is the *distilled* version of the [VMware/roberta-large-mrqa](https://huggingface.co/VMware/roberta-large-mrqa) model. This model has a comparable prediction quality to the base model and runs twice as fast.
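The README does not say how the speedup was measured. As a rough check, the sketch below compares the eval times reported in the Model Family Performance table further down. It assumes those times were measured on the same hardware and data, which the README does not state.

```python
def to_seconds(clock: str) -> int:
    """Convert an 'H:MM:SS' or 'M:SS' string to seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Eval times copied from the Model Family Performance table.
eval_times = {
    "RoBERTa-Large": "1:01:10",
    "RoBERTa-Base": "18:59",
    "TinyRoBERTa": "9:54",
}

tiny_seconds = to_seconds(eval_times["TinyRoBERTa"])

# Ratio of each model's eval time to TinyRoBERTa's eval time.
speedups = {
    name: to_seconds(clock) / tiny_seconds
    for name, clock in eval_times.items()
    if name != "TinyRoBERTa"
}

for name, ratio in speedups.items():
    print(f"{name} takes {ratio:.2f}x the eval time of TinyRoBERTa")
```

By these numbers, the "twice as fast" figure matches the comparison with RoBERTa-Base, and the gap to RoBERTa-Large is larger.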
16
 
17
  ## Overview
18
+ - **Model name:** tinyroberta-mrqa
19
+ - **Model type:** Extractive Question Answering
20
+ - **Teacher Model:** [VMware/roberta-large-mrqa](https://huggingface.co/VMware/roberta-large-mrqa)
21
+ - **Training dataset:** [MRQA](https://huggingface.co/datasets/mrqa) (Machine Reading for Question Answering)
22
+ - **Training data size:** 516,819 examples
23
+ - **Language:** English
24
+ - **Framework:** PyTorch
25
+ - **Model version:** 1.0
26
 
27
  ## Hyperparameters
28
 
 
49
  num_train_epochs=3,
50
  weight_decay=0.01,
51
  per_device_train_batch_size=16,
52
+ n_gpus = 1
53
  ```
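For a sense of scale, the visible hyperparameters can be combined with the training data size from the Overview to estimate the number of optimizer steps. This is a back-of-the-envelope sketch: `gradient_accumulation_steps` is an assumption (it is not shown in this excerpt), and long contexts are usually split into several features, so the real step count is likely higher.

```python
import math

train_examples = 516_819          # "Training data size" from the Overview
per_device_train_batch_size = 16  # from the hyperparameters
n_gpus = 1                        # from the hyperparameters
num_train_epochs = 3              # from the hyperparameters
gradient_accumulation_steps = 1   # assumption: not shown in this excerpt

# Examples consumed per optimizer step.
effective_batch = per_device_train_batch_size * n_gpus * gradient_accumulation_steps

# Lower bound on steps: one feature per example, last partial batch kept.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch, steps_per_epoch, total_steps)
```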
54
 
55
  ## Distillation
56
+ This model is inspired by [deepset/tinyroberta-squad2](https://huggingface.co/deepset/tinyroberta-squad2) and the TinyBERT paper.
57
  We start with a base checkpoint of [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) and perform further task prediction layer distillation on [VMware/roberta-large-mrqa](https://huggingface.co/VMware/roberta-large-mrqa).
58
  We then fine-tune it on MRQA.
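The README does not give the exact distillation objective. A minimal sketch of a TinyBERT-style prediction layer loss, assuming it is applied separately to the start and end logits of the QA head, is the soft cross-entropy between temperature-scaled teacher and student distributions. The function names, the averaging of the two terms, and the example logits are all illustrative.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student distribution against the teacher's soft targets."""
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def qa_distillation_loss(student_start, student_end,
                         teacher_start, teacher_end, temperature=1.0):
    """Average the soft cross-entropy over start and end logits (an assumption)."""
    return 0.5 * (
        soft_cross_entropy(student_start, teacher_start, temperature)
        + soft_cross_entropy(student_end, teacher_end, temperature)
    )


# Made-up logits over a four-token context.
teacher_start, teacher_end = [0.1, 4.0, 0.3, 0.2], [0.0, 0.2, 0.5, 3.5]
close_student = ([0.2, 3.6, 0.4, 0.1], [0.1, 0.3, 0.4, 3.0])
far_student = ([2.0, 0.5, 0.1, 0.3], [1.5, 0.2, 0.1, 0.4])

loss_close = qa_distillation_loss(*close_student, teacher_start, teacher_end)
loss_far = qa_distillation_loss(*far_student, teacher_start, teacher_end)
print(loss_close < loss_far)
```

A student whose logits track the teacher's gets a lower loss than one that puts its mass on different tokens, which is the signal the distillation step optimizes.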
59
 
 
68
  # a) Get predictions
69
  nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
70
  QA_input = {
71
+ 'context': "We present the results of the Machine Reading for Question Answering (MRQA) 2019 shared task on evaluating the generalization capabilities of reading comprehension systems. In this task, we adapted and unified 18 distinct question answering datasets into the same format. Among them, six datasets were made available for training, six datasets were made available for development, and the final six were hidden for final evaluation. Ten teams submitted systems, which explored various ideas including data sampling, multi-task learning, adversarial training and ensembling. The best system achieved an average F1 score of 72.5 on the 12 held-out datasets, 10.7 absolute points higher than our initial baseline based on BERT.",
72
+ 'question': "What is MRQA?"
73
  }
74
  res = nlp(QA_input)
75
 
 
78
  tokenizer = AutoTokenizer.from_pretrained(model_name)
79
  ```
80
 
81
+ ## Model Family Performance
82
 
83
+ | Parent Language Model | Number of Parameters | Training Time | Eval Time | Test Time | Eval EM | Eval F1 | Test EM | Test F1 |
84
+ |---|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
85
+ | BERT-Tiny | 4,369,666 | 26:11 | 0:41 | 0:04 | 22.78 | 32.42 | 10.18 | 18.72 |
86
+ | BERT-Base | 108,893,186 | 8:39:10 | 18:42 | 2:13 | 64.48 | 76.14 | 48.89 | 59.89 |
87
+ | BERT-Large | 334,094,338 | 28:35:38 | 1:00:56 | 7:14 | 69.52 | 80.50 | 55.00 | 65.78 |
88
+ | DeBERTa-v3-Extra-Small | 70,682,882 | 5:19:05 | 11:29 | 1:16 | 65.58 | 77.17 | 50.92 | 62.58 |
89
+ | DeBERTa-v3-Base | 183,833,090 | 12:13:41 | 28:18 | 3:09 | 71.43 | 82.59 | 59.49 | 70.46 |
90
+ | DeBERTa-v3-Large | 434,014,210 | 38:36:13 | 1:25:47 | 9:33 | **76.08** | **86.23** | **64.27** | **75.22** |
91
+ | ELECTRA-Small | 13,483,522 | 2:16:36 | 3:55 | 0:27 | 57.63 | 69.38 | 38.68 | 51.56 |
92
+ | ELECTRA-Base | 108,893,186 | 8:40:57 | 18:41 | 2:12 | 68.78 | 80.16 | 54.70 | 65.80 |
93
+ | ELECTRA-Large-Discriminator | 334,094,338 | 28:31:59 | 1:00:40 | 7:13 | 74.15 | 84.96 | 62.35 | 73.28 |
94
+ | MiniLMv2-L6-H384-from-BERT-Large | 22,566,146 | 2:12:48 | 4:23 | 0:40 | 59.31 | 71.09 | 41.78 | 53.30 |
95
+ | MiniLMv2-L6-H768-from-BERT-Large | 66,365,954 | 4:42:59 | 10:01 | 1:10 | 64.27 | 75.84 | 49.05 | 59.82 |
96
+ | MiniLMv2-L6-H384-from-RoBERTa-Large | 30,147,842 | 2:15:10 | 4:19 | 0:30 | 59.27 | 70.64 | 42.95 | 54.03 |
97
+ | MiniLMv2-L12-H384-from-RoBERTa-Large | 40,794,626 | 4:14:22 | 8:27 | 0:58 | 64.58 | 76.23 | 51.28 | 62.83 |
98
+ | MiniLMv2-L6-H768-from-RoBERTa-Large | 81,529,346 | 4:39:02 | 9:34 | 1:06 | 65.80 | 77.17 | 51.72 | 63.27 |
99
+ | RoBERTa-Base | 124,056,578 | 8:50:29 | 18:59 | 2:11 | 69.06 | 80.08 | 55.53 | 66.49 |
100
+ | RoBERTa-Large | 354,312,194 | 29:16:06 | 1:01:10 | 7:04 | 74.08 | 84.38 | 62.20 | 72.88 |
101
+ | TinyRoBERTa | 81,529,346 | --- | 9:54 | 1:04 | 69.38 | 80.07 | 53.29 | 64.16 |
102
+
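The EM and F1 columns are the `exact_match` and `f1` metrics listed in the model card metadata. A minimal sketch of how these are conventionally computed for a single prediction, following the SQuAD v1.1 normalization (lowercase, strip punctuation and articles, collapse whitespace), is shown below. The exact evaluation script used for the table is not given, so treat this as an approximation. A full evaluation also takes the maximum over all reference answers and averages over examples, reporting percentages.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The MRQA shared task.", "MRQA shared task"))
print(f1_score("a shared task", "the MRQA shared task"))
```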
103
+
104
+
105
+ ## Limitations and Bias
106
+
107
+ The model was trained on a large and diverse dataset, but it still has limitations and may be biased in some areas. Known limitations include:
108
+
109
+ - Language: The model is designed to work with English text only and may not perform as well on other languages.
110
+
111
+ - Domain-specific knowledge: The model has been trained on a general dataset and may not perform well on questions that require domain-specific knowledge.
112
+
113
+ - Out-of-distribution questions: The model may struggle with questions that fall outside the scope of the MRQA dataset. This is best illustrated by the gap between its eval and test scores (80.07 vs. 64.16 F1 in the table above).
114
+
115
+ In addition, the model may inherit biases from its training data. The dataset includes questions from a variety of sources, but it may not be representative of all populations or perspectives. As a result, the model may perform better or worse on certain types of questions or texts.
116
 
117
 
 
 
118
 
119
  ```