Update README.md
README.md
CHANGED
@@ -6828,7 +6828,42 @@ but low-resource languages may see performance degradation.
 
 ## Training Details
 
-
+**Initialization**: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
+
+**First stage**: contrastive pre-training with weak supervision
+
+| Dataset                                                                                                | Weak supervision                      | # of text pairs |
+|--------------------------------------------------------------------------------------------------------|---------------------------------------|-----------------|
+| Filtered [mC4](https://huggingface.co/datasets/mc4)                                                    | (title, page content)                 | 1B              |
+| [CC News](https://huggingface.co/datasets/intfloat/multilingual_cc_news)                               | (title, news content)                 | 400M            |
+| [NLLB](https://huggingface.co/datasets/allenai/nllb)                                                   | translation pairs                     | 2.4B            |
+| [Wikipedia](https://huggingface.co/datasets/intfloat/wikipedia)                                        | (hierarchical section title, passage) | 150M            |
+| Filtered [Reddit](https://www.reddit.com/)                                                             | (comment, response)                   | 800M            |
+| [S2ORC](https://github.com/allenai/s2orc)                                                              | (title, abstract) and citation pairs  | 100M            |
+| [Stackexchange](https://stackexchange.com/)                                                            | (question, answer)                    | 50M             |
+| [xP3](https://huggingface.co/datasets/bigscience/xP3)                                                  | (input prompt, response)              | 80M             |
+| [Miscellaneous unsupervised SBERT data](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | -                                     | 10M             |
+
+**Second stage**: supervised fine-tuning
+
+| Dataset                                                                                | Language     | # of text pairs |
+|----------------------------------------------------------------------------------------|--------------|-----------------|
+| [MS MARCO](https://microsoft.github.io/msmarco/)                                       | English      | 500k            |
+| [NQ](https://github.com/facebookresearch/DPR)                                          | English      | 70k             |
+| [Trivia QA](https://github.com/facebookresearch/DPR)                                   | English      | 60k             |
+| [NLI from SimCSE](https://github.com/princeton-nlp/SimCSE)                             | English      | <300k           |
+| [ELI5](https://huggingface.co/datasets/eli5)                                           | English      | 500k            |
+| [DuReader Retrieval](https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval) | Chinese      | 86k             |
+| [KILT Fever](https://huggingface.co/datasets/kilt_tasks)                               | English      | 70k             |
+| [KILT HotpotQA](https://huggingface.co/datasets/kilt_tasks)                            | English      | 70k             |
+| [SQuAD](https://huggingface.co/datasets/squad)                                         | English      | 87k             |
+| [Quora](https://huggingface.co/datasets/quora)                                         | English      | 150k            |
+| [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi)                          | 11 languages | 50k             |
+| [MIRACL](https://huggingface.co/datasets/miracl/miracl)                                | 16 languages | 40k             |
+
+For all labeled datasets, we use only their training sets for fine-tuning.
+
+For other training details, please refer to our paper at [https://arxiv.org/pdf/2212.03533.pdf](https://arxiv.org/pdf/2212.03533.pdf).
 
 ## Benchmark Evaluation
 
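As a rough sanity check on the first-stage table in the diff above, the listed pair counts can be tallied to see how much each source contributes. The sketch below is illustrative only: the dictionary and function names are hypothetical, the counts are the rounded figures from the table, and the resulting shares describe the listed dataset sizes, not necessarily the sampling mix actually used during training.

```python
# Approximate first-stage pair counts, in millions, copied from the table
# in the diff above (1B = 1000M, 2.4B = 2400M). Names are illustrative.
FIRST_STAGE_PAIRS_M = {
    "mC4": 1000,
    "CC News": 400,
    "NLLB": 2400,
    "Wikipedia": 150,
    "Reddit": 800,
    "S2ORC": 100,
    "Stackexchange": 50,
    "xP3": 80,
    "Misc. SBERT data": 10,
}


def total_pairs_m(counts):
    """Return the total number of text pairs, in millions."""
    return sum(counts.values())


def shares(counts):
    """Return each dataset's fraction of the listed total."""
    total = total_pairs_m(counts)
    return {name: n / total for name, n in counts.items()}


# Print the total and the per-dataset shares, largest first.
print(f"total: {total_pairs_m(FIRST_STAGE_PAIRS_M)}M pairs")
for name, share in sorted(shares(FIRST_STAGE_PAIRS_M).items(),
                          key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18}{share:6.1%}")
```

By these rounded figures the first stage lists roughly 5B text pairs, with the NLLB translation pairs making up just under half of them.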