intfloat committed on
Commit 0baed7b
1 Parent(s): 3bd751f

Update README.md

  1. README.md +36 -1
README.md CHANGED
@@ -6828,7 +6828,42 @@ but low-resource languages may see performance degradation.
  ## Training Details

- Please refer to our paper at [https://arxiv.org/pdf/2212.03533.pdf](https://arxiv.org/pdf/2212.03533.pdf).
+ **Initialization**: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
+
+ **First stage**: contrastive pre-training with weak supervision
+
+ | Dataset | Weak supervision | # of text pairs |
+ |--------------------------------------------------------------------------------------------------------|---------------------------------------|-----------------|
+ | Filtered [mC4](https://huggingface.co/datasets/mc4) | (title, page content) | 1B |
+ | [CC News](https://huggingface.co/datasets/intfloat/multilingual_cc_news) | (title, news content) | 400M |
+ | [NLLB](https://huggingface.co/datasets/allenai/nllb) | translation pairs | 2.4B |
+ | [Wikipedia](https://huggingface.co/datasets/intfloat/wikipedia) | (hierarchical section title, passage) | 150M |
+ | Filtered [Reddit](https://www.reddit.com/) | (comment, response) | 800M |
+ | [S2ORC](https://github.com/allenai/s2orc) | (title, abstract) and citation pairs | 100M |
+ | [Stackexchange](https://stackexchange.com/) | (question, answer) | 50M |
+ | [xP3](https://huggingface.co/datasets/bigscience/xP3) | (input prompt, response) | 80M |
+ | [Miscellaneous unsupervised SBERT data](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | - | 10M |
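
The first stage above names contrastive pre-training on weakly supervised text pairs, but the README change does not spell out the objective. As a rough illustration only, the sketch below computes an InfoNCE-style loss with in-batch negatives over toy vectors. The function names, the temperature of 0.05, and the toy vectors are all assumptions made for this example, not details from the commit; the paper linked in this section is the authority on the actual loss and hyperparameters.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss where passages[i] is the positive for queries[i].

    Every other passage in the batch serves as a negative (in-batch negatives).
    The temperature is a hypothetical value chosen for this sketch.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive passage.
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


# Toy 2-d "embeddings" standing in for encoder outputs of (title, content) pairs.
titles = [[1.0, 0.0], [0.0, 1.0]]
contents = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce_loss(titles, contents)
shuffled = info_nce_loss(titles, list(reversed(contents)))
print(f"aligned pairs:  {aligned:.4f}")
print(f"shuffled pairs: {shuffled:.4f}")
```

When each title is paired with its own content, the loss is close to zero. Reversing the contents makes each positive the least similar passage in the batch, so the loss is much larger. This is the signal that pulls paired texts together during training.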
+
+ **Second stage**: supervised fine-tuning
+
+ | Dataset | Language | # of text pairs |
+ |----------------------------------------------------------------------------------------|--------------|-----------------|
+ | [MS MARCO](https://microsoft.github.io/msmarco/) | English | 500k |
+ | [NQ](https://github.com/facebookresearch/DPR) | English | 70k |
+ | [Trivia QA](https://github.com/facebookresearch/DPR) | English | 60k |
+ | [NLI from SimCSE](https://github.com/princeton-nlp/SimCSE) | English | <300k |
+ | [ELI5](https://huggingface.co/datasets/eli5) | English | 500k |
+ | [DuReader Retrieval](https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval) | Chinese | 86k |
+ | [KILT Fever](https://huggingface.co/datasets/kilt_tasks) | English | 70k |
+ | [KILT HotpotQA](https://huggingface.co/datasets/kilt_tasks) | English | 70k |
+ | [SQuAD](https://huggingface.co/datasets/squad) | English | 87k |
+ | [Quora](https://huggingface.co/datasets/quora) | English | 150k |
+ | [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) | 11 languages | 50k |
+ | [MIRACL](https://huggingface.co/datasets/miracl/miracl) | 16 languages | 40k |
+
+ For every labeled dataset, we use only its training set for fine-tuning.
+
+ For other training details, please refer to our paper at [https://arxiv.org/pdf/2212.03533.pdf](https://arxiv.org/pdf/2212.03533.pdf).

  ## Benchmark Evaluation