Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval
Paper: arXiv:2311.05800
29 million Synthetic Wikipedia-based Multilingual Retrieval Training Pairs.
Note: SWIM-IR (Cross-lingual) dataset, where the query is in the target language and the passage is in English.
Note: SWIM-IR (Monolingual) dataset, where both the query and the passage are in the target language.
Note: Indic SWIM-IR (Cross-lingual) dataset, where the query is in an Indic language and the passage is in English.