Nandan Thakur

nthakur

https://thakur-nandan.github.io

AI & ML interests

NLP, IR, QA

Recent Activity

upvoted a collection 2 days ago

updated a dataset 2 days ago

freshstack/leaderboard-results

liked a dataset 7 days ago

perplexity-ai/draco

Organizations

Posts 2

Post

1861

Last year, I curated & generated a few multilingual SFT and DPO datasets by translating English SFT/DPO datasets into 9-10 languages using the mistralai/Mistral-7B-Instruct-v0.2 model.

I hope it helps the community for pretraining/instruction tuning multilingual LLMs! I added a small diagram to briefly describe which datasets are added and their sources.

Happy to collaborate with anyone who wants to use these datasets for instruction fine-tuning or to extend the translations to newer English SFT/DPO datasets!

nthakur/multilingual-sft-and-dpo-datasets-67eaf56fe3feca5a57cf7d74

Post

3779

🦢 The SWIM-IR dataset contains 29 million text-retrieval training pairs across 27 diverse languages. It is one of the largest synthetic multilingual datasets generated using PaLM 2 on Wikipedia! 🔥🔥

The SWIM-IR dataset contains three subsets:
- Cross-lingual: nthakur/swim-ir-cross-lingual
- Monolingual: nthakur/swim-ir-monolingual
- Indic Cross-lingual: nthakur/indic-swim-ir-cross-lingual

Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7
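
For readers who want to pull one of the three subsets programmatically, here is a minimal sketch. The repo IDs come from the post above; the helper names `SWIM_IR_SUBSETS`, `dataset_url`, and `load_subset` are hypothetical, and the `config` and `split` values are assumptions that should be checked against each dataset card before use.

```python
# Repo IDs of the three SWIM-IR subsets, as listed in the post.
SWIM_IR_SUBSETS = {
    "cross-lingual": "nthakur/swim-ir-cross-lingual",
    "monolingual": "nthakur/swim-ir-monolingual",
    "indic-cross-lingual": "nthakur/indic-swim-ir-cross-lingual",
}


def dataset_url(subset: str) -> str:
    """Return the Hugging Face Hub page URL for a SWIM-IR subset."""
    repo_id = SWIM_IR_SUBSETS[subset]
    return f"https://huggingface.co/datasets/{repo_id}"


def load_subset(subset: str, config=None, split="train", streaming=True):
    """Load a subset with the `datasets` library (needs network access).

    `config` (e.g. a language or language-pair name) and `split` are
    assumptions here; consult the dataset card for the valid values.
    Streaming avoids downloading millions of pairs up front.
    """
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    return load_dataset(SWIM_IR_SUBSETS[subset], config, split=split, streaming=streaming)


if __name__ == "__main__":
    for name in SWIM_IR_SUBSETS:
        print(name, dataset_url(name))
```

Streaming is used as the default in this sketch because the full collection is large; switching `streaming=False` would download the selected configuration in full.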

Collections 5

Papers 16

arxiv:2508.06600

arxiv:2505.16967

arxiv:2504.20006

arxiv:2504.13128

Models 44

nthakur/qwen3-4b-grpo-modified-5-docs-infoseek-equal-mix-step-60

4B • Updated 8 days ago • 17

nthakur/qwen3-4b-grpo-modified-10-docs-odyssey-27k-step-60

4B • Updated Oct 2, 2025 • 2

nthakur/qwen3-4b-grpo-round-2-modified-10-docs-step-160

4B • Updated Sep 25, 2025

nthakur/qwen3-4b-grpo-mix-1-1-1-step-165

4B • Updated Sep 16, 2025

nthakur/qwen3-4b-grpo-infoseek-mix-1-1-1-step-25

4B • Updated Sep 15, 2025

nthakur/qwen3-4b-grpo-mix-1-2-4-step-225

4B • Updated Sep 10, 2025 • 2

nthakur/qwen3-4b-grpo-10-docs-modified-mix-1-1-1-step-385

4B • Updated Sep 8, 2025 • 1

nthakur/qwen3-4b-grpo-only-odyssey-step-210

4B • Updated Aug 27, 2025 • 2

nthakur/baseline-qwen3-4b-grpo-nq-hotpotqa-step-200

4B • Updated Aug 27, 2025

nthakur/baseline-qwen3-4b-ppo-nq-hotpotqa-step-200

4B • Updated Aug 20, 2025

Datasets 64

nthakur/odyssey-20K

Viewer • Updated Dec 17, 2025 • 20.1k • 33

nthakur/odyssey-verified-27K-oracled-round-2

Viewer • Updated Dec 11, 2025 • 12.3k • 38

nthakur/odyssey-verified-hard-17K

Viewer • Updated Sep 16, 2025 • 17.5k • 4

nthakur/odyssey-verified-27K

Viewer • Updated Sep 13, 2025 • 27.1k • 29

nthakur/search-arena-v1-nuggets-with-urls-5k-qwen

Viewer • Updated Jul 29, 2025 • 5.1k • 14

nthakur/auto-browsecomp-18k

Viewer • Updated Jun 23, 2025 • 18k • 12

nthakur/auto-browsecomp-10k

Viewer • Updated Jun 17, 2025 • 9.88k • 15

nthakur/cornstack-6-langs-v1-tevatron-6M

Viewer • Updated Jun 3, 2025 • 5.92M • 74

nthakur/cornstack-php-v1-tevatron-1M

Viewer • Updated Jun 2, 2025 • 993k • 71

nthakur/cornstack-go-v1-tevatron-1M

Viewer • Updated May 30, 2025 • 995k • 82
