@SivilTaram on Hugging Face: "✨ Today, we're excited to share the full data processing script used in…"

Post

1744

✨ Today, we're excited to share the full data processing script used in developing our Sailor models. The repo provides an end-to-end data processing pipeline for LLM training. 🚀

💻Code: https://github.com/sail-sg/sailcraft
🤗Model: sail/sailor-language-models-65e19a749f978976f1959825
📜Paper: Sailor: Open Language Models for South-East Asia (2404.03608)
🌐Homepage: https://sailorllm.github.io

# Overview 🔍

The pipeline consists of 4 stages🧹:
1️⃣ Initial data cleaning
2️⃣ Near deduplication
3️⃣ Exact deduplication
4️⃣ Second round of data cleaning

A special focus was given to the data cleaning part of South-East Asian (SEA) languages🌍

# Use Case ✨

With this codebase, you can clean your own dataset with:

✅ Get filtered data counts after each processing stage
✅ Easily configure language-specific cleaning rules (we support Arabic, Bengali, Catalan, Spanish, Basque, French, Hindi, Portuguese, Urdu, and optimize for English, Indonesian, Vietnamese, Chinese, Thai, Lao, Malay)
✅ Investigate what data was removed at each processing stage

# Acknowledgement 🙏

The main credit goes to @dreamerdeo , the first author of our Sailor paper ❤️! He put in tremendous effort on the data processing pipeline, enabling the model's great performance. We believe the mini repo will be a valuable resource for researchers working on dataset curation for large language models. 🎉

Sharing the recipe openly aligns with our commitment to open language model development. 💪 And this repo would not have been possible without the contributions from the open community, including the BigScience data cleaning tool, the all-in-one deduplication tool by @chenghao , and the deduplication project from Google. 🧠

# What's Next 🚀

Share your thoughts or leave any comments on what you'd like the Sailor models to do! We also have some exciting news coming soon, and please stay tuned. 🚄

Join the conversation