OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities, but their success depends heavily on the quality of their pretraining corpora. For Chinese LLMs, the scarcity of high-quality Chinese datasets presents a significant challenge and often limits their performance. To address this issue, we propose the OpenCSG Chinese Corpus, a series of high-quality datasets specifically designed for LLM pretraining, post-training, and fine-tuning. The corpus comprises Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics: the Fineweb-edu-chinese datasets focus on filtered, high-quality content derived from diverse Chinese web sources; Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training; and Smoltalk-chinese emphasizes stylistically diverse, chat-format data. The OpenCSG Chinese Corpus is characterized by high-quality text, broad domain coverage, and a scalable, reproducible data-curation process. We also conducted extensive experimental analyses, including evaluations of smaller models, demonstrating significant performance improvements on benchmarks such as C-Eval and showcasing the effectiveness of the corpus for training Chinese LLMs.
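To make the curation idea concrete, below is a minimal sketch of FineWeb-edu-style quality filtering: a classifier scores each web document for educational value, and only documents clearing a confidence threshold are kept. The model ID (`opencsg/edu-quality-scorer`), the `educational` label, and the 0.9 threshold are hypothetical placeholders for illustration, not the classifier actually used in the paper.

```python
# Minimal sketch of FineWeb-edu-style quality filtering (illustrative only).
# The model ID, label name, and threshold are hypothetical placeholders,
# not the paper's actual classifier or cutoff.
from transformers import pipeline

scorer = pipeline("text-classification", model="opencsg/edu-quality-scorer")

def is_high_quality(text: str, threshold: float = 0.9) -> bool:
    """Keep a document only if the classifier confidently rates it educational."""
    # Crude character-level truncation; truncation=True also clips at the
    # token level inside the pipeline's tokenizer.
    result = scorer(text[:512], truncation=True)[0]
    return result["label"] == "educational" and result["score"] >= threshold

web_docs = [
    "本文系统介绍线性代数中矩阵分解的基本方法及其应用。",  # educational article
    "限时抢购!全场五折,点击链接领取优惠券!",  # promotional spam
]
high_quality = [doc for doc in web_docs if is_high_quality(doc)]
```

In the FineWeb-edu recipe this kind of scorer is applied at corpus scale, so the threshold trades off corpus size against average quality; the paper's reported gains on benchmarks such as C-Eval come from training on the retained high-scoring subset.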
Community
Librarian Bot (automated): The following similar papers were recommended by the Semantic Scholar API.
- Measuring Bias of Web-filtered Text Datasets and Bias Propagation Through Training (2024)
- RedPajama: an Open Dataset for Training Large Language Models (2024)
- Language Models as Continuous Self-Evolving Data Engineers (2024)
- Training and Evaluating Language Models with Template-based Data Generation (2024)
- ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study (2024)
- FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web (2024)
- Training Bilingual LMs with Data Constraints in the Targeted Language (2024)