Traditional Chinese LLM Corpus
Traditional Chinese corpus collection for LLM training (pre-training, instruction-tuning, and RLHF/alignment).
Viewer • Updated • 1.78M • 76 • 12
Note: Contains ~2B tokens from high-quality corpora. Cleaned and deduplicated.
liswei/wikipedia-zhtw-dedup
Viewer • Updated • 1.18M • 54 • 1
Note: Deduplicated version of erhwenkuo/wikipedia-zhtw, using MinHash.
liswei/c4-zhtw
Viewer • Updated • 4.86M • 109 • 1
Note: Deduplicated C4 subset of zhTW. C4 is the Colossal Clean Crawled Corpus, a cleaned version of Common Crawl.
liswei/common-crawl-zhtw
Viewer • Updated • 2.71M • 103 • 3
Note: Deduplicated CC subset of zhTW.
zetavg/CC-100-zh-Hant-merged
Viewer • Updated • 12.3M • 120 • 3
Note: Zh-tw subset of the CC-100 dataset, which is derived from Common Crawl. CC harms performance, as shown in TaiwanLlama.
liswei/coct-en-zhtw-dedup
Viewer • Updated • 217k • 47 • 1
Note: Deduplicated version of zetavg/coct-en-zh-tw-translations-twp-300k. Zh-tw <-> en paired articles provided by Taiwan Panorama (台灣光華雜誌).
liswei/PromptPair-TW
Viewer • Updated • 119k • 37 • 2
Note: Traditional Chinese instruction dataset. Contains en <-> tw pairs with system prompts, to better adapt from English pre-trained models.
yentinglin/TaiwanChat
Viewer • Updated • 485k • 202 • 53
Note: Instruction dataset used to train TaiwanLLM v1. Find more details in the paper.
erhwenkuo/alpaca-data-gpt4-chinese-zhtw
Viewer • Updated • 52k • 57 • 6
Note: The alpaca-gpt4 dataset, translated from en to zh-tw.
zetavg/mlqa_en_zh_tw
Viewer • Updated • 3.29k • 41 • 7
Note: zhcn/en multilingual QA translated to zhtw/en. An internal experiment shows that, when transferring from an English base model, training on Q:en->A:zh pairs (or vice versa) improves SFT performance.
zetavg/ShareGPT-Processed
Viewer • Updated • 90.7k • 78 • 29
Note: The RyokoAI/ShareGPT52K dataset, converted to Markdown and labeled with the language used.
benchang1110/PTT_QA
Updated • 7 • 1
lchakkei/OpenOrca-Traditional-Chinese
Viewer • Updated • 4.23M • 459 • 8
Note: Instruction data translated from English with Google Translate.
Heng666/Traditional_Chinese-aya_dataset
Viewer • Updated • 4.91k • 143 • 3
Heng666/Traditional_Chinese-aya_evaluation_suite
Viewer • Updated • 650 • 59 • 3
ChenWeiLi/Med_Breexe_zhtw
Viewer • Updated • 1.6k • 33 • 4
Note: Instruction dataset in the medical domain. Prompts are translated, then fed to the Breexe model.
Tarklanse/Traditional_Chinese_roleplay_chat_Dataset
Viewer • Updated • 9.51k • 78 • 38
DataAgent/Pretrain-Taiwan-DentistKnowledge-zhTW-290K
Viewer • Updated • 147 • 39 • 2
KSmart/chinese_traditional_chengyu
Viewer • Updated • 111 • 32 • 3
Note: This is in Simplified Chinese.
liswei/rm-static-zhTW
Viewer • Updated • 81.4k • 36 • 30
Note: Preference dataset with chosen/rejected pairs. Translated using m2m100.
ZoneTwelve/ChineseGrammaticalErrorEvaluation
Viewer • Updated • 132 • 52
ZoneTwelve/micro_sft_instruct
Viewer • Updated • 10 • 65
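
Several entries above are described as deduplicated, and the note for liswei/wikipedia-zhtw-dedup names MinHash as the method. The collection does not document the actual pipeline, so the following is only a minimal sketch of the general technique using the Python standard library. The signature length, the character 3-gram shingling, the 0.8 threshold, and all function names are illustrative assumptions, not the settings used to build these datasets.

```python
import hashlib
import re

NUM_PERM = 64      # signature length (assumed value, not from the collection)
SHINGLE_SIZE = 3   # character n-gram length; works without word segmentation


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of character k-grams of the text, ignoring whitespace."""
    text = re.sub(r"\s+", "", text)
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text, num_perm=NUM_PERM):
    """For each salted hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


if __name__ == "__main__":
    docs = ["台北是台灣的首都。", "台北是台灣的首都。", "深度學習需要大量資料。"]
    print(deduplicate(docs))
```

This sketch compares every document against every kept signature, which is quadratic in the number of documents. At the scale of the corpora listed here, a practical pipeline would normally add locality-sensitive hashing (banding the signatures) so that only likely candidates are compared.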