zake7749/gemma-2-2b-it-chinese-kyara-dpo

zake7749 committed on Oct 13

Commit 84a6895 • 1 Parent(s): 1bcfbf0

Update README.md

Files changed (1)

README.md +2 -0


@@ -110,6 +110,8 @@ We have collected a total of 3.6M conversations, approximately 4.51 billion toke
 ### Dataset Construction
+The data construction for Kyara is divided into two parts: English and Chinese. For the English part, we incorporated multiple high-quality open-source datasets, such as `teknium/OpenHermes-2.5` and `arcee-ai/The-Tome`, and performed semantic deduplication to remove near-duplicate examples. The Chinese part follows the process outlined below:
 #### Base Dataset: Knowledge Injection with Retrieval Augmentation
 We developed a knowledge search system using open Chinese knowledge corpora, integrated with [Qdrant](https://qdrant.tech/). To construct Supervised Fine-Tuning (SFT) pairs, we followed this process: