This model takes llama3-8b as its base model and uses the BAAI/IndustryCorpus2 dataset for data matching and domain pre-training, producing a medical-domain pre-trained model with Chinese and English capabilities.
Training details
To gradually align the data distribution between pre-training and fine-tuning and minimize the loss of knowledge acquired during pre-training, we design a novel two-stage CPT strategy. This approach ensures a stable integration of medical knowledge into the LLM.
Stable CPT
To balance medical domain knowledge with general knowledge, we first implement a Stable CPT stage, which ensures the model maintains and enhances its general language understanding while concentrating on medical information. In this stage, we combine a high-quality medical pre-training corpus with general data at a 19:1 ratio, with a token-level distribution of 1:9 for Chinese:English.
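As a rough illustration of the Stable CPT mix described above, the sketch below splits a token budget by the stated ratios. The function name `split_by_ratio` and the 1,000,000-token budget are hypothetical, and it assumes the 19:1 ratio is medical:general (the order in which they are mentioned) and that both ratios apply to the overall token count; the card does not specify either point.

```python
def split_by_ratio(total_tokens: int, ratio: dict[str, int]) -> dict[str, int]:
    """Split a token budget across named parts according to integer ratio weights."""
    weight_sum = sum(ratio.values())
    shares = {name: total_tokens * w // weight_sum for name, w in ratio.items()}
    # Integer division can leave a remainder; give it to the largest part
    # so that the shares always sum to the original budget.
    remainder = total_tokens - sum(shares.values())
    largest = max(ratio, key=ratio.get)
    shares[largest] += remainder
    return shares


# Hypothetical budget, chosen only to make the arithmetic easy to read.
budget = 1_000_000

# Stable CPT: medical corpus vs. general data (assumed order medical:general = 19:1).
by_source = split_by_ratio(budget, {"medical": 19, "general": 1})

# Stable CPT: token-level Chinese:English = 1:9.
by_language = split_by_ratio(budget, {"zh": 1, "en": 9})

print(by_source)
print(by_language)
```

Under these assumptions, 95% of the Stable CPT tokens come from the medical corpus and 10% of the tokens are Chinese.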
Boost CPT
To integrate medical knowledge during the model pre-training phase and facilitate a smooth transition to domain-specific tasks, we then design a Boost CPT phase. In this phase, we combine a very high-quality medical pre-training corpus with open-source medical SFT data at a 1:1 ratio, with a token-level distribution of 4:6 for Chinese:English. Notably, throughout these two phases, we progressively increase the proportion of Chinese data.
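The two stages can be summarized as mixture configurations, which also makes the rising share of Chinese data explicit. This is a minimal sketch, not the authors' training code: `StageMix`, `share`, `sample_sources`, and the source labels are hypothetical names, and sampling whole examples by source weight is an assumption, since the card only states the ratios.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StageMix:
    name: str
    source_ratio: dict[str, int]    # data source -> mixing weight
    language_ratio: dict[str, int]  # language -> token-level weight


# Ratios taken from the stage descriptions; the labels are illustrative.
STABLE = StageMix("stable_cpt", {"medical_corpus": 19, "general": 1}, {"zh": 1, "en": 9})
BOOST = StageMix("boost_cpt", {"medical_corpus": 1, "medical_sft": 1}, {"zh": 4, "en": 6})


def share(ratio: dict[str, int], key: str) -> float:
    """Fraction of the mix assigned to `key`."""
    return ratio[key] / sum(ratio.values())


def sample_sources(mix: StageMix, n: int, seed: int = 0) -> list[str]:
    """Draw n source labels with probability proportional to the stage's source weights."""
    rng = random.Random(seed)
    names = list(mix.source_ratio)
    weights = [mix.source_ratio[k] for k in names]
    return rng.choices(names, weights=weights, k=n)


for mix in (STABLE, BOOST):
    print(mix.name, "Chinese share:", share(mix.language_ratio, "zh"))

# Example: pick the source for each of 10 training examples in the Boost stage.
print(sample_sources(BOOST, 10))
```

With the stated ratios, the Chinese token share rises from 1/10 in Stable CPT to 4/10 in Boost CPT.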
Model evaluation results
We evaluate our CPT model, CareBot, on seven common medical benchmarks. Because our goal is a medical model that performs well in both Chinese and English, we aim to improve Chinese medical ability while keeping any reduction in English medical ability small. On the English benchmarks (MMLU-Med, PubMedQA, MedQA, MedMCQA), the performance of CareBot (Stable CPT) and CareBot (Stable CPT & Boost CPT) decreases slightly. This is expected, given that the LLaMA-8B-base model already has strong English capabilities. On the Chinese benchmarks (C-Eval-Med, CMMLU-Med, CMB), however, our models improve significantly, with the most notable gains in the model trained using the two-stage approach. This indicates that our two-stage CPT strategy effectively integrates medical domain knowledge into the model, resulting in robust enhancements to its Chinese medical capabilities.
The detailed metrics are shown below.
Citation
@misc{
title={CareBot: A Pioneering Full-Process Open-Source Medical Language Model},
author={Lulu Zhao and Weihao Zeng and Xiaofeng Shi and Hua Zhou and Yonghua Lin},
year={2024},
eprint={},
archivePrefix={arXiv},
primaryClass={}
}