metadata
license: apache-2.0
datasets:
- Skywork/SkyPile-150B
- ticoAg/shibing624-medical-pretrain
- togethercomputer/RedPajama-Data-V2
- medalpaca/medical_meadow_wikidoc
- nlp-guild/medical-data
language:
- en
- zh
pipeline_tag: text-classification
fasttext-med-en-zh-identification
This model is an intermediate artifact of the EPCD (Easy-Data-Clean-Pipeline) project. It classifies samples from medical pretraining corpora as Chinese or English. The model is built with fastText.
Data Composition
General Chinese Pretraining Dataset
Medical Chinese Pretraining Dataset
General English Pretraining Dataset
Medical English Pretraining Datasets
All of the datasets above are high-quality and open source, which saves considerable data-cleaning effort. Many thanks to their developers for contributing to the open-source data community!
Data Cleaning Process
Initial dataset processing:
- For the Chinese training datasets, the pretraining corpus is split by \n, and any leading or trailing spaces are removed.
- For the English training datasets, the pretraining corpus is split by \n, all letters are converted to lowercase, and any leading or trailing spaces are removed.
Word count statistics:
Sample filtering based on word count (heuristic thresholds):
- For Chinese: Keep only samples with more than 5 words.
- For English: Keep only samples with more than 5 words.
Dataset splitting: 90% of the data is used for training and 10% for testing.
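The cleaning steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the card does not say how words were counted, so the sketch counts whitespace tokens for English and individual CJK characters for Chinese as a crude stand-in for a real word segmenter. The function names, the `en`/`zh` label names, and the fixed random seed are all hypothetical; only the `__label__` prefix is fastText's default convention for supervised training files.

```python
import random
import re

# Assumption: CJK characters are counted one by one as a rough stand-in for a
# real Chinese word segmenter; the card does not say how words were counted.
CJK = re.compile(r"[\u4e00-\u9fff]")


def normalize(line, lang):
    """Strip surrounding whitespace; lowercase English lines only."""
    line = line.strip()
    return line.lower() if lang == "en" else line


def word_count(line, lang):
    """Whitespace tokens for English, CJK character count for Chinese (assumed)."""
    if lang == "en":
        return len(line.split())
    return len(CJK.findall(line))


def clean_corpus(text, lang, min_words=5):
    """Split on newlines, normalize, and keep samples with more than min_words words."""
    samples = (normalize(line, lang) for line in text.split("\n"))
    return [s for s in samples if word_count(s, lang) > min_words]


def train_test_split(samples, train_ratio=0.9, seed=0):
    """Shuffle a copy of the samples, then split it (90/10 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def to_fasttext_line(sample, label):
    """Format one sample in fastText's supervised layout: '__label__<name> <text>'."""
    return f"__label__{label} {sample}"
```

Each cleaned sample would then be written to a training or test file, one `to_fasttext_line(...)` result per line, before training a supervised fastText model on it.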
Model Performance
| Dataset | Precision | Recall |
|---|---|---|
| Train | 0.9987 | 0.9987 |
| Test | 0.9962 | 0.9962 |
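Precision and recall are identical in each row, which is what one would expect if the figures are precision@1 and recall@1 on single-label data (for example, as returned by fastText's `model.test`, which yields the sample count, precision, and recall). That is an assumption about how the numbers were produced; the card does not say. The sketch below, with a hypothetical helper name, shows why the two values coincide when every sample has exactly one gold label and one prediction.

```python
def precision_recall_at_1(gold, predicted):
    """Return (precision@1, recall@1) for single-label data.

    gold and predicted are equal-length, non-empty lists of label strings,
    one entry per sample.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one sample is required")
    correct = sum(g == p for g, p in zip(gold, predicted))
    # Precision@1 divides by the number of predictions made (one per sample).
    precision = correct / len(predicted)
    # Recall@1 divides by the number of gold labels (also one per sample).
    recall = correct / len(gold)
    return precision, recall
```

Because both denominators equal the number of samples, both values reduce to plain accuracy in this setting.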
Usage Example
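import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    # Same normalization as the English training data: strip and lowercase.
    return text.strip().lower()

# Download the model file from the Hugging Face Hub and load it.
model_path = hf_hub_download(repo_id="ytzfhqs/fasttext-med-en-zh-identification", filename="model.bin")
model = fasttext.load_model(model_path)

# Returns a (labels, probabilities) pair for the input line.
model.predict(to_low('Hello, world!'))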
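When calling the model on raw corpus text, two small helpers can be convenient. fastText's `predict` handles one line at a time and rejects input containing a newline, so the first helper collapses all whitespace before lowercasing. The second unpacks the `(labels, probabilities)` pair that `predict` returns. Both helper names are hypothetical, and the `__label__en` value in the usage note is an assumed label name; check the labels the model actually emits.

```python
def prepare(text):
    """Collapse all whitespace (including newlines) to single spaces, then lowercase."""
    return " ".join(text.split()).lower()


def parse_prediction(prediction, prefix="__label__"):
    """Turn a (labels, probabilities) pair into (label_without_prefix, probability)."""
    labels, probs = prediction
    label = labels[0]
    # Strip fastText's default label prefix if it is present.
    if label.startswith(prefix):
        label = label[len(prefix):]
    return label, float(probs[0])
```

For example, `parse_prediction(model.predict(prepare(raw_text)))` would give a pair such as `("en", 0.99)` if the model's English label is `__label__en`.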
fasttext-med-en-zh-identification
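This model is an intermediate product of the EPCD (Easy-Data-Clean-Pipeline) project. It is mainly used to distinguish Chinese samples from English samples in medical pretraining corpora. The model framework is fastText.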
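Data Composition
General Chinese Pretraining Dataset
Medical Chinese Pretraining Dataset
General English Pretraining Dataset
Medical English Pretraining Dataset
All of the datasets above are high-quality open-source datasets, which saves a great deal of data-cleaning work. Thanks to their developers for supporting the open-source data community!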
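Data Cleaning Process
- Initial dataset preparation
  - For the Chinese training datasets, split the pretraining corpus by \n and remove any leading or trailing spaces.
  - For the English training datasets, split the pretraining corpus by \n, convert all letters to lowercase, and remove any leading or trailing spaces.
- Count the number of words, specifically:
- Filter samples by word count, specifically (empirical thresholds):
  - For Chinese: keep only samples with more than 5 words.
  - For English: keep only samples with more than 5 words.
- Split the dataset, with a training ratio of 0.9 and a test ratio of 0.1.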
Model Performance
| Dataset | Accuracy |
|---|---|
| Train | 0.9994 |
| Test | 0.9998 |