---
license: apache-2.0
datasets:
- Skywork/SkyPile-150B
- ticoAg/shibing624-medical-pretrain
- togethercomputer/RedPajama-Data-V2
- medalpaca/medical_meadow_wikidoc
- nlp-guild/medical-data
language:
- en
- zh
pipeline_tag: text-classification
---

# fasttext-med-en-zh-identification

[[中文]](#chinese) [[English]](#english)

<a id="english"></a>

This model is an intermediate output of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project. It is designed to accurately distinguish Chinese from English samples in medical pretraining datasets. The classifier is built with [fastText](https://github.com/facebookresearch/fastText).

## Data Composition

### General Chinese Pretraining Dataset

- [Skywork/SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B)

### Medical Chinese Pretraining Dataset

- [ticoAg/shibing624-medical-pretrain](https://huggingface.co/datasets/ticoAg/shibing624-medical-pretrain)

### General English Pretraining Dataset

- [togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)

### Medical English Pretraining Datasets

- [medalpaca/medical_meadow_wikidoc](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc)
- [nlp-guild/medical-data](https://huggingface.co/datasets/nlp-guild/medical-data)

All of the above are high-quality open-source datasets, which saved a great deal of data-cleaning work. Many thanks to their developers for contributing to the open-source data community!
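
For convenience, these corpora can be pulled straight from the Hugging Face Hub with the `datasets` library. The sketch below is illustrative rather than part of the EPCD pipeline: the `train` split name and the `text` field are assumptions to verify against each dataset card, and streaming avoids downloading all of SkyPile-150B.

```python
from datasets import load_dataset

# Stream the large general corpus instead of downloading it in full.
# NOTE: the "train" split and the "text" field are assumptions; check
# each dataset card before relying on them.
skypile = load_dataset("Skywork/SkyPile-150B", split="train", streaming=True)
wikidoc = load_dataset("medalpaca/medical_meadow_wikidoc", split="train")

# Peek at a few streamed samples.
for example in skypile.take(3):
    print(example["text"][:80])
```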

## Data Cleaning Process

- Initial dataset processing:
  - For the Chinese datasets, split each pretraining document on `\n` and strip leading/trailing whitespace from every line.
  - For the English datasets, split each pretraining document on `\n`, lowercase all letters, and strip leading/trailing whitespace.
- Word count statistics:
  - For Chinese, tokenize with the [jieba](https://github.com/fxsjy/jieba) package, then filter stopwords and non-Chinese characters with [jionlp](https://github.com/dongrixinyu/JioNLP).
  - For English, tokenize with the [nltk](https://github.com/nltk/nltk) package and filter with its built-in stopword list.
- Sample filtering by word count (heuristic thresholds):
  - For Chinese: keep only samples with more than 5 words.
  - For English: keep only samples with more than 5 words.
- Dataset splitting: 90% of the data is used for training and 10% for testing (see the sketch following this list).
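
Below is a minimal sketch of this pipeline under stated assumptions: the file names (`zh_corpus.txt`, `en_corpus.txt`, `train.txt`, `test.txt`), the label strings (`__label__zh`, `__label__en`), and the training hyperparameters are illustrative rather than the project's actual settings, and the jionlp stopword/non-Chinese-character filter is elided.

```python
import random

import fasttext
import jieba
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires nltk.download("stopwords") and nltk.download("punkt") once.
EN_STOPWORDS = set(stopwords.words("english"))


def prepare_zh(raw_text: str) -> list[str]:
    """Split a Chinese document on newlines; keep lines with > 5 tokens."""
    samples = []
    for line in raw_text.split("\n"):
        line = line.strip()
        # jieba tokenization; the original pipeline additionally filters
        # stopwords and non-Chinese characters with jionlp.
        words = [w for w in jieba.lcut(line) if w.strip()]
        if len(words) > 5:
            samples.append(f"__label__zh {line}")
    return samples


def prepare_en(raw_text: str) -> list[str]:
    """Split an English document on newlines, lowercase; keep lines with > 5 non-stopword tokens."""
    samples = []
    for line in raw_text.split("\n"):
        line = line.strip().lower()
        words = [w for w in word_tokenize(line) if w not in EN_STOPWORDS]
        if len(words) > 5:
            samples.append(f"__label__en {line}")
    return samples


# Assemble both languages, shuffle, and split 90/10 into fastText-format files.
samples = prepare_zh(open("zh_corpus.txt").read()) + prepare_en(open("en_corpus.txt").read())
random.shuffle(samples)
cut = int(0.9 * len(samples))
with open("train.txt", "w") as f:
    f.write("\n".join(samples[:cut]))
with open("test.txt", "w") as f:
    f.write("\n".join(samples[cut:]))

# Train a supervised fastText classifier (hyperparameters are illustrative).
model = fasttext.train_supervised("train.txt", epoch=5, lr=0.5)
model.save_model("model.bin")
```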

## Model Performance

| Dataset | Precision | Recall |
|---------|-----------|--------|
| Train   | 0.9987    | 0.9987 |
| Test    | 0.9962    | 0.9962 |
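
Identical precision and recall are expected here: fastText's `test` routine reports precision@1 and recall@1, which coincide when every sample carries exactly one label. A minimal evaluation call (the `test.txt` path is an assumption):

```python
import fasttext

model = fasttext.load_model("model.bin")
# test() returns (sample count, precision@1, recall@1); with one label
# per sample, precision@1 equals recall@1.
n, p_at_1, r_at_1 = model.test("test.txt")
print(n, p_at_1, r_at_1)
```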

## Usage Example

```python
import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    # English training text was lowercased and stripped, so inputs are
    # normalized the same way (lowercasing is a no-op for Chinese text).
    return text.strip().lower()

# Download the trained classifier from the Hugging Face Hub.
model_path = hf_hub_download(
    repo_id="ytzfhqs/fasttext-med-en-zh-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)
model.predict(to_low('Hello, world!'))
```
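
`predict` returns a tuple of labels and their probabilities, along the lines of `(('__label__en',), array([0.99]))`; the exact label strings depend on how the model was trained, and `model.labels` lists the ones this model actually uses.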

# fasttext-med-en-zh-identification

[[中文]](#chinese) [[English]](#english)

<a id="chinese"></a>

This model is an intermediate output of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project, built mainly to distinguish Chinese from English samples in medical pretraining corpora. The classifier is implemented with [fastText](https://github.com/facebookresearch/fastText).

## Data Composition

### General Chinese Pretraining Dataset

- [Skywork/SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B)

### Medical Chinese Pretraining Dataset

- [ticoAg/shibing624-medical-pretrain](https://huggingface.co/datasets/ticoAg/shibing624-medical-pretrain)

### General English Pretraining Dataset

- [togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)

### Medical English Pretraining Datasets

- [medalpaca/medical_meadow_wikidoc](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc)
- [nlp-guild/medical-data](https://huggingface.co/datasets/nlp-guild/medical-data)

All of the above are high-quality open-source datasets that save a great deal of data-cleaning work. Many thanks to their developers for supporting the open-source data community!

## Data Cleaning Process

- Initial dataset processing:
  - For the Chinese datasets, split each pretraining document on `\n` and strip leading/trailing whitespace from every line.
  - For the English datasets, split each pretraining document on `\n`, lowercase all letters, and strip leading/trailing whitespace.
- Word count statistics:
  - For Chinese, tokenize with the [jieba](https://github.com/fxsjy/jieba) package, then filter stopwords and non-Chinese characters with [jionlp](https://github.com/dongrixinyu/JioNLP).
  - For English, tokenize with the [nltk](https://github.com/nltk/nltk) package and filter with its built-in stopword list.
- Sample filtering by word count (empirical thresholds):
  - For Chinese: keep only samples with more than 5 words.
  - For English: keep only samples with more than 5 words.
- Dataset splitting: 90% of the data is used for training and 10% for testing.

## Model Performance

| Dataset | Accuracy |
|---------|----------|
| Train   | 0.9994   |
| Test    | 0.9998   |

## Usage Example

```python
import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    # English training text was lowercased and stripped, so inputs are
    # normalized the same way (lowercasing is a no-op for Chinese text).
    return text.strip().lower()

# Download the trained classifier from the Hugging Face Hub.
model_path = hf_hub_download(
    repo_id="ytzfhqs/fasttext-med-en-zh-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)
model.predict(to_low('Hello, world!'))
```