---
license: apache-2.0
datasets:
- Skywork/SkyPile-150B
- ticoAg/shibing624-medical-pretrain
- togethercomputer/RedPajama-Data-V2
- medalpaca/medical_meadow_wikidoc
- nlp-guild/medical-data
language:
- en
- zh
pipeline_tag: text-classification
---

# fasttext-med-en-zh-identification

[[中文]](#chinese) [[English]](#english)

<a id="english"></a>

This model is an intermediate output of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project. It is designed to accurately distinguish Chinese from English samples in medical pretraining datasets. The classifier is built with [fastText](https://github.com/facebookresearch/fastText).

## Data Composition

### General Chinese Pretraining Dataset

- [Skywork/SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B)

### Medical Chinese Pretraining Dataset

- [ticoAg/shibing624-medical-pretrain](https://huggingface.co/datasets/ticoAg/shibing624-medical-pretrain)

### General English Pretraining Dataset

- [togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)

### Medical English Pretraining Datasets

- [medalpaca/medical_meadow_wikidoc](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc)
- [nlp-guild/medical-data](https://huggingface.co/datasets/nlp-guild/medical-data)

All of the above are high-quality open-source datasets, which saved a great deal of data-cleaning work. Many thanks to their developers for contributing to the open-source data community!
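
For convenience, these corpora can be pulled straight from the Hugging Face Hub with the `datasets` library. The sketch below is illustrative rather than part of the EPCD pipeline: the `train` split name and the `text` field are assumptions to verify against each dataset card, and streaming avoids downloading all of SkyPile-150B.

```python
from datasets import load_dataset

# Stream the large general corpus instead of downloading it in full.
# NOTE: the "train" split and the "text" field are assumptions; check
# each dataset card before relying on them.
skypile = load_dataset("Skywork/SkyPile-150B", split="train", streaming=True)
wikidoc = load_dataset("medalpaca/medical_meadow_wikidoc", split="train")

# Peek at a few streamed samples.
for example in skypile.take(3):
    print(example["text"][:80])
```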

## Data Cleaning Process

- Initial dataset processing:
  - For the Chinese datasets, split each pretraining document on `\n` and strip leading/trailing whitespace from every line.
  - For the English datasets, split each pretraining document on `\n`, lowercase all letters, and strip leading/trailing whitespace.
- Word count statistics:
  - For Chinese, tokenize with the [jieba](https://github.com/fxsjy/jieba) package, then filter stopwords and non-Chinese characters with [jionlp](https://github.com/dongrixinyu/JioNLP).
  - For English, tokenize with the [nltk](https://github.com/nltk/nltk) package and filter with its built-in stopword list.
- Sample filtering by word count (heuristic thresholds):
  - For Chinese: keep only samples with more than 5 words.
  - For English: keep only samples with more than 5 words.
- Dataset splitting: 90% of the data is used for training and 10% for testing (see the sketch following this list).
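
Below is a minimal sketch of this pipeline under stated assumptions: the file names (`zh_corpus.txt`, `en_corpus.txt`, `train.txt`, `test.txt`), the label strings (`__label__zh`, `__label__en`), and the training hyperparameters are illustrative rather than the project's actual settings, and the jionlp stopword/non-Chinese-character filter is elided.

```python
import random

import fasttext
import jieba
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires nltk.download("stopwords") and nltk.download("punkt") once.
EN_STOPWORDS = set(stopwords.words("english"))


def prepare_zh(raw_text: str) -> list[str]:
    """Split a Chinese document on newlines; keep lines with > 5 tokens."""
    samples = []
    for line in raw_text.split("\n"):
        line = line.strip()
        # jieba tokenization; the original pipeline additionally filters
        # stopwords and non-Chinese characters with jionlp.
        words = [w for w in jieba.lcut(line) if w.strip()]
        if len(words) > 5:
            samples.append(f"__label__zh {line}")
    return samples


def prepare_en(raw_text: str) -> list[str]:
    """Split an English document on newlines, lowercase; keep lines with > 5 non-stopword tokens."""
    samples = []
    for line in raw_text.split("\n"):
        line = line.strip().lower()
        words = [w for w in word_tokenize(line) if w not in EN_STOPWORDS]
        if len(words) > 5:
            samples.append(f"__label__en {line}")
    return samples


# Assemble both languages, shuffle, and split 90/10 into fastText-format files.
samples = prepare_zh(open("zh_corpus.txt").read()) + prepare_en(open("en_corpus.txt").read())
random.shuffle(samples)
cut = int(0.9 * len(samples))
with open("train.txt", "w") as f:
    f.write("\n".join(samples[:cut]))
with open("test.txt", "w") as f:
    f.write("\n".join(samples[cut:]))

# Train a supervised fastText classifier (hyperparameters are illustrative).
model = fasttext.train_supervised("train.txt", epoch=5, lr=0.5)
model.save_model("model.bin")
```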

## Model Performance

| Dataset | Precision | Recall |
|---------|-----------|--------|
| Train   | 0.9987    | 0.9987 |
| Test    | 0.9962    | 0.9962 |
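
Identical precision and recall are expected here: fastText's `test` routine reports precision@1 and recall@1, which coincide when every sample carries exactly one label. A minimal evaluation call (the `test.txt` path is an assumption):

```python
import fasttext

model = fasttext.load_model("model.bin")
# test() returns (sample count, precision@1, recall@1); with one label
# per sample, precision@1 equals recall@1.
n, p_at_1, r_at_1 = model.test("test.txt")
print(n, p_at_1, r_at_1)
```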

## Usage Example

```python
import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    # English training text was lowercased and stripped, so inputs are
    # normalized the same way (lowercasing is a no-op for Chinese text).
    return text.strip().lower()

# Download the trained classifier from the Hugging Face Hub.
model_path = hf_hub_download(
    repo_id="ytzfhqs/fasttext-med-en-zh-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)
model.predict(to_low('Hello, world!'))
```
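
`predict` returns a tuple of labels and their probabilities, along the lines of `(('__label__en',), array([0.99]))`; the exact label strings depend on how the model was trained, and `model.labels` lists the ones this model actually uses.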

# fasttext-med-en-zh-identification

[[中文]](#chinese) [[English]](#english)

<a id="chinese"></a>

This model is an intermediate output of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project, built mainly to distinguish Chinese from English samples in medical pretraining corpora. The classifier is implemented with [fastText](https://github.com/facebookresearch/fastText).

## Data Composition

### General Chinese Pretraining Dataset

- [Skywork/SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B)

### Medical Chinese Pretraining Dataset

- [ticoAg/shibing624-medical-pretrain](https://huggingface.co/datasets/ticoAg/shibing624-medical-pretrain)

### General English Pretraining Dataset

- [togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)

### Medical English Pretraining Datasets

- [medalpaca/medical_meadow_wikidoc](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc)
- [nlp-guild/medical-data](https://huggingface.co/datasets/nlp-guild/medical-data)

All of the above are high-quality open-source datasets that save a great deal of data-cleaning work. Many thanks to their developers for supporting the open-source data community!

## Data Cleaning Process

- Initial dataset processing:
  - For the Chinese datasets, split each pretraining document on `\n` and strip leading/trailing whitespace from every line.
  - For the English datasets, split each pretraining document on `\n`, lowercase all letters, and strip leading/trailing whitespace.
- Word count statistics:
  - For Chinese, tokenize with the [jieba](https://github.com/fxsjy/jieba) package, then filter stopwords and non-Chinese characters with [jionlp](https://github.com/dongrixinyu/JioNLP).
  - For English, tokenize with the [nltk](https://github.com/nltk/nltk) package and filter with its built-in stopword list.
- Sample filtering by word count (empirical thresholds):
  - For Chinese: keep only samples with more than 5 words.
  - For English: keep only samples with more than 5 words.
- Dataset splitting: 90% of the data is used for training and 10% for testing.

## Model Performance

| Dataset | Accuracy |
|---------|----------|
| Train   | 0.9994   |
| Test    | 0.9998   |

## Usage Example

```python
import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    # English training text was lowercased and stripped, so inputs are
    # normalized the same way (lowercasing is a no-op for Chinese text).
    return text.strip().lower()

# Download the trained classifier from the Hugging Face Hub.
model_path = hf_hub_download(
    repo_id="ytzfhqs/fasttext-med-en-zh-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)
model.predict(to_low('Hello, world!'))
```