---
license: apache-2.0
datasets:
  - Skywork/SkyPile-150B
  - ticoAg/shibing624-medical-pretrain
  - togethercomputer/RedPajama-Data-V2
  - medalpaca/medical_meadow_wikidoc
  - nlp-guild/medical-data
language:
  - en
  - zh
pipeline_tag: text-classification
---

# fasttext-med-en-zh-identification


This model is an intermediate artifact of the EPCD (Easy-Data-Clean-Pipeline) project. It is designed to accurately distinguish between Chinese and English samples in medical pretraining datasets, and is built with fastText.

## Data Composition

- General Chinese pretraining dataset
- Medical Chinese pretraining dataset
- General English pretraining dataset
- Medical English pretraining datasets
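The concrete sources are the dataset IDs listed in this card's metadata. Below is a minimal, purely illustrative sketch of streaming one of them with the `datasets` library; the `text` field name is an assumption and may differ per dataset.

```python
from datasets import load_dataset

# Stream one of the source corpora named in the metadata (illustrative only).
ds = load_dataset("Skywork/SkyPile-150B", split="train", streaming=True)

for row in ds:
    # The "text" field name is an assumption; column names vary per dataset.
    print(row["text"][:80])
    break
```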

All of the above are high-quality open-source datasets, which saves a great deal of data-cleaning effort. Many thanks to their developers for their contributions to the open-source data community!

## Data Cleaning Process

- Initial dataset processing:
  - For the Chinese datasets, the pretraining corpus is split on `\n` and leading/trailing whitespace is stripped.
  - For the English datasets, the pretraining corpus is split on `\n`, all letters are lowercased, and leading/trailing whitespace is stripped.
- Word-count statistics:
  - For Chinese, the `jieba` package is used for tokenization, and stopwords and non-Chinese characters are further filtered with `jionlp`.
  - For English, the `nltk` package is used for tokenization, with its built-in stopword list used for filtering.
- Sample filtering by word count (heuristic thresholds):
  - For Chinese: keep only samples with more than 5 words.
  - For English: keep only samples with more than 5 words.
- Dataset splitting: 90% of the data is used for training and 10% for testing. A sketch of the whole pipeline follows this list.
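The following is a minimal sketch of that pipeline, not the project's actual code: it substitutes a simple CJK regex for `jionlp`'s stopword and non-Chinese filtering, and it emits samples in fastText's `__label__` training format.

```python
import random
import re

import jieba
import nltk
from nltk.corpus import stopwords

nltk.download("punkt")      # tokenizer data for nltk.word_tokenize
nltk.download("stopwords")  # English stopword list

EN_STOPWORDS = set(stopwords.words("english"))
CJK_RE = re.compile(r"[\u4e00-\u9fff]")  # stand-in for jionlp's filtering

def zh_word_count(line: str) -> int:
    # Tokenize with jieba; keep only tokens containing Chinese characters.
    return sum(1 for tok in jieba.lcut(line) if CJK_RE.search(tok))

def en_word_count(line: str) -> int:
    # Tokenize with nltk; drop stopwords and non-alphabetic tokens.
    return sum(1 for tok in nltk.word_tokenize(line)
               if tok.isalpha() and tok not in EN_STOPWORDS)

def clean(corpus: str, lang: str) -> list[str]:
    samples = []
    for line in corpus.split("\n"):           # split on \n
        line = line.strip()                   # strip leading/trailing spaces
        if lang == "en":
            line = line.lower()               # lowercase English only
        words = zh_word_count(line) if lang == "zh" else en_word_count(line)
        if words > 5:                         # heuristic threshold from above
            samples.append(f"__label__{lang} {line}")
    return samples

def split_train_test(samples: list[str], train_ratio: float = 0.9):
    random.shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]       # 90% train, 10% test
```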

## Model Performance

| Dataset | Precision | Recall |
|---------|-----------|--------|
| Train   | 0.9987    | 0.9987 |
| Test    | 0.9962    | 0.9962 |

| Dataset | Accuracy |
|---------|----------|
| Train   | 0.9994   |
| Test    | 0.9998   |
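Figures like these come straight out of fastText's own evaluation. A minimal training-and-evaluation sketch, assuming `train.txt` and `test.txt` hold one labeled sample per line as produced above (the card does not specify hyperparameters):

```python
import fasttext

# Train a supervised classifier on the labeled samples.
model = fasttext.train_supervised(input="train.txt")

# model.test returns (sample count, precision@1, recall@1).
n, precision, recall = model.test("test.txt")
print(f"samples={n}  P@1={precision:.4f}  R@1={recall:.4f}")

model.save_model("model.bin")
```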

## Usage Example

```python
import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    # Mirror the training preprocessing: strip whitespace and lowercase.
    return text.strip().lower()

# Download the trained classifier from the Hugging Face Hub.
model_path = hf_hub_download(
    repo_id="ytzfhqs/fasttext-med-en-zh-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)

# Returns (labels, probabilities); labels carry fastText's __label__ prefix.
print(model.predict(to_low("Hello, world!")))
```
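Note that fastText's `predict` handles one line of text at a time, so the input must not contain newline characters. The `to_low` helper mirrors the lowercasing applied to the English training data; `lower()` leaves Chinese text unchanged.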
