---
license: apache-2.0
datasets:
- Skywork/SkyPile-150B
- ticoAg/shibing624-medical-pretrain
- togethercomputer/RedPajama-Data-V2
- medalpaca/medical_meadow_wikidoc
- nlp-guild/medical-data
language:
- en
- zh
pipeline_tag: text-classification
---
# fasttext-med-en-zh-identification

[[中文]](#chinese)    [[English]](#english)

<a id="english"></a>

This model is an intermediate result of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project. It is primarily designed to accurately distinguish between Chinese and English samples in medical pretraining datasets. The model framework uses [fastText](https://github.com/facebookresearch/fastText).

## Data Composition

### General Chinese Pretraining Dataset
- [Skywork/SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B)

### Medical Chinese Pretraining Dataset
- [ticoAg/shibing624-medical-pretrain](https://huggingface.co/datasets/ticoAg/shibing624-medical-pretrain)

### General English Pretraining Dataset
- [togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)

### Medical English Pretraining Datasets
- [medalpaca/medical_meadow_wikidoc](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc)
- [nlp-guild/medical-data](https://huggingface.co/datasets/nlp-guild/medical-data)

The datasets above are all high-quality open-source datasets, which saves a great deal of data-cleaning effort. Many thanks to their developers for contributing to the open-source data community!

## Data Cleaning Process

- Initial dataset processing:
  - For the Chinese training datasets, the pretraining corpus is split by `\n`, and any leading or trailing spaces are removed.
  - For the English training datasets, the pretraining corpus is split by `\n`, all letters are converted to lowercase, and any leading or trailing spaces are removed.
  
- Word count statistics:
  - For Chinese, the [jieba](https://github.com/fxsjy/jieba) package is used for tokenization, and stopwords and non-Chinese characters are further filtered using [jionlp](https://github.com/dongrixinyu/JioNLP).
  - For English, the [nltk](https://github.com/nltk/nltk) package is used for tokenization, with built-in stopwords for filtering.

- Sample filtering based on word count (heuristic thresholds):
  - For Chinese: Keep only samples with more than 5 words.
  - For English: Keep only samples with more than 5 words.

- Dataset splitting: 90% of the data is used for training and 10% for testing.
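
A minimal sketch of this pipeline is shown below, assuming `jieba` and `nltk` are installed and NLTK's `punkt` and `stopwords` data have been downloaded. The regex filter for non-Chinese tokens is a simplified stand-in for the jionlp-based filtering used in the actual project, and the helper names (`zh_word_count`, `build_samples`, and so on) are illustrative rather than part of EDCP.

```python
import random
import re

import jieba
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

EN_STOPWORDS = set(stopwords.words("english"))  # requires nltk.download("stopwords")


def zh_word_count(line: str) -> int:
    # Tokenize with jieba; keep only tokens made of Chinese characters
    # (a simplified stand-in for the jionlp stopword/character filtering).
    tokens = jieba.lcut(line.strip())
    return sum(1 for t in tokens if re.fullmatch(r"[\u4e00-\u9fff]+", t))


def en_word_count(line: str) -> int:
    # Lowercase, tokenize with nltk, and drop built-in English stopwords.
    tokens = word_tokenize(line.strip().lower())
    return sum(1 for t in tokens if t.isalpha() and t not in EN_STOPWORDS)


def normalize(line: str, lang: str) -> str:
    # Chinese lines are only stripped; English lines are also lowercased.
    line = line.strip()
    return line if lang == "zh" else line.lower()


def build_samples(corpus: str, lang: str, min_words: int = 5) -> list[str]:
    # Split the pretraining corpus by '\n' and keep lines with more than
    # `min_words` content words.
    count = zh_word_count if lang == "zh" else en_word_count
    lines = [normalize(line, lang) for line in corpus.split("\n")]
    return [line for line in lines if line and count(line) > min_words]


def split_train_test(samples: list[str], train_ratio: float = 0.9):
    # 90% of the samples go to the training set, 10% to the test set.
    random.shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]
```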

## Model Performance

| Dataset | Precision | Recall |
|---------|----------|----------|
|  Train  |  0.9987  |  0.9987  |
|  Test   |  0.9962  |  0.9962  |
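
Figures like these can be reproduced with fastText's built-in evaluation: `model.test()` returns the number of samples together with precision@1 and recall@1, which coincide when every sample carries exactly one label. A minimal sketch, assuming training and test files already exist in fastText's `__label__` format (the file names and default hyperparameters are placeholders):

```python
import fasttext

# Train a supervised classifier on data in fastText format,
# i.e. one sample per line prefixed with its label (e.g. __label__en).
model = fasttext.train_supervised(input="train.txt")

# model.test() returns (number of samples, precision@1, recall@1).
n, precision, recall = model.test("test.txt")
print(f"samples={n}  precision={precision:.4f}  recall={recall:.4f}")
```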

## Usage Example
```python
import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    # Mirror the training-time preprocessing: strip surrounding
    # whitespace and lowercase the text.
    return text.strip().lower()

# Download the trained classifier from the Hugging Face Hub and load it.
model_path = hf_hub_download(repo_id="ytzfhqs/fasttext-med-en-zh-identification", filename="model.bin")
model = fasttext.load_model(model_path)

# predict() returns the predicted label(s) together with their probabilities.
model.predict(to_low('Hello, world!'))
```

# fasttext-med-en-zh-identification

[[中文]](#chinese)    [[English]](#english)

<a id="chinese"></a>

This model is an intermediate result of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project, primarily used to distinguish Chinese and English samples in medical pretraining corpora. The model framework uses [fastText](https://github.com/facebookresearch/fastText).

## Data Composition

### General Chinese Pretraining Dataset
- [Skywork/SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B)

### Medical Chinese Pretraining Dataset
- [ticoAg/shibing624-medical-pretrain](https://huggingface.co/datasets/ticoAg/shibing624-medical-pretrain)

### General English Pretraining Dataset
- [togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)

### Medical English Pretraining Datasets
- [medalpaca/medical_meadow_wikidoc](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc)
- [nlp-guild/medical-data](https://huggingface.co/datasets/nlp-guild/medical-data)

The datasets above are all high-quality open-source datasets, which saves a great deal of data-cleaning effort. Many thanks to their developers for contributing to the open-source data community!

## Data Cleaning Process

- Initial dataset processing:
  - For the Chinese training datasets, the pretraining corpus is split by `\n` and any leading or trailing spaces are removed.
  - For the English training datasets, the pretraining corpus is split by `\n`, all letters are converted to lowercase, and any leading or trailing spaces are removed.
- Word count statistics:
  - For Chinese, the [jieba](https://github.com/fxsjy/jieba) package is used for tokenization, and stopwords and non-Chinese characters are further filtered using [jionlp](https://github.com/dongrixinyu/JioNLP).
  - For English, the [nltk](https://github.com/nltk/nltk) package is used for tokenization, with built-in stopwords for filtering.
- Sample filtering based on word count (heuristic thresholds):
  - For Chinese: keep only samples with more than 5 words.
  - For English: keep only samples with more than 5 words.
- Dataset splitting: 90% of the data is used for training and 10% for testing.

## Model Performance

| Dataset | Accuracy |
|---------|----------|
|  Train  |  0.9994  |
|  Test   |  0.9998  |

## Usage Example
```python
import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    # Mirror the training-time preprocessing: strip surrounding
    # whitespace and lowercase the text.
    return text.strip().lower()

# Download the trained classifier from the Hugging Face Hub and load it.
model_path = hf_hub_download(repo_id="ytzfhqs/fasttext-med-en-zh-identification", filename="model.bin")
model = fasttext.load_model(model_path)

# predict() returns the predicted label(s) together with their probabilities.
model.predict(to_low('Hello, world!'))
```