---
license: apache-2.0
language:
- zh
metrics:
- accuracy
- precision
base_model:
- Qwen/Qwen2.5-0.5B
---
<a id="english"></a>
# Qwen2.5-med-book-main-classification
[[中文]](#chinese) [[English]](#english)
This model is an intermediate product of the [EDCP (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project. It separates main content from non-content (book introductions, publisher information, writing standards, revision notes) in **medical textbooks** that have been OCR-processed with [MinerU](https://github.com/opendatalab/MinerU). The base model is [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B), which avoids the input-length limit of the BERT tokenizer while providing higher accuracy.
# Data Composition
- The data comes from scanned textbook PDFs, converted to `Markdown` files by `OCR` with [MinerU](https://github.com/opendatalab/MinerU). After simple regex-based cleaning, the text was split into samples on `\n`, and a `Bloom` filter was used to remove duplicate samples, leaving 50,000 samples. For legal reasons, we may not release the dataset publicly.
- Because of the nature of textbooks, most samples are main content. In our dataset, 79.89% (about 40,000) of the samples are main content and 20.13% (about 10,000) are non-content. Given this imbalance, we evaluate the model on the test set with both Precision and Accuracy.
- To keep the label distribution of the test set consistent with that of the training set, we used stratified sampling to hold out 10% of the data as the test set.
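
The preparation steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `BloomFilter` class, its bit-array size and hash count, and the helper names `split_and_dedup` and `stratified_split` are all assumptions made for the example.

```python
import hashlib
import random
from collections import defaultdict


class BloomFilter:
    """Tiny Bloom filter; the sizes are illustrative, not the project's settings."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def split_and_dedup(markdown_text):
    """Split on newlines and drop empty or previously seen lines."""
    seen = BloomFilter()
    samples = []
    for line in markdown_text.split("\n"):
        line = line.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        samples.append(line)
    return samples


def stratified_split(samples, labels, test_ratio=0.1, seed=0):
    """Hold out test_ratio of each class so both sets share the label mix."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test += [(s, label) for s in group[:n_test]]
        train += [(s, label) for s in group[n_test:]]
    return train, test
```

A Bloom filter can report a small number of false positives, so a few unique lines may be dropped as "duplicates"; a larger bit array lowers that rate at the cost of memory.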
# Training Techniques
- To maximize model accuracy, we tuned hyperparameters with Bayesian optimization (the TPE algorithm, Tree-structured Parzen Estimator) and used Hyperband pruning (`HyperbandPruner`) to stop unpromising trials early and speed up the search.
# Model Performance
| Dataset | Accuracy | Precision |
|---------|----------|-----------|
| Train | 0.9894 | 0.9673 |
| Test | 0.9788 | 0.9548 |
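
For reference, the two metrics can be computed as in the sketch below. It assumes that non-content (label `1`) is treated as the positive class for Precision; the model card does not state which class or averaging scheme was used, so treat that choice as an assumption.

```python
def accuracy_and_precision(y_true, y_pred, positive=1):
    """Return (accuracy, precision) for binary labels.

    `positive` is the class treated as positive (assumed: 1 = non-content).
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    false_pos = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, precision


# With an 80/20 class mix, always predicting "main content" still reaches
# 0.8 accuracy, while precision on the minority class is 0.0.
y_true = [0] * 8 + [1] * 2
print(accuracy_and_precision(y_true, [0] * 10))
```

This is why accuracy alone is not a sufficient measure on an imbalanced dataset.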
# Usage
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Label mapping: 0 = "正文" (main content), 1 = "非正文" (non-content)
ID2LABEL = {0: "正文", 1: "非正文"}

model_name = 'ytzfhqs/Qwen2.5-med-book-main-classification'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = '下列为修订说明'  # "The following are the revision notes"
encoding = tokenizer(text, return_tensors='pt')
# Move the input tensors to the same device as the model
encoding = {k: v.to(model.device) for k, v in encoding.items()}

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**encoding)
logits = outputs.logits
pred_id = torch.argmax(logits, dim=-1).item()  # avoids shadowing the builtin `id`
response = ID2LABEL[pred_id]
print(response)
# 非正文
```
# For Batch Usage
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Label mapping: 0 = "正文" (main content), 1 = "非正文" (non-content)
ID2LABEL = {0: "正文", 1: "非正文"}

model_name = 'ytzfhqs/Qwen2.5-med-book-main-classification'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
# Batches need padding; this example pads on the left
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

text = [
    '下列为修订说明',
    '阴离子间隙是一项受到广泛重视的酸碱指标。AG是一个计算值,指血浆中未测定的阴离子与未测定的阳离子的差值,正常机体血浆中的阳离子与阴离子总量相等,均为151mmol/L,从而维持电荷平衡。',
]
encoding = tokenizer(text, return_tensors='pt', padding=True)
# Move the input tensors to the same device as the model
encoding = {k: v.to(model.device) for k, v in encoding.items()}

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**encoding)
logits = outputs.logits
pred_ids = torch.argmax(logits, dim=-1).tolist()
response = [ID2LABEL[pred_id] for pred_id in pred_ids]
print(response)
# ['非正文', '正文']
```
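
To apply the classifier to a whole MinerU `Markdown` file, one option is to split the text on `\n` (matching how the training samples were built), classify the lines in batches, and keep only those labeled `正文`. The sketch below is illustrative: `batched`, `keep_main_content`, and `fake_classify` are hypothetical helpers, and `fake_classify` is a keyword stub standing in for a function that wraps the tokenizer and model calls from the batch example.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def keep_main_content(markdown_text, classify_batch, batch_size=16):
    """Return the non-empty lines that classify_batch labels as "正文".

    classify_batch takes a list of strings and returns one label per string.
    """
    lines = [line.strip() for line in markdown_text.split("\n") if line.strip()]
    kept = []
    for batch in batched(lines, batch_size):
        labels = classify_batch(batch)
        kept.extend(line for line, label in zip(batch, labels) if label == "正文")
    return kept


def fake_classify(batch):
    # Demonstration stub only; replace with a call to the real model.
    return ["非正文" if "修订说明" in text else "正文" for text in batch]


document = "下列为修订说明\n\n阴离子间隙是一项受到广泛重视的酸碱指标。"
print(keep_main_content(document, fake_classify))
```

Passing the classifier in as a callable keeps the splitting and filtering logic testable without loading the model.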
<a id="chinese"></a>
# Qwen2.5-med-book-main-classification
[[中文]](#chinese) [[English]](#english)
该模型为[EDCP(Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP)项目的中间产物,主要用来区分使用[MinerU](https://github.com/opendatalab/MinerU)进行OCR后的**医学教科书**的正文与非正文(书本简介、出版社信息、编写规范、修订说明)样本。基础模型使用[Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B),避免了BERT Tokenizer长度的限制,并且提供了更高的精度。
# 数据组成
- 数据由教科书PDF扫描件,经过[MinerU](https://github.com/opendatalab/MinerU)进行`OCR`后生成的`Markdown`文件。经过简单的正则表达式清洗,使用`\n`进行分割样本,经过`Bloom`过滤器去重,最终产生了5W条样本。由于涉及一些法律条款,我们可能没有计划公开数据集。
- 由于教科书的特性,样本大多为正文样本,根据统计,在我们的数据集中,正文样本占总样本的79.89%(4W条),非正文样本占总样本的20.13%(1W条)。由于数据的不平衡性,我们综合考虑模型在测试集上的Precision和Accuracy指标。
- 为了保证测试集与训练集数据分布一致,我们使用分层抽样,选取10%的数据构成测试集。
# 训练技巧
- 为了尽可能提高模型精度,我们使用了贝叶斯优化(TPE算法)和Hyperband修剪器(HyperbandPruner)加快模型调参效率。
# 模型表现
| Dataset | Accuracy | Precision |
|---------|----------|-----------|
| Train | 0.9894 | 0.9673 |
| Test | 0.9788 | 0.9548 |
# Usage
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Label mapping: 0 = "正文" (main content), 1 = "非正文" (non-content)
ID2LABEL = {0: "正文", 1: "非正文"}

model_name = 'ytzfhqs/Qwen2.5-med-book-main-classification'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = '下列为修订说明'
encoding = tokenizer(text, return_tensors='pt')
# Move the input tensors to the same device as the model
encoding = {k: v.to(model.device) for k, v in encoding.items()}

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**encoding)
logits = outputs.logits
pred_id = torch.argmax(logits, dim=-1).item()  # avoids shadowing the builtin `id`
response = ID2LABEL[pred_id]
print(response)
# 非正文
```
# For Batch Usage
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Label mapping: 0 = "正文" (main content), 1 = "非正文" (non-content)
ID2LABEL = {0: "正文", 1: "非正文"}

model_name = 'ytzfhqs/Qwen2.5-med-book-main-classification'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
# Batches need padding; this example pads on the left
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

text = [
    '下列为修订说明',
    '阴离子间隙是一项受到广泛重视的酸碱指标。AG是一个计算值,指血浆中未测定的阴离子与未测定的阳离子的差值,正常机体血浆中的阳离子与阴离子总量相等,均为151mmol/L,从而维持电荷平衡。',
]
encoding = tokenizer(text, return_tensors='pt', padding=True)
# Move the input tensors to the same device as the model
encoding = {k: v.to(model.device) for k, v in encoding.items()}

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**encoding)
logits = outputs.logits
pred_ids = torch.argmax(logits, dim=-1).tolist()
response = [ID2LABEL[pred_id] for pred_id in pred_ids]
print(response)
# ['非正文', '正文']
```