metadata

license: apache-2.0
language:
  - zh
metrics:
  - accuracy
  - precision
base_model:
  - Qwen/Qwen2.5-0.5B

Qwen2.5-med-book-main-classification


This model is an intermediate product of the EPCD (Easy-Data-Clean-Pipeline) project. It classifies text samples from medical textbooks that have been OCR-processed with MinerU as either main content or non-content (such as book introductions, publisher information, writing standards, and revision notes). The base model is Qwen2.5-0.5B, which avoids the length limitation of the BERT tokenizer while providing higher accuracy.

Data Composition

The data consists of scanned textbook PDFs converted to Markdown files by OCR with MinerU. After simple regex-based cleaning, the text was split into samples on \n and exact duplicates were removed with a Bloom filter, yielding about 50,000 samples. For legal reasons, we do not currently plan to release the dataset publicly.
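
The split-and-deduplicate step can be sketched as follows. This is a minimal illustration, not the project's actual code: the `BloomFilter` class, its size parameters, and the `split_and_dedupe` helper are all hypothetical names introduced here.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: a bit array probed at several hashed positions."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        # Sizes are illustrative assumptions, not values from the project.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive all probe positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def split_and_dedupe(markdown_text: str) -> list:
    """Split on newlines, drop blank lines, keep the first copy of each sample."""
    seen = BloomFilter()
    samples = []
    for line in markdown_text.split("\n"):
        sample = line.strip()
        if not sample or sample in seen:
            continue
        seen.add(sample)
        samples.append(sample)
    return samples


doc = "Chapter 1\n\nPublisher information\nChapter 1\nThe cell is the basic unit."
print(split_and_dedupe(doc))
```

A Bloom filter never misses a repeated sample, but it can occasionally report an unseen sample as already seen, so a small number of unique samples may be dropped. Larger bit arrays lower that rate.
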
Because of the nature of textbooks, most samples are main content: in our dataset, 79.89% (about 40,000) of the samples are main content and 20.13% (about 10,000) are non-content. Given this class imbalance, we report both precision and accuracy on the test set.
To keep the class distribution of the test set consistent with that of the training set, we used stratified sampling to hold out 10% of the data as the test set.
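
Stratified sampling of this kind might look like the sketch below. The `stratified_split` helper, the seed, and the toy labels are assumptions made for illustration. The source does not say which tool performed the split.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class so proportions are preserved.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels with an 80/20 imbalance: 0 = main content, 1 = non-content.
labels = [0] * 800 + [1] * 200
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))      # 900 100
print(sum(labels[i] for i in test_idx))   # 20
```

Because each class is sampled on its own, the test set keeps the same 80/20 ratio as the full toy dataset.
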

Training Techniques

To maximize model accuracy, we tuned hyperparameters with Bayesian optimization (the TPE algorithm) and used Hyperband pruning (HyperbandPruner) to stop unpromising trials early and speed up the search.

Model Performance

Dataset	Accuracy	Precision
Train	0.9894	0.9673
Test	0.9788	0.9548
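
To illustrate why accuracy alone can mislead on imbalanced data, here is a minimal sketch that computes both metrics. It treats non-content (label 1) as the positive class for precision, which is an assumption: the source does not state which class or averaging scheme was used. The toy predictions are invented for illustration.

```python
def accuracy_and_precision(y_true, y_pred, positive=1):
    """Compute accuracy and precision for one positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if p == positive and t == positive)
    false_pos = sum(1 for t, p in pairs if p == positive and t != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    predicted_pos = true_pos + false_pos
    # Define precision as 0.0 when nothing is predicted positive.
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, precision


# Toy data with an 80/20 imbalance: 0 = main content, 1 = non-content.
y_true = [0] * 80 + [1] * 20

# A degenerate classifier that always predicts main content.
always_main = [0] * 100
print(accuracy_and_precision(y_true, always_main))  # (0.8, 0.0)

# A classifier that finds 18 of 20 non-content samples with 2 false alarms.
decent = [1] * 2 + [0] * 78 + [1] * 18 + [0] * 2
print(accuracy_and_precision(y_true, decent))       # (0.96, 0.9)
```

The always-main classifier reaches 80% accuracy while identifying no non-content at all, which is why a second metric is reported alongside accuracy.
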

Usage

import torch
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

# Label names: "正文" = main content, "非正文" = non-content
ID2LABEL = {0: "正文", 1: "非正文"}

model_name = 'ytzfhqs/Qwen2.5-med-book-main-classification'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

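# Sample input, meaning "The following are the revision notes"
text = '下列为修订说明'
encoding = tokenizer(text, return_tensors='pt')
encoding = {k: v.to(model.device) for k, v in encoding.items()}
# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**encoding)
logits = outputs.logits
# Named pred_id to avoid shadowing the built-in id()
pred_id = torch.argmax(logits, dim=-1).item()
response = ID2LABEL[pred_id]
print(response)
# 非正文 (non-content)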

Batch Usage

import torch
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

# Label names: "正文" = main content, "非正文" = non-content
ID2LABEL = {0: "正文", 1: "非正文"}

model_name = 'ytzfhqs/Qwen2.5-med-book-main-classification'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

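# Two samples: a revision-notes heading and a body paragraph about the anion gap
text = ['下列为修订说明','阴离子间隙是一项受到广泛重视的酸碱指标。AG是一个计算值，指血浆中未测定的阴离子与未测定的阳离子的差值，正常机体血浆中的阳离子与阴离子总量相等，均为151mmol/L，从而维持电荷平衡。']
# Pad to the longest sample so the batch forms a rectangular tensor
encoding = tokenizer(text, return_tensors='pt', padding=True)
encoding = {k: v.to(model.device) for k, v in encoding.items()}
# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**encoding)
logits = outputs.logits
ids = torch.argmax(logits, dim=-1).tolist()
response = [ID2LABEL[i] for i in ids]
print(response)
# ['非正文', '正文'] (non-content, main content)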
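
For cleaning a whole document, the batch call above could be wrapped in a filter like the sketch below. The `keep_main_content` helper, its `classify_batch` parameter, and the batch size are hypothetical. `classify_batch` stands for any function that maps a list of samples to label strings, such as a wrapper around the batch snippet. A stub classifier is used here so the example runs without downloading the model.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def keep_main_content(
    samples: Sequence[str],
    classify_batch: Callable[[Sequence[str]], List[str]],
    batch_size: int = 16,
    main_label: str = "正文",
) -> List[str]:
    """Return only the samples that the classifier labels as main content."""
    kept = []
    for batch in batched(samples, batch_size):
        labels = classify_batch(batch)
        kept.extend(s for s, label in zip(batch, labels) if label == main_label)
    return kept


def stub_classify(batch: Sequence[str]) -> List[str]:
    # Stand-in for the real model, used only to make this sketch runnable.
    return ["非正文" if "修订说明" in s else "正文" for s in batch]


samples = ["下列为修订说明", "阴离子间隙是一项受到广泛重视的酸碱指标。"]
print(keep_main_content(samples, stub_classify))
```

Passing the classifier in as a function keeps the filtering logic independent of the model, so it can be checked with a stub before running real inference.
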
