metadata

license: apache-2.0
language:
  - zh
metrics:
  - accuracy
  - precision
base_model:
  - Qwen/Qwen2.5-0.5B

Qwen2.5-med-book-main-classification

This model is an intermediate product of the EPCD (Easy-Data-Clean-Pipeline) project. It classifies text from medical textbooks, after OCR with MinerU, as either main content or non-content (such as book introductions, publisher information, writing standards, and revision notes). The base model is Qwen2.5-0.5B, which avoids the length limitation of the BERT tokenizer while providing higher accuracy.

Data Composition

The data consists of scanned PDF copies of textbooks, converted into Markdown files through OCR with MinerU. After simple regex-based cleaning, the text was split into samples on the newline character (\n), and a Bloom filter (a probabilistic set-membership structure) was used for exact-match deduplication, resulting in 50,000 samples. For legal reasons, we may not make the dataset publicly available.
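The split-and-deduplicate step can be sketched as follows. This is a minimal illustration, not the project's actual code: the `BloomFilter` class, its bit-array size, its hash count, and the `split_and_deduplicate` helper are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; the size and hash count are illustrative assumptions."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive one bit position per salted SHA-256 digest of the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def split_and_deduplicate(markdown_text: str) -> list[str]:
    """Split on newlines and keep the first occurrence of each non-empty line."""
    seen = BloomFilter()
    samples = []
    for line in markdown_text.split("\n"):
        line = line.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        samples.append(line)
    return samples


# Toy input: two distinct lines, each repeated once, plus an empty line.
doc = "第一章 绪论\n\n人民卫生出版社\n第一章 绪论\n人民卫生出版社"
print(split_and_deduplicate(doc))
```

A Bloom filter never misses a true duplicate, but it can occasionally report an unseen line as already seen, so a small number of unique samples may be dropped. A larger bit array lowers that rate at the cost of memory.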
Because of the nature of textbooks, most samples are main content: 79.89% of the dataset (about 40,000 samples) is main content, and the remaining roughly 20% (about 10,000 samples) is non-content. Given this class imbalance, we report both Precision and Accuracy on the test set.
To ensure consistency in the data distribution between the test set and the training set, we used stratified sampling to select 10% of the data as the test set.
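A stratified 10% split can be sketched in plain Python as below. The function name `stratified_split`, the seed, and the toy labels are assumptions for illustration; the source does not say which tool was used (scikit-learn's `train_test_split` with its `stratify` argument is a common alternative).

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.1, seed=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        # Shuffle within each class, then take the same fraction from every class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with an 80/20 class ratio: 0 = main content, 1 = non-content.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "non-content samples in the test set")
```

Sampling within each class separately is what keeps the test set at the same class ratio as the training set, which a plain random split only achieves on average.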

Training Techniques

To maximize model accuracy, we tuned hyperparameters with Bayesian optimization (the TPE algorithm) and used Hyperband pruning (HyperbandPruner) to stop unpromising trials early and speed up the search.

Model Performance

Dataset	Accuracy	Precision
Train	0.9894	0.9673
Test	0.9788	0.9548
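For reference, the two metrics can be computed as in the sketch below. It assumes that non-content (label 1) is treated as the positive class for Precision, which the source does not state; the helper name and toy labels are also illustrative.

```python
def accuracy_and_precision(y_true, y_pred, positive_label=1):
    """Compute accuracy and precision for one positive class (assumed: 1 = non-content)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive_label and p == positive_label for t, p in zip(y_true, y_pred))
    pred_pos = sum(p == positive_label for p in y_pred)
    accuracy = correct / len(y_true)
    # Precision is undefined when nothing is predicted positive; report 0.0 in that case.
    precision = true_pos / pred_pos if pred_pos else 0.0
    return accuracy, precision


# Toy example: 10 samples, 3 of which are truly non-content.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 0, 1, 1, 0]
acc, prec = accuracy_and_precision(y_true, y_pred)
print(f"accuracy={acc:.4f} precision={prec:.4f}")
```

With roughly 80% of samples being main content, a classifier that always predicts main content would already reach about 80% accuracy, which is why a second metric is reported alongside it.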

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Label ids: 0 = "正文" (main content), 1 = "非正文" (non-content)
ID2LABEL = {0: "正文", 1: "非正文"}

model_name = 'ytzfhqs/Qwen2.5-med-book-main-classification'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = '下列为修订说明'  # "The following are the revision notes"
encoding = tokenizer(text, return_tensors='pt')
# Move the input tensors to the same device as the model
encoding = {k: v.to(model.device) for k, v in encoding.items()}
# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**encoding)
logits = outputs.logits
# Named label_id to avoid shadowing the built-in id()
label_id = torch.argmax(logits, dim=-1).item()
response = ID2LABEL[label_id]
print(response)
# 非正文

Batch Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Label ids: 0 = "正文" (main content), 1 = "非正文" (non-content)
ID2LABEL = {0: "正文", 1: "非正文"}

model_name = 'ytzfhqs/Qwen2.5-med-book-main-classification'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
# Left padding keeps the last real token at the end of every sequence in the batch
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

# First item: a revision-notes line; second item: a textbook passage on the anion gap
texts = ['下列为修订说明','阴离子间隙是一项受到广泛重视的酸碱指标。AG是一个计算值，指血浆中未测定的阴离子与未测定的阳离子的差值，正常机体血浆中的阳离子与阴离子总量相等，均为151mmol/L，从而维持电荷平衡。']
encoding = tokenizer(texts, return_tensors='pt', padding=True)
# Move the input tensors to the same device as the model
encoding = {k: v.to(model.device) for k, v in encoding.items()}
# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**encoding)
logits = outputs.logits
label_ids = torch.argmax(logits, dim=-1).tolist()
response = [ID2LABEL[label_id] for label_id in label_ids]
print(response)
# ['非正文', '正文']