metadata

license: apache-2.0
language:
  - zh
metrics:
  - accuracy
  - precision
base_model:
  - Qwen/Qwen2.5-0.5B

This model is an intermediate product of the EPCD (Easy-Data-Clean-Pipeline) project. It is primarily used to distinguish main content from non-content (such as book introductions, publisher information, writing standards, and revision notes) in medical textbooks that have been OCR-processed with MinerU. The base model is Qwen2.5-0.5B, which avoids the length limitation of the BERT tokenizer while providing higher accuracy.

Data Composition

The data consists of scanned PDF copies of textbooks, converted into Markdown files through OCR with MinerU. After a simple regex-based cleaning pass, the text was split into samples on newline characters (\n), and a Bloom filter (a probabilistic set-membership structure) was used for exact-match deduplication, resulting in 50,000 samples. Due to legal considerations, we may not make the dataset publicly available.
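The split-and-deduplicate step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `BloomFilter` class, its bit-array size and hash count, and the choice to strip whitespace and skip empty lines are all assumptions made for the example. A Bloom filter can report false positives, so a small fraction of unique lines may be dropped; it never lets an already-seen line through.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: k bit positions per item over a fixed-size bit array."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        # Sizes are illustrative assumptions, not values from the project.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive k positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def split_and_deduplicate(markdown_text):
    """Split text on newlines and keep the first occurrence of each line."""
    bloom = BloomFilter()
    samples = []
    for line in markdown_text.split("\n"):
        line = line.strip()
        # Skip blank lines and lines the filter reports as already seen.
        if not line or line in bloom:
            continue
        bloom.add(line)
        samples.append(line)
    return samples


doc = "Chapter 1 Introduction\nPublisher information\n\nChapter 1 Introduction\nCells are the basic unit of life"
print(split_and_deduplicate(doc))
```

The appeal of a Bloom filter here is memory: it stores a fixed number of bits regardless of how long each line is, at the cost of a small, tunable false-positive rate.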
Due to the nature of textbooks, most samples are main content. In our dataset, 79.89% (about 40,000) are main content samples, while 20.13% (about 10,000) are non-content samples. Because of this class imbalance, we evaluate the model on the test set using both precision and accuracy.
To ensure consistency in the data distribution between the test set and the training set, we used stratified sampling to select 10% of the data as the test set.
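A stratified 10% hold-out can be sketched in plain Python as below. The function name `stratified_split`, the seed, and the toy 80/20 label distribution are assumptions for illustration; the source does not say which tool was used for the split.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.1, seed=42):
    """Hold out the same fraction of every class so both sets share one label ratio."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


# Toy data: 80 main-content samples (label 0) and 20 non-content samples (label 1).
labels = [0] * 80 + [1] * 20
samples = [f"sample {i}" for i in range(100)]
train, test = stratified_split(samples, labels)
print(len(train), len(test))                       # 90 10
print(sum(label for _, label in test) / len(test)) # 0.2
```

Sampling within each class separately is what keeps the non-content share of the test set equal to its share of the full dataset.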

Training Techniques

To maximize model accuracy, we tuned hyperparameters with Bayesian optimization (the TPE algorithm), using Hyperband pruning (HyperbandPruner) to stop unpromising trials early and speed up the search.

Model Performance

Dataset	Accuracy	Precision
Train	0.9894	0.9673
Test	0.9788	0.9548
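For reference, the two metrics in the table can be computed as in the sketch below. The source does not state which class is treated as positive or how precision is averaged; this example assumes label 1 (non-content) is the positive class, and the function name is illustrative.

```python
def accuracy_and_precision(y_true, y_pred, positive=1):
    """Return (accuracy, precision) for binary labels.

    Assumption: `positive` is the class whose precision is reported.
    """
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    pred_pos = sum(p == positive for p in y_pred)

    accuracy = correct / len(y_true)
    # With no positive predictions, precision is undefined; report 0.0 here.
    precision = true_pos / pred_pos if pred_pos else 0.0
    return accuracy, precision


# Imbalanced toy labels: 8 main-content (0) and 2 non-content (1) samples.
y_true = [0] * 8 + [1] * 2

# A classifier that always predicts "main content" still scores 80% accuracy.
print(accuracy_and_precision(y_true, [0] * 10))  # (0.8, 0.0)

# One false positive and one missed positive: same accuracy, precision 0.5.
print(accuracy_and_precision(y_true, [0] * 7 + [1, 1, 0]))  # (0.8, 0.5)
```

The first case shows why accuracy alone is not enough on an 80/20 dataset: a trivial majority-class predictor looks strong on accuracy while being useless for finding non-content lines.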

Usage

import torch
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

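# Label ids: 0 = "正文" (main content), 1 = "非正文" (non-content)
ID2LABEL = {0: "正文", 1: "非正文"}

model_name = 'ytzfhqs/Qwen2.5-med-book-main-classification'
# device_map="auto" requires the accelerate package to be installed
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example input, in English: "The following are revision notes"
text = '下列为修订说明'
encoding = tokenizer(text, return_tensors='pt')
# Move the input tensors to the same device as the model
encoding = {k: v.to(model.device) for k, v in encoding.items()}
# Inference only, so disable gradient tracking
with torch.no_grad():
    outputs = model(**encoding)
logits = outputs.logits
# Pick the highest-scoring class (named label_id to avoid shadowing the built-in id)
label_id = torch.argmax(logits, dim=-1).item()
response = ID2LABEL[label_id]
print(response)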