Citation Pre-Screening
Overview
- Model type: Language Model
- Architecture: DistilBERT
- Language: Multilingual
- License: Apache 2.0
- Task: Binary Classification (Citation Pre-Screening)
- Dataset: SIRIS-Lab/citation-parser-TYPE
Model description
The Citation Pre-Screening model is part of the Citation Parser package and is fine-tuned to classify citation texts as valid or invalid. Based on DistilBERT, it is designed for automated citation-processing workflows and is a core component of the Citation Parser tool for citation metadata extraction and validation.
The model was trained on a dataset of citation texts labeled True (valid citation) or False (invalid citation). The dataset contains 3599 training samples and 400 test samples, each consisting of citation-related text and its label.
Fine-tuning started from the DistilBERT-base-multilingual-cased checkpoint, so the model can handle multilingual text, although it was evaluated only on English citation data.
Intended Usage
This model is intended to classify raw citation text as either a valid or invalid citation based on the provided input. It is ideal for automating the pre-screening process in citation databases or manuscript workflows.
How to use
from transformers import pipeline
# Load the model
citation_classifier = pipeline("text-classification", model="sirisacademic/citation-pre-screening")
# Example citation text
citation_text = "MURAKAMI, Hç‰: 'Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate', 《EUROPEAN POLYMER JOURNAL》"
# Classify the citation
result = citation_classifier(citation_text)
print(result)
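The pipeline returns a list of dictionaries, each with a label and a score. The exact label strings depend on the model's configuration (check `citation_classifier.model.config.id2label`); assuming they mirror the dataset's True/False labels, a small helper can turn the raw output into a boolean decision. This is a sketch, and the label names and threshold are assumptions, not taken from the model card:

```python
# Sketch: convert pipeline output into a boolean validity decision.
# Assumes the model emits "True"/"False" labels mirroring the dataset;
# confirm against citation_classifier.model.config.id2label.

def is_valid_citation(results, threshold=0.5):
    """Return True if the top prediction is the 'True' label above threshold."""
    top = max(results, key=lambda r: r["score"])
    return top["label"] == "True" and top["score"] >= threshold

# Example with a mocked pipeline output
mock_output = [{"label": "True", "score": 0.97}]
print(is_valid_citation(mock_output))  # True
```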
Training
The model was trained using the Citation Pre-Screening Dataset consisting of:
- Training data: 3599 samples
- Test data: 400 samples
The following hyperparameters were used for training:
- Model Path: distilbert/distilbert-base-multilingual-cased
- Batch Size: 32
- Number of Epochs: 4
- Learning Rate: 2e-5
- Max Sequence Length: 512
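These hyperparameters also fix the training length: with 3599 samples and a batch size of 32, one epoch is ceil(3599/32) = 113 optimizer steps, so 4 epochs come to 452 steps in total. A back-of-the-envelope check, assuming no gradient accumulation:

```python
import math

# Back-of-the-envelope training length from the listed hyperparameters
# (assumes no gradient accumulation and that the last partial batch is kept).
train_samples = 3599
batch_size = 32
epochs = 4

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 113 452
```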
Evaluation Metrics
The model's performance was evaluated on the test set, and the following results were obtained:
Metric | Value
---|---
Accuracy | 0.95
Macro avg F1 | 0.94
Weighted avg F1 | 0.95
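Accuracy and macro-averaged F1 weigh the two classes differently: accuracy is the fraction of correct predictions overall, while macro F1 averages the per-class F1 scores, so it penalizes weak performance on the rarer class. A minimal sketch of the two metrics on toy binary labels (the toy data is illustrative only, not the model's test set):

```python
# Minimal sketch of accuracy vs. macro-averaged F1 on toy binary labels.
# The example labels below are illustrative only, not the model's test set.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

y_true = [True, True, True, False, False]
y_pred = [True, True, False, False, True]
print(accuracy(y_true, y_pred))  # 0.6
print(macro_f1(y_true, y_pred))
```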
Additional information
Authors
- SIRIS Lab, Research Division of SIRIS Academic.
License
This work is distributed under the Apache License, Version 2.0.
Contact
For further information, send an email to either nicolau.duransilva@sirisacademic.com or info@sirisacademic.com.