Citation Pre-Screening
Overview
- Model type: Language Model
- Architecture: DistilBERT
- Language: Multilingual
- License: Apache 2.0
- Task: Binary Classification (Citation Pre-Screening)
- Dataset: SIRIS-Lab/citation-parser-TYPE
Model description
The Citation Pre-Screening model is part of the Citation Parser package and is fine-tuned to classify citation texts as valid or invalid. Based on DistilBERT, it is designed for automated citation-processing workflows and is a core component of the Citation Parser tool for citation metadata extraction and validation.
The model was trained on a dataset of citation texts labeled True (valid citation) or False (invalid citation). The dataset contains 3599 training samples and 400 test samples, each consisting of citation-related text and its label.
Fine-tuning started from the DistilBERT-base-multilingual-cased checkpoint, so the model can handle multilingual text, although it was evaluated only on English citation data.
Intended Usage
This model is intended to classify raw citation text as either a valid or invalid citation based on the provided input. It is ideal for automating the pre-screening process in citation databases or manuscript workflows.
How to use
from transformers import pipeline
# Load the model
citation_classifier = pipeline("text-classification", model="sirisacademic/citation-pre-screening")
# Example citation text
citation_text = "MURAKAMI, Hç‰: 'Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate', 《EUROPEAN POLYMER JOURNAL》"
# Classify the citation
result = citation_classifier(citation_text)
print(result)
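The pipeline returns a list of dictionaries, each with a label and a score. The exact label strings depend on the model's configuration (check `citation_classifier.model.config.id2label`); assuming they mirror the dataset's True/False labels, a small helper can turn the raw output into a boolean decision. This is a sketch, and the label names and threshold are assumptions, not taken from the model card:

```python
# Sketch: convert pipeline output into a boolean validity decision.
# Assumes the model emits "True"/"False" labels mirroring the dataset;
# confirm against citation_classifier.model.config.id2label.

def is_valid_citation(results, threshold=0.5):
    """Return True if the top prediction is the 'True' label above threshold."""
    top = max(results, key=lambda r: r["score"])
    return top["label"] == "True" and top["score"] >= threshold

# Example with a mocked pipeline output
mock_output = [{"label": "True", "score": 0.97}]
print(is_valid_citation(mock_output))  # True
```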
Training
The model was trained using the Citation Pre-Screening Dataset consisting of:
- Training data: 3599 samples
- Test data: 400 samples
The following hyperparameters were used for training:
- Model Path: distilbert/distilbert-base-multilingual-cased
- Batch Size: 32
- Number of Epochs: 4
- Learning Rate: 2e-5
- Max Sequence Length: 512
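These hyperparameters also fix the training length: with 3599 samples and a batch size of 32, one epoch is ceil(3599/32) = 113 optimizer steps, so 4 epochs come to 452 steps in total. A back-of-the-envelope check, assuming no gradient accumulation:

```python
import math

# Back-of-the-envelope training length from the listed hyperparameters
# (assumes no gradient accumulation and that the last partial batch is kept).
train_samples = 3599
batch_size = 32
epochs = 4

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 113 452
```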
Evaluation Metrics
The model's performance was evaluated on the test set, and the following results were obtained:
Metric | Value
---|---
Accuracy | 0.95
Macro avg F1 | 0.94
Weighted avg F1 | 0.95
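Accuracy and macro-averaged F1 weigh the two classes differently: accuracy is the fraction of correct predictions overall, while macro F1 averages the per-class F1 scores, so it penalizes weak performance on the rarer class. A minimal sketch of the two metrics on toy binary labels (the toy data is illustrative only, not the model's test set):

```python
# Minimal sketch of accuracy vs. macro-averaged F1 on toy binary labels.
# The example labels below are illustrative only, not the model's test set.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

y_true = [True, True, True, False, False]
y_pred = [True, True, False, False, True]
print(accuracy(y_true, y_pred))  # 0.6
print(macro_f1(y_true, y_pred))
```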
Additional information
Authors
- SIRIS Lab, Research Division of SIRIS Academic.
License
This work is distributed under the Apache License, Version 2.0.
Contact
For further information, send an email to either nicolau.duransilva@sirisacademic.com or info@sirisacademic.com.