---
license: unknown
datasets:
  - ncbi/pubmed
language:
  - en
metrics:
  - f1
base_model:
  - microsoft/deberta-v3-base
pipeline_tag: token-classification
tags:
  - NER
  - phenotypes
  - diseases
  - bio
  - classification
---

# NCBI_NER

## How to Use the Model for Inference

You can use the Hugging Face pipeline for easy inference:

```python
from transformers import pipeline

# Load the model and tokenizer from the Hub
model_path = "venkatd/NCBI_NER"
pipe = pipeline(
    task="token-classification",
    model=model_path,
    tokenizer=model_path,
    aggregation_strategy="simple",
)

# Test the pipeline
text = ("A 48-year-old female presented with vaginal bleeding and abnormal Pap smears. "
        "Upon diagnosis of invasive non-keratinizing SCC of the cervix, she underwent a radical "
        "hysterectomy with salpingo-oophorectomy which demonstrated positive spread to the pelvic "
        "lymph nodes and the parametrium.")
result = pipe(text)
print(result)
```

## Output Example

Each prediction includes the entity group (`Disease`), a confidence score, the matched text, and its start/end character offsets. Here's a sample output format:

```python
[
    {
        "entity_group": "Disease",
        "score": 0.98,
        "word": "SCC of the cervix",
        "start": 122,
        "end": 139
    },
    ...
]
```

## Model Summary and Training Details

### Model Architecture

- **Base Model:** microsoft/deberta-v3-base
- **Task:** token classification for Named Entity Recognition (NER), focused on disease entities
- **Number of Labels:** 3 (`O`, `B-Disease`, `I-Disease`)
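
The label set follows the standard BIO scheme. As an illustration (not taken from the model's own code; the index order is an assumption, and the authoritative mapping lives in the model's `config.json`), a small helper that collapses a BIO tag sequence into entity spans:

```python
# Assumed BIO label scheme for disease NER; verify against config.json.
id2label = {0: "O", 1: "B-Disease", 2: "I-Disease"}
label2id = {v: k for k, v in id2label.items()}

def bio_to_spans(labels):
    """Collapse a BIO tag sequence into (start, end, type) token spans."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "B-Disease":
            if start is not None:
                spans.append((start, i, "Disease"))
            start = i
        elif lab == "I-Disease":
            if start is None:  # tolerate I- without a preceding B-
                start = i
        else:  # "O"
            if start is not None:
                spans.append((start, i, "Disease"))
                start = None
    if start is not None:
        spans.append((start, len(labels), "Disease"))
    return spans
```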

### Dataset

- **Dataset:** NCBI Disease Corpus
  - **Description:** The NCBI Disease corpus is a specialized medical dataset of 793 PubMed abstracts. It is designed to support identification of disease mentions in scientific literature; each mention is annotated with disease concepts from the MeSH (Medical Subject Headings) or OMIM (Online Mendelian Inheritance in Man) databases.
  - **Split:**
    - Training set: 593 abstracts
    - Development (validation) set: 100 abstracts
    - Test set: 100 abstracts
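
Training on this corpus requires aligning the word-level BIO labels to the subword tokens that DeBERTa produces. The model card does not show this step, but a typical alignment sketch looks like the following (continuation subwords and special tokens get label `-100` so the cross-entropy loss ignores them; the `word_ids` argument stands in for a fast tokenizer's `word_ids()` output):

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level label ids onto subword tokens.

    word_ids: one entry per subword token -- the index of the source word,
    or None for special tokens ([CLS]/[SEP]).
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)      # special token
        elif wid != prev:
            aligned.append(word_labels[wid])  # first subword keeps the label
        else:
            aligned.append(ignore_index)      # continuation subword ignored
        prev = wid
    return aligned
```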

### Training Details

- **Loss:** cross-entropy loss for token classification. Gradient accumulation was used to stabilize the loss and improve resource efficiency.
- **Gradient Accumulation:** 2 steps
- **Batch Size:** 8
- **Device:** trained on a GPU when available, using mixed-precision training for better throughput.
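
A minimal sketch of a training loop with 2-step gradient accumulation and mixed precision, assuming `model`, `train_loader`, and `optimizer` are already constructed (the names and structure are illustrative, not the authors' actual code):

```python
import torch

def train_one_epoch(model, train_loader, optimizer, device, accum_steps=2):
    """One epoch with gradient accumulation and (on GPU) mixed precision."""
    use_amp = device.type == "cuda"
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.autocast(device_type=device.type, enabled=use_amp):
            # Divide the loss so gradients average over the accumulated steps
            loss = model(**batch).loss / accum_steps
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```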

### Optimizer and Learning Rate Scheduler

- **Optimizer:** AdamW
  - Learning rate: 1e-5
  - Betas: (0.9, 0.999)
  - Epsilon: 1e-8
- **Learning Rate Scheduler:** cosine schedule with warmup
  - Warmup steps: 10% of total training steps
  - Total training steps: `len(train_loader) * num_epochs`
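
The schedule above can be written as a pure function. This mirrors the formula behind `transformers.get_cosine_schedule_with_warmup` with its default half-cosine, re-implemented here only for illustration:

```python
import math

def lr_at(step, warmup_steps, total_steps, base_lr=1e-5):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```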

### Epochs and Validation

- **Epochs:** 5
- **Training and Validation Loss:** the loss was stable across the 5 epochs, and the checkpoint with the best validation loss was saved for evaluation.

## Evaluation and Performance

- **Test Dataset F1 Score:** 0.9772
- **Evaluation Metric:** F1 score, which balances precision and recall, was the primary metric used to assess the model's performance.
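
For NER, F1 is usually computed at the entity level (as tools like seqeval do), counting an entity as correct only on an exact span-and-type match. A simplified, exact-match sketch for illustration (this is not the authors' evaluation script):

```python
def entity_f1(true_spans, pred_spans):
    """Micro-averaged entity-level F1 over per-document sets of (start, end, type) spans."""
    tp = sum(len(set(t) & set(p)) for t, p in zip(true_spans, pred_spans))
    n_true = sum(len(t) for t in true_spans)
    n_pred = sum(len(p) for p in pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```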