---
language:
- en
license: apache-2.0
base_model: emilyalsentzer/Bio_ClinicalBERT
tags:
- token-classification
- ner
- pii
- pii-detection
- de-identification
- privacy
- healthcare
- medical
- clinical
- phi
- hipaa
- pytorch
- transformers
- openmed
datasets:
- nvidia/Nemotron-PII
pipeline_tag: token-classification
library_name: transformers
metrics:
- f1
- precision
- recall
model-index:
- name: OpenMed-PII-BioClinicalBERT-110M-v1
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: nvidia/Nemotron-PII (test_strat)
      type: nvidia/Nemotron-PII
      split: test
    metrics:
    - type: f1
      value: 0.9437
      name: F1 (micro)
    - type: precision
      value: 0.9449
      name: Precision
    - type: recall
      value: 0.9426
      name: Recall
widget:
- text: "Dr. Sarah Johnson (SSN: 123-45-6789) can be reached at sarah.johnson@hospital.org or 555-123-4567. She lives at 123 Oak Street, Boston, MA 02108."
  example_title: Clinical Note with PII
---

# OpenMed-PII-BioClinicalBERT-110M-v1

**PII Detection Model** | 110M Parameters | Open Source

## Model Description

**OpenMed-PII-BioClinicalBERT-110M-v1** is a transformer-based token classification model fine-tuned from `emilyalsentzer/Bio_ClinicalBERT` for **Personally Identifiable Information (PII) detection** in English text. It identifies and classifies **54 types of sensitive information**, including names, addresses, SSNs, and medical record numbers.

### Key Features

- **High Accuracy**: Micro F1 of 0.9437 (precision 0.9449, recall 0.9426) on a stratified Nemotron-PII test set
- **Comprehensive Coverage**: Detects 54 entity types spanning personal, financial, medical, and contact information
- **Privacy-Focused**: Designed for de-identification workflows that support compliance with HIPAA, GDPR, and other privacy regulations
- **Production-Ready**: Loads through the standard Transformers `pipeline` and `AutoModelForTokenClassification` APIs for use in text processing pipelines

## Performance

Evaluated on a stratified 2,000-sample test set from NVIDIA Nemotron-PII:

| Metric | Score |
|:---|:---:|
| **Micro F1** | **0.9437** |
| Precision | 0.9449 |
| Recall | 0.9426 |
| Macro F1 | 0.9462 |
| Weighted F1 | 0.9434 |
| Accuracy | 0.9925 |

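
The micro F1 in the table is the harmonic mean of the reported precision and recall, which gives a quick consistency check on the numbers. A minimal sketch of that arithmetic follows; the helper name `f1_score` is illustrative and not part of any library used in this card.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Micro-averaged precision and recall reported in the table above
micro_f1 = f1_score(0.9449, 0.9426)
print(f"{micro_f1:.4f}")  # 0.9437
```

The same function can be applied to any row of the per-entity tables, keeping in mind that those rows are rounded to three decimals.
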
### Top 10 PII Models

| Rank | Model | F1 | Precision | Recall |
|:---:|:---|:---:|:---:|:---:|
| 1 | [OpenMed-PII-SuperClinical-Large-434M-v1](https://huggingface.co/openmed/OpenMed-PII-SuperClinical-Large-434M-v1) | 0.9608 | 0.9685 | 0.9532 |
| 2 | [OpenMed-PII-BigMed-Large-560M-v1](https://huggingface.co/openmed/OpenMed-PII-BigMed-Large-560M-v1) | 0.9604 | 0.9644 | 0.9565 |
| 3 | [OpenMed-PII-EuroMed-210M-v1](https://huggingface.co/openmed/OpenMed-PII-EuroMed-210M-v1) | 0.9600 | 0.9681 | 0.9521 |
| 4 | [OpenMed-PII-SnowflakeMed-568M-v1](https://huggingface.co/openmed/OpenMed-PII-SnowflakeMed-568M-v1) | 0.9594 | 0.9640 | 0.9548 |
| 5 | [OpenMed-PII-SuperMedical-Large-355M-v1](https://huggingface.co/openmed/OpenMed-PII-SuperMedical-Large-355M-v1) | 0.9592 | 0.9632 | 0.9553 |
| 6 | [OpenMed-PII-ClinicalBGE-568M-v1](https://huggingface.co/openmed/OpenMed-PII-ClinicalBGE-568M-v1) | 0.9587 | 0.9636 | 0.9538 |
| 7 | [OpenMed-PII-mClinicalE5-Large-560M-v1](https://huggingface.co/openmed/OpenMed-PII-mClinicalE5-Large-560M-v1) | 0.9582 | 0.9631 | 0.9533 |
| 8 | [OpenMed-PII-ModernMed-Large-395M-v1](https://huggingface.co/openmed/OpenMed-PII-ModernMed-Large-395M-v1) | 0.9579 | 0.9639 | 0.9520 |
| 9 | [OpenMed-PII-BioClinicalModern-Large-395M-v1](https://huggingface.co/openmed/OpenMed-PII-BioClinicalModern-Large-395M-v1) | 0.9579 | 0.9656 | 0.9502 |
| 10 | [OpenMed-PII-ClinicalE5-Large-335M-v1](https://huggingface.co/openmed/OpenMed-PII-ClinicalE5-Large-335M-v1) | 0.9577 | 0.9604 | 0.9550 |

### Best Performing Entities

| Entity | F1 | Precision | Recall | Support |
|:---|:---:|:---:|:---:|:---:|
| `tax_id` | 1.000 | 1.000 | 1.000 | 43 |
| `ssn` | 0.996 | 0.993 | 1.000 | 141 |
| `biometric_identifier` | 0.996 | 1.000 | 0.991 | 232 |
| `email` | 0.995 | 0.995 | 0.995 | 757 |
| `date_of_birth` | 0.995 | 0.989 | 1.000 | 273 |

### Challenging Entities

These entity types have lower performance and may benefit from additional post-processing:

| Entity | F1 | Precision | Recall | Support |
|:---|:---:|:---:|:---:|:---:|
| `fax_number` | 0.870 | 0.810 | 0.940 | 100 |
| `time` | 0.864 | 0.893 | 0.838 | 468 |
| `sexuality` | 0.837 | 0.809 | 0.867 | 83 |
| `gender` | 0.815 | 0.769 | 0.867 | 188 |
| `occupation` | 0.639 | 0.654 | 0.625 | 717 |

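
One possible form of post-processing for the lower-precision labels is a per-label confidence threshold applied to the pipeline output. The sketch below is an illustration only: the threshold values and the names `MIN_SCORE_BY_LABEL` and `filter_by_score` are hypothetical, and it assumes the pipeline's `entity_group` strings match the entity names in the tables above.

```python
# Hypothetical per-label minimum scores; tune these on your own validation data.
DEFAULT_MIN_SCORE = 0.50
MIN_SCORE_BY_LABEL = {
    "occupation": 0.90,
    "gender": 0.85,
    "fax_number": 0.80,
}


def filter_by_score(entities, thresholds=MIN_SCORE_BY_LABEL, default=DEFAULT_MIN_SCORE):
    """Keep only predictions whose score meets the threshold for their label."""
    kept = []
    for ent in entities:
        min_score = thresholds.get(ent["entity_group"], default)
        if ent["score"] >= min_score:
            kept.append(ent)
    return kept


# Entities shaped like the output of pipeline(..., aggregation_strategy="simple")
sample = [
    {"entity_group": "occupation", "word": "nurse", "score": 0.72, "start": 10, "end": 15},
    {"entity_group": "ssn", "word": "123-45-6789", "score": 0.99, "start": 21, "end": 32},
]
print([e["entity_group"] for e in filter_by_score(sample)])  # ['ssn']
```

Raising a threshold trades recall for precision. In a de-identification setting it may be safer to route the dropped low-confidence spans to human review than to discard them.
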
## Supported Entity Types

This model detects **54 PII entity types** organized into categories:

<details>
<summary><strong>Identifiers</strong> (16 types)</summary>

| Entity | Description |
|:---|:---|
| `account_number` | Account Number |
| `api_key` | API Key |
| `bank_routing_number` | Bank Routing Number |
| `certificate_license_number` | Certificate or License Number |
| `credit_debit_card` | Credit or Debit Card |
| `cvv` | CVV |
| `employee_id` | Employee ID |
| `health_plan_beneficiary_number` | Health Plan Beneficiary Number |
| `mac_address` | MAC Address |
| `medical_record_number` | Medical Record Number |
| ... | *and 6 more* |

</details>

<details>
<summary><strong>Personal Info</strong> (14 types)</summary>

| Entity | Description |
|:---|:---|
| `age` | Age |
| `biometric_identifier` | Biometric Identifier |
| `blood_type` | Blood Type |
| `date_of_birth` | Date of Birth |
| `education_level` | Education Level |
| `first_name` | First Name |
| `last_name` | Last Name |
| `gender` | Gender |
| `language` | Language |
| `occupation` | Occupation |
| ... | *and 4 more* |

</details>

<details>
<summary><strong>Contact Info</strong> (4 types)</summary>

| Entity | Description |
|:---|:---|
| `email` | Email |
| `phone_number` | Phone Number |
| `fax_number` | Fax Number |
| `url` | URL |

</details>

<details>
<summary><strong>Location</strong> (6 types)</summary>

| Entity | Description |
|:---|:---|
| `city` | City |
| `coordinate` | Coordinate |
| `country` | Country |
| `county` | County |
| `state` | State |
| `street_address` | Street Address |

</details>

<details>
<summary><strong>Network Info</strong> (3 types)</summary>

| Entity | Description |
|:---|:---|
| `device_identifier` | Device Identifier |
| `ipv4` | IPv4 |
| `ipv6` | IPv6 |

</details>

<details>
<summary><strong>Temporal</strong> (3 types)</summary>

| Entity | Description |
|:---|:---|
| `date` | Date |
| `date_time` | Date Time |
| `time` | Time |

</details>

<details>
<summary><strong>Organization</strong> (1 type)</summary>

| Entity | Description |
|:---|:---|
| `company_name` | Company Name |

</details>

## Usage

### Quick Start

```python
from transformers import pipeline

# Load the PII detection pipeline
ner = pipeline("ner", model="openmed/OpenMed-PII-BioClinicalBERT-110M-v1", aggregation_strategy="simple")

text = """
Patient John Smith (DOB: 03/15/1985, SSN: 123-45-6789) was seen today.
Contact: john.smith@email.com, Phone: (555) 123-4567.
Address: 456 Oak Street, Boston, MA 02108.
"""

entities = ner(text)
for entity in entities:
    print(f"{entity['entity_group']}: {entity['word']} (score: {entity['score']:.3f})")
```

### De-identification Example

```python
def redact_pii(text, entities, placeholder=None):
    """Replace detected PII with placeholders.

    By default each span is replaced by its entity label in brackets.
    Pass `placeholder` (e.g. '[REDACTED]') to use one fixed token instead.
    """
    # Sort entities by start position (descending) so earlier offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    redacted = text
    for ent in sorted_entities:
        replacement = placeholder if placeholder is not None else f"[{ent['entity_group']}]"
        redacted = redacted[:ent['start']] + replacement + redacted[ent['end']:]
    return redacted

# Apply de-identification to `text` and `entities` from the Quick Start
redacted_text = redact_pii(text, entities)
print(redacted_text)
```

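
Replacing spans from right to left assumes that no two spans overlap. If predictions are combined from several sources or edited by post-processing, overlapping spans would corrupt the redacted text. The sketch below shows one way to merge them first; `merge_overlapping` is a hypothetical helper, and keeping the label of the higher-scoring span is an assumption, not a rule from this card.

```python
def merge_overlapping(entities):
    """Merge spans that overlap, keeping the label of the higher-scoring span."""
    merged = []
    for ent in sorted(entities, key=lambda e: (e["start"], e["end"])):
        if merged and ent["start"] < merged[-1]["end"]:
            prev = merged[-1]
            best = ent if ent["score"] > prev["score"] else prev
            merged[-1] = {
                "entity_group": best["entity_group"],
                "score": best["score"],
                "start": prev["start"],
                "end": max(prev["end"], ent["end"]),
            }
        else:
            merged.append(dict(ent))
    return merged


spans = [
    {"entity_group": "first_name", "score": 0.91, "start": 8, "end": 12},
    {"entity_group": "last_name", "score": 0.97, "start": 10, "end": 18},
    {"entity_group": "ssn", "score": 0.99, "start": 30, "end": 41},
]
print([(s["entity_group"], s["start"], s["end"]) for s in merge_overlapping(spans)])
# [('last_name', 8, 18), ('ssn', 30, 41)]
```

Merged spans carry only `entity_group`, `score`, `start`, and `end`, which is all a redaction step based on character offsets needs.
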
### Batch Processing

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "openmed/OpenMed-PII-BioClinicalBERT-110M-v1"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts = [
    "Contact Dr. Jane Doe at jane.doe@hospital.org",
    "Patient SSN: 987-65-4321, MRN: 12345678",
]

# Pad to the longest text in the batch and truncate over-length inputs
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Predicted label id for every token position, shape (batch, sequence_length)
predictions = torch.argmax(outputs.logits, dim=-1)
```

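
`predictions` holds one label id per token position, including special and padding tokens, so it still has to be mapped to label strings and grouped into spans. The sketch below shows the grouping step. It assumes labels of the form `B-<type>`, `I-<type>`, and `O` (the card states BIO tagging but does not list the exact label strings), and `bio_to_spans` is a hypothetical helper.

```python
def bio_to_spans(tags):
    """Group a sequence of BIO tags into (label, start_index, end_index) spans.

    `end_index` is exclusive. An `I-` tag that does not continue the current
    span is treated as the start of a new span.
    """
    spans = []
    current = None  # [label, start, end] of the span being built
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current[0] == label:
            current[2] = i + 1  # extend the open span
            continue
        if current is not None:
            spans.append(tuple(current))
            current = None
        if prefix in ("B", "I"):
            current = [label, i, i + 1]
    if current is not None:
        spans.append(tuple(current))
    return spans


tags = ["O", "B-first_name", "I-first_name", "O", "B-ssn", "I-ssn", "I-ssn"]
print(bio_to_spans(tags))
# [('first_name', 1, 3), ('ssn', 4, 7)]
```

With the batch example, the tags for the first text could be obtained as `[model.config.id2label[i] for i in predictions[0].tolist()]`, after dropping positions where `inputs["attention_mask"]` is 0. The resulting indices are token positions, not character offsets; a fast tokenizer can return character offsets when called with `return_offsets_mapping=True`.
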
## Training Details

### Dataset

- **Source**: [NVIDIA Nemotron-PII](https://huggingface.co/datasets/nvidia/Nemotron-PII)
- **Format**: BIO-tagged token classification
- **Labels**: 106 total (53 entity types × 2 BIO tags + O)
- **Splits**: 50K train / 5K validation / 45K test

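
The label inventory a checkpoint uses is stored in its configuration as `id2label`, so the number of labels and entity types can be read directly from the checkpoint. The sketch below summarizes such a map. It assumes `B-<type>` / `I-<type>` / `O` naming, and `summarize_labels` is a hypothetical helper.

```python
def summarize_labels(id2label):
    """Count entity types and total labels in a BIO label map."""
    labels = list(id2label.values())
    entity_types = sorted({lab.split("-", 1)[1] for lab in labels if "-" in lab})
    return {
        "num_labels": len(labels),
        "num_entity_types": len(entity_types),
        "has_outside_tag": "O" in labels,
    }


# Toy label map with two entity types; a real one comes from model.config.id2label
toy = {0: "O", 1: "B-email", 2: "I-email", 3: "B-ssn", 4: "I-ssn"}
print(summarize_labels(toy))
# {'num_labels': 5, 'num_entity_types': 2, 'has_outside_tag': True}
```

For this model the map could be loaded without the weights via `AutoConfig.from_pretrained(model_name).id2label` from `transformers`.
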
### Training Configuration

- **Max Sequence Length**: 384 tokens
- **Label Strategy**: First token only (`label_all_tokens=False`)
- **Framework**: Hugging Face Transformers + Trainer API

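
The first-token-only label strategy means that when a word is split into several sub-word tokens, only the first one carries the word's label and the rest are excluded from the loss. The sketch below follows the pattern commonly used in Hugging Face token classification examples; it is not confirmed to be the training code for this model, and `align_labels` and the sample ids are hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels, label_all_tokens=False):
    """Map word-level label ids onto sub-word tokens.

    `word_ids` is the per-token word index from a fast tokenizer
    (None for special tokens such as [CLS] and [SEP]).
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first sub-word of a word
        else:
            # later sub-words: ignored unless label_all_tokens is set
            aligned.append(word_labels[word_id] if label_all_tokens else IGNORE_INDEX)
        previous = word_id
    return aligned


# Word 0 is one token, word 1 is split into two sub-word tokens (illustrative split)
word_ids = [None, 0, 1, 1, None]
word_labels = [1, 2]  # hypothetical label ids for the two words
print(align_labels(word_ids, word_labels))
# [-100, 1, 2, -100, -100]
```

In practice `word_ids` would come from `tokenizer(words, is_split_into_words=True, truncation=True, max_length=384).word_ids()`, which requires a fast tokenizer.
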
## Intended Use & Limitations

### Intended Use

- **De-identification**: Automated redaction of PII in clinical notes, medical records, and other documents
- **Compliance**: Supporting compliance with HIPAA, GDPR, and other privacy regulations
- **Data Preprocessing**: Preparing datasets for research by removing sensitive information
- **Audit Support**: Identifying PII in document collections

### Limitations

⚠️ **Important**: This model is intended as an **assistive tool**, not a replacement for human review.

- **False Negatives**: Some PII may not be detected (micro recall is 0.9426); always verify the output in critical applications
- **Context Sensitivity**: Performance may vary with domain-specific terminology
- **Challenging Categories**: `occupation` (F1 0.639) is markedly weaker than the other entity types; `gender`, `sexuality`, `time`, and `fax_number` score between 0.815 and 0.870
- **Language**: Primarily trained on English text

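
Because missed PII is the costly failure mode in de-identification, identifiers with a fixed surface form can be backstopped with simple rules that run alongside the model. The sketch below adds regex matches that the model did not already cover. The patterns, `BACKSTOP_PATTERNS`, and `add_regex_backstop` are illustrative assumptions and not part of the model or its training.

```python
import re

# Illustrative patterns only; real deployments need broader, locale-aware rules.
BACKSTOP_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def add_regex_backstop(text, entities, patterns=BACKSTOP_PATTERNS):
    """Add regex matches that no model-predicted span already overlaps."""
    combined = list(entities)
    for label, pattern in patterns.items():
        for match in pattern.finditer(text):
            start, end = match.span()
            covered = any(e["start"] < end and start < e["end"] for e in entities)
            if not covered:
                combined.append({
                    "entity_group": label,
                    "word": match.group(),
                    "score": 1.0,  # rule-based match, not a model probability
                    "start": start,
                    "end": end,
                })
    return sorted(combined, key=lambda e: e["start"])


text = "SSN 123-45-6789, email jane.doe@hospital.org"
model_entities = [{"entity_group": "email", "word": "jane.doe@hospital.org",
                   "score": 0.98, "start": 23, "end": 44}]
result = add_regex_backstop(text, model_entities)
print([(e["entity_group"], e["start"], e["end"]) for e in result])
# [('ssn', 4, 15), ('email', 23, 44)]
```

The combined list has the same shape as the pipeline output, so it can be passed to a redaction step that works on `start` and `end` offsets. Human review remains necessary for free-text categories that rules cannot capture.
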
## Citation

```bibtex
@misc{openmed-pii-2026,
  title = {OpenMed-PII-BioClinicalBERT-110M-v1: PII Detection Model},
  author = {OpenMed Science},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/openmed/OpenMed-PII-BioClinicalBERT-110M-v1}
}
```

## Links

- **Organization**: [OpenMed](https://huggingface.co/OpenMed)