BERT PII Detection Model

Fine-tuned DistilBERT model for Personally Identifiable Information (PII) detection and classification.

Model Details

  • Base Model: distilbert-base-uncased
  • Task: Token Classification (Named Entity Recognition)
  • Languages: English
  • License: MIT
  • Fine-tuned on: AI4Privacy PII-200k dataset

Supported PII Entity Types

This model can detect 56 different types of PII entities, including the categories below (the full label set can be enumerated from the model configuration, as shown after the lists):

Personal Information:

  • FIRSTNAME, LASTNAME, MIDDLENAME
  • EMAIL, PHONENUMBER, USERNAME
  • DATE, TIME, DOB, AGE

Address Information:

  • STREET, CITY, STATE, COUNTY
  • ZIPCODE, BUILDINGNUMBER
  • SECONDARYADDRESS

Financial Information:

  • CREDITCARDNUMBER, CREDITCARDISSUER, CREDITCARDCVV
  • ACCOUNTNAME, ACCOUNTNUMBER, IBAN, BIC
  • AMOUNT, CURRENCY, CURRENCYCODE, CURRENCYSYMBOL

Identification:

  • SSN, PIN, PASSWORD
  • IP, IPV4, IPV6, MAC
  • ETHEREUMADDRESS, BITCOINADDRESS, LITECOINADDRESS

Professional Information:

  • JOBTITLE, JOBTYPE, JOBAREA, COMPANYNAME
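
The lists above are not exhaustive. Assuming the checkpoint stores its labels in the standard id2label mapping of the model config and uses a BIO tagging scheme (both assumptions about this particular checkpoint), the full set of entity types can be enumerated directly:

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("SoelMgd/bert-pii-detection")

# id2label maps class indices to label strings (e.g. "B-FIRSTNAME", "I-FIRSTNAME", "O").
entity_types = sorted({
    label.split("-", 1)[-1]   # strip the assumed B-/I- prefix
    for label in model.config.id2label.values()
    if label != "O"           # "O" marks non-PII tokens
})
print(len(entity_types), entity_types)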

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load model and tokenizer
model_name = "SoelMgd/bert-pii-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline
ner_pipeline = pipeline(
    "ner", 
    model=model, 
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Example usage
text = "Hi, my name is John Smith and my email is john.smith@company.com"
entities = ner_pipeline(text)
print(entities)
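
With aggregation_strategy="simple", the pipeline merges sub-word tokens back into whole entity spans, so each item in entities is a dict with entity_group, score, word, and start/end character offsets. Continuing the example above (the exact entity_group strings depend on this checkpoint's label names, so treat them as illustrative):

for ent in entities:
    print(f"{ent['entity_group']:>12}  {ent['word']!r}  score={ent['score']:.3f}")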

Training Data

  • Dataset: AI4Privacy PII-200k (loadable as shown after this list)
  • Size: ~209k examples
  • Languages: English, French, German, Italian (this model is fine-tuned on the English subset only)
  • Entity Types: 56 different PII categories
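
For a quick look at the training data, the dataset can be pulled from the Hugging Face Hub; the dataset id below (ai4privacy/pii-masking-200k) and the presence of a train split are assumptions based on the name above:

from datasets import load_dataset

# Dataset id and split name are assumptions, not confirmed by this card.
ds = load_dataset("ai4privacy/pii-masking-200k", split="train")
print(ds)      # row count and column names
print(ds[0])   # one raw example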

Performance

The model achieves good precision and recall on PII detection across the supported entity types.

Intended Use

This model is designed for:

  • PII detection and masking in text (see the masking sketch after this list)
  • Privacy compliance applications
  • Data anonymization pipelines
  • Content moderation systems
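
As a sketch of the masking use case mentioned above: the start/end offsets returned by the aggregated pipeline can be used to replace each detected span with a placeholder. This reuses the ner_pipeline from the Usage section; the [LABEL] placeholder format and the expected output are illustrative, not part of the released model:

def mask_pii(text: str, ner_pipeline) -> str:
    """Replace every detected PII span with a [LABEL] placeholder."""
    entities = ner_pipeline(text)
    # Work from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

masked = mask_pii("Hi, my name is John Smith and my email is john.smith@company.com", ner_pipeline)
print(masked)
# Illustrative output: "Hi, my name is [FIRSTNAME] [LASTNAME] and my email is [EMAIL]"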

Limitations

  • Trained primarily on English text
  • May not generalize to domain-specific jargon
  • Performance may vary on very short or very long texts
  • Should be validated on your specific use case