BERT PII Detection Model

Fine-tuned DistilBERT model for Personally Identifiable Information (PII) detection and classification.

Model Details

  • Base Model: distilbert-base-uncased
  • Task: Token Classification (Named Entity Recognition)
  • Languages: English
  • License: MIT
  • Fine-tuned on: AI4Privacy PII-200k dataset

Supported PII Entity Types

This model can detect 56 different types of PII entities, including the categories below (the full label set can be enumerated from the model configuration, as shown after the lists):

Personal Information:

  • FIRSTNAME, LASTNAME, MIDDLENAME
  • EMAIL, PHONENUMBER, USERNAME
  • DATE, TIME, DOB, AGE

Address Information:

  • STREET, CITY, STATE, COUNTY
  • ZIPCODE, BUILDINGNUMBER
  • SECONDARYADDRESS

Financial Information:

  • CREDITCARDNUMBER, CREDITCARDISSUER, CREDITCARDCVV
  • ACCOUNTNAME, ACCOUNTNUMBER, IBAN, BIC
  • AMOUNT, CURRENCY, CURRENCYCODE, CURRENCYSYMBOL

Identification:

  • SSN, PIN, PASSWORD
  • IP, IPV4, IPV6, MAC
  • ETHEREUMADDRESS, BITCOINADDRESS, LITECOINADDRESS

Professional Information:

  • JOBTITLE, JOBTYPE, JOBAREA, COMPANYNAME
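
The lists above are not exhaustive. Assuming the checkpoint stores its labels in the standard id2label mapping of the model config and uses a BIO tagging scheme (both assumptions about this particular checkpoint), the full set of entity types can be enumerated directly:

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("SoelMgd/bert-pii-detection")

# id2label maps class indices to label strings (e.g. "B-FIRSTNAME", "I-FIRSTNAME", "O").
entity_types = sorted({
    label.split("-", 1)[-1]   # strip the assumed B-/I- prefix
    for label in model.config.id2label.values()
    if label != "O"           # "O" marks non-PII tokens
})
print(len(entity_types), entity_types)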

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load model and tokenizer
model_name = "SoelMgd/bert-pii-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline
ner_pipeline = pipeline(
    "ner", 
    model=model, 
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Example usage
text = "Hi, my name is John Smith and my email is john.smith@company.com"
entities = ner_pipeline(text)
print(entities)
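
With aggregation_strategy="simple", the pipeline merges sub-word tokens back into whole entity spans, so each item in entities is a dict with entity_group, score, word, and start/end character offsets. Continuing the example above (the exact entity_group strings depend on this checkpoint's label names, so treat them as illustrative):

for ent in entities:
    print(f"{ent['entity_group']:>12}  {ent['word']!r}  score={ent['score']:.3f}")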

Training Data

  • Dataset: AI4Privacy PII-200k (loadable as shown after this list)
  • Size: ~209k examples
  • Languages: English, French, German, Italian (this model is fine-tuned on the English subset only)
  • Entity Types: 56 different PII categories
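
For a quick look at the training data, the dataset can be pulled from the Hugging Face Hub; the dataset id below (ai4privacy/pii-masking-200k) and the presence of a train split are assumptions based on the name above:

from datasets import load_dataset

# Dataset id and split name are assumptions, not confirmed by this card.
ds = load_dataset("ai4privacy/pii-masking-200k", split="train")
print(ds)      # row count and column names
print(ds[0])   # one raw example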

Performance

The model achieves good precision and recall on PII detection across the supported entity types.

Intended Use

This model is designed for:

  • PII detection and masking in text (see the masking sketch after this list)
  • Privacy compliance applications
  • Data anonymization pipelines
  • Content moderation systems
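
As a sketch of the masking use case mentioned above: the start/end offsets returned by the aggregated pipeline can be used to replace each detected span with a placeholder. This reuses the ner_pipeline from the Usage section; the [LABEL] placeholder format and the expected output are illustrative, not part of the released model:

def mask_pii(text: str, ner_pipeline) -> str:
    """Replace every detected PII span with a [LABEL] placeholder."""
    entities = ner_pipeline(text)
    # Work from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

masked = mask_pii("Hi, my name is John Smith and my email is john.smith@company.com", ner_pipeline)
print(masked)
# Illustrative output: "Hi, my name is [FIRSTNAME] [LASTNAME] and my email is [EMAIL]"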

Limitations

  • Trained primarily on English text
  • May not generalize to domain-specific jargon
  • Performance may vary on very short or very long texts
  • Should be validated on your specific use case