BERT PII Detection Model
Fine-tuned DistilBERT model for detecting and classifying Personally Identifiable Information (PII).
Model Details
- Base Model: distilbert-base-uncased
- Task: Token Classification (Named Entity Recognition)
- Languages: English
- License: MIT
- Fine-tuned on: AI4Privacy PII-200k dataset (English subset)
Supported PII Entity Types
This model detects 56 PII entity types, including:
Personal Information:
- FIRSTNAME, LASTNAME, MIDDLENAME
- EMAIL, PHONENUMBER, USERNAME
- DATE, TIME, DOB, AGE
Address Information:
- STREET, CITY, STATE, COUNTY
- ZIPCODE, BUILDINGNUMBER
- SECONDARYADDRESS
Financial Information:
- CREDITCARDNUMBER, CREDITCARDISSUER, CREDITCARDCVV
- ACCOUNTNAME, ACCOUNTNUMBER, IBAN, BIC
- AMOUNT, CURRENCY, CURRENCYCODE, CURRENCYSYMBOL
Identification:
- SSN, PIN, PASSWORD
- IP, IPV4, IPV6, MAC
- ETHEREUMADDRESS, BITCOINADDRESS, LITECOINADDRESS
Professional Information:
- JOBTITLE, JOBTYPE, JOBAREA, COMPANYNAME
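The categories above are a selection, not the full set of 56. A minimal sketch for recovering the complete label set from a checkpoint's configuration, assuming the labels live in `config.id2label` (the usual place for Transformers token-classification models). The helper names `entity_types` and `load_entity_types` are hypothetical, and whether this checkpoint uses BIO-style `B-`/`I-` prefixes is an assumption; unprefixed labels pass through unchanged.

```python
def entity_types(id2label):
    """Return the sorted entity types found in an id2label mapping.

    Strips BIO-style prefixes ("B-", "I-") when present and drops the
    outside label "O". The prefix scheme is an assumption about the
    checkpoint; labels without a prefix are kept as they are.
    """
    types = set()
    for label in id2label.values():
        if label == "O":
            continue
        if label[:2] in ("B-", "I-"):
            label = label[2:]
        types.add(label)
    return sorted(types)


def load_entity_types(model_name="SoelMgd/bert-pii-detection"):
    """Download the checkpoint config (needs network access) and list its types."""
    from transformers import AutoConfig  # imported lazily; only needed here

    config = AutoConfig.from_pretrained(model_name)
    return entity_types(config.id2label)
```

Calling `load_entity_types()` should print the label names the model can actually emit, which is the authoritative list to build masking rules against.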
Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
model_name = "SoelMgd/bert-pii-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline; "simple" aggregation merges sub-word tokens into entity spans
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Example usage
text = "Hi, my name is John Smith and my email is john.smith@company.com"
entities = ner_pipeline(text)
print(entities)
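With `aggregation_strategy="simple"`, the pipeline returns a list of dicts with the keys `entity_group`, `score`, `word`, `start`, and `end`. The sketch below groups detections by type and drops low-confidence ones. The helper name `group_entities` and the `min_score` default of 0.5 are illustrative assumptions, not values recommended by the model author. It slices the original text by offsets rather than using `word`, because the uncased tokenizer lowercases the decoded word.

```python
from collections import defaultdict


def group_entities(text, entities, min_score=0.5):
    """Group detected spans by entity type, keeping the original casing.

    `entities` is the list returned by the NER pipeline. Spans scoring
    below `min_score` (an arbitrary illustrative threshold) are skipped.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)
```

For example, `group_entities(text, entities)` on the output above would yield a mapping from entity type to the matched substrings of `text`.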
Training Data
- Dataset: AI4Privacy PII-200k
- Size: ~209k examples
- Languages: English, French, German, Italian (this model: English only)
- Entity Types: 56 different PII categories
Performance
The model is reported to achieve good precision and recall across entity types, but no quantitative metrics (overall or per entity) are published here.
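Since no figures are given, one way to measure precision and recall on your own labeled data is exact-match span scoring, sketched below. This is an assumed evaluation protocol, not the one used by the model author; the function name `span_prf` is hypothetical, and a span counts as correct only if start, end, and label all match.

```python
def span_prf(gold, predicted):
    """Compute exact-match span precision, recall, and F1.

    `gold` and `predicted` are iterables of (start, end, label) tuples.
    Returns a (precision, recall, f1) tuple of floats.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Predicted tuples can be built from pipeline output as `(e["start"], e["end"], e["entity_group"])`. Exact matching is strict; partial-overlap scoring would give higher numbers.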
Intended Use
This model is designed for:
- PII detection and masking in text
- Privacy compliance applications
- Data anonymization pipelines
- Content moderation systems
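A minimal sketch of the masking use case, assuming the entity dicts come from the pipeline shown in Usage (keys `start`, `end`, `entity_group`) and that detected spans do not overlap. The helper name `mask_pii` and the `[LABEL]` placeholder format are illustrative choices.

```python
def mask_pii(text, entities, template="[{label}]"):
    """Replace each detected span with a placeholder naming its type.

    Spans are replaced from the end of the text backwards so that the
    character offsets of earlier spans stay valid. Assumes the spans
    do not overlap.
    """
    masked = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = template.format(label=ent["entity_group"])
        masked = masked[:ent["start"]] + placeholder + masked[ent["end"]:]
    return masked
```

For instance, `mask_pii(text, ner_pipeline(text))` would return the input with each detected span swapped for its entity type in brackets.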
Limitations
- Trained on English text only; other languages are not supported
- May not generalize to domain-specific jargon
- Performance may vary on very short or very long texts
- Should be validated on your specific use case
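On long texts, one practical concern is that DistilBERT accepts at most 512 token positions per input. A possible workaround, sketched below, is to run detection over overlapping character windows and shift the offsets back into the full text. The function name `detect_in_chunks` and the defaults `chunk_chars=1000` and `overlap=100` are assumptions; character counts only roughly track token counts, so tune them for your data.

```python
def detect_in_chunks(text, detect, chunk_chars=1000, overlap=100):
    """Run `detect` over overlapping windows and merge the results.

    `detect` is any callable that maps a string to a list of entity
    dicts with "start", "end", and "entity_group" keys (for example,
    the NER pipeline). Offsets are shifted to refer to the full text,
    and exact duplicates found in overlapping regions are dropped.
    """
    if overlap >= chunk_chars:
        raise ValueError("overlap must be smaller than chunk_chars")
    if not text:
        return []

    seen = set()
    results = []
    step = chunk_chars - overlap
    for offset in range(0, len(text), step):
        chunk = text[offset:offset + chunk_chars]
        for ent in detect(chunk):
            start, end = ent["start"] + offset, ent["end"] + offset
            key = (start, end, ent["entity_group"])
            if key not in seen:
                seen.add(key)
                results.append({**ent, "start": start, "end": end})
        if offset + chunk_chars >= len(text):
            break  # the current window already reached the end of the text
    return sorted(results, key=lambda e: e["start"])
```

Only exact duplicates are removed. An entity cut by a window boundary may still appear once as a fragment and once in full, so the overlap should exceed the longest entity you expect.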