Turkish Privacy Filter PII

Repository: BTX24/turkish-privacy-filter-pii

This repository contains a Turkish-oriented fine-tuned checkpoint of openai/privacy-filter for privacy-focused PII span detection.

The model is designed to detect Turkish privacy-sensitive spans such as Turkish-style person names, e-mail addresses, phone numbers, dates, addresses, account identifiers, IBAN-like values, TCKN-like values, VKN-like values, and secret-like tokens.

Important: This model is a privacy redaction aid. It is not a legal anonymization guarantee.


Turkish Description

This model is a checkpoint fine-tuned from openai/privacy-filter for privacy-focused PII / personal data detection in Turkish.

The model aims to detect, at span level, sensitive fields in Turkish text: person names, e-mail addresses, phone numbers, dates, addresses, account numbers, IBAN-like values, T.C. identity number (TCKN)-like values, tax identification number (VKN)-like values, and secret/token/password-like values.

Important: This model is an aid for privacy-focused data masking and redaction. On its own it does not guarantee legal anonymization or KVKK compliance.


Base Model

Base model:

openai/privacy-filter

OpenAI Privacy Filter is a bidirectional token-classification model for detecting and redacting personally identifiable information (PII) in text. This checkpoint adapts the base model to a Turkish-oriented privacy label space.


Training Dataset

This model was fine-tuned on:

BTX24/turkish-privacy-pii-ner

Dataset summary:

| Metric | Value |
|---|---|
| Dataset | BTX24/turkish-privacy-pii-ner |
| Language | Turkish |
| Dataset type | Synthetic span-based NER / PII detection |
| Total rows | 103,923 |
| Training examples | 83,138 |
| Validation examples | 10,392 |
| Test examples | 10,393 |
| Label classes | 10 privacy labels + O |
| Annotation type | Character-level spans |
| Data type | Fully synthetic |
| Dataset license | CC BY 4.0 |

The training data is fully synthetic and contains no intentionally collected real personal information.
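Because the annotations are character-level spans, a quick sanity check on offsets is cheap to run before training or evaluation. The following is a minimal sketch only: the record layout (`start`, `end`, `label` keys) and the helper name `validate_spans` are assumptions for illustration, not the documented schema of BTX24/turkish-privacy-pii-ner.

```python
from typing import Dict, List

# The ten privacy labels from this card's label space ("O" is not a span label).
SPAN_LABELS = {
    "tckn", "secret", "iban", "vkn", "account_number",
    "private_address", "private_date", "private_phone",
    "private_email", "private_person",
}

def validate_spans(text: str, spans: List[Dict]) -> List[str]:
    """Return a list of problems found in character-level span annotations."""
    problems = []
    previous_end = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        start, end, label = span["start"], span["end"], span["label"]
        if label not in SPAN_LABELS:
            problems.append(f"unknown label: {label}")
        if not (0 <= start < end <= len(text)):
            problems.append(f"out-of-range span: {start}-{end}")
            continue
        if start < previous_end:
            problems.append(f"overlapping span: {start}-{end}")
        if text[start:end] != text[start:end].strip():
            problems.append(f"span has leading/trailing whitespace: {start}-{end}")
        previous_end = max(previous_end, end)
    return problems

# Hypothetical record; offsets are computed from the text rather than hard-coded.
text = "Ahmet Yılmaz için telefon numarası 0532 000 00 00 olarak kaydedildi."
name, phone = "Ahmet Yılmaz", "0532 000 00 00"
spans = [
    {"start": text.index(name), "end": text.index(name) + len(name), "label": "private_person"},
    {"start": text.index(phone), "end": text.index(phone) + len(phone), "label": "private_phone"},
]
print(validate_spans(text, spans))  # an empty list means no problems were found
```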


Label Space

The fine-tuned checkpoint uses the following Turkish privacy label space:

{
  "category_version": "tr_privacy_v1",
  "span_class_names": [
    "O",
    "tckn",
    "secret",
    "iban",
    "vkn",
    "account_number",
    "private_address",
    "private_date",
    "private_phone",
    "private_email",
    "private_person"
  ]
}
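A label space in this form can be turned into lookup tables with a few lines of Python. This is a sketch under one explicit assumption: that the position of each name in `span_class_names` can serve as its class index. The card does not state how the checkpoint maps these names to internal tag ids, so treat the indices as illustrative.

```python
import json

# The label space exactly as listed in this card.
label_space = json.loads("""
{
  "category_version": "tr_privacy_v1",
  "span_class_names": [
    "O", "tckn", "secret", "iban", "vkn", "account_number",
    "private_address", "private_date", "private_phone",
    "private_email", "private_person"
  ]
}
""")

# Assumption: list position is used as the class index.
label2id = {name: index for index, name in enumerate(label_space["span_class_names"])}
id2label = {index: name for name, index in label2id.items()}

privacy_labels = [name for name in label2id if name != "O"]
print(len(label2id), "classes,", len(privacy_labels), "privacy labels")
```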

Label Descriptions

| Label | Description |
|---|---|
| O | Outside / non-PII token |
| tckn | Synthetic Turkish national identity-like values |
| secret | Synthetic passwords, OTPs, API-key-like strings, tokens, recovery codes |
| iban | Synthetic Turkish IBAN-like values |
| vkn | Synthetic Turkish tax identification-like values |
| account_number | Customer numbers, account identifiers, reference codes, membership IDs, ticket/order references |
| private_address | Synthetic Turkish-style address expressions |
| private_date | Privacy-relevant date expressions |
| private_phone | Turkish mobile-number-like synthetic phone numbers |
| private_email | Synthetic non-routable e-mail addresses |
| private_person | Synthetic Turkish-style person names |

Training Summary

The model was fine-tuned for Turkish privacy-oriented span detection.

| Metric | Value |
|---|---|
| Best epoch | 3 |
| Best metric | validation_loss |
| Best validation loss | 0.002157915852249276 |
| Number of training examples | 83,138 |
| Number of validation examples | 10,392 |
| Span class names | 11 classes including O |

Epoch Metrics

| Epoch | Train Loss | Train Token Accuracy | Validation Loss | Validation Token Accuracy |
|---|---|---|---|---|
| 1 | 0.038908 | 0.990768 | 0.003961 | 0.999100 |
| 2 | 0.002714 | 0.999463 | 0.002181 | 0.999583 |
| 3 | 0.001393 | 0.999704 | 0.002158 | 0.999641 |

The best checkpoint was selected at epoch 3 based on validation loss.
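The selection rule can be reproduced from the reported per-epoch numbers. The snippet below is only an illustration of "pick the epoch with the lowest validation loss"; the dictionary keys are hypothetical and the values are copied from the epoch metrics above.

```python
# Per-epoch metrics as reported in this card (key names are illustrative).
epochs = [
    {"epoch": 1, "train_loss": 0.038908, "validation_loss": 0.003961},
    {"epoch": 2, "train_loss": 0.002714, "validation_loss": 0.002181},
    {"epoch": 3, "train_loss": 0.001393, "validation_loss": 0.002158},
]

# Select the checkpoint whose validation loss is smallest.
best = min(epochs, key=lambda row: row["validation_loss"])
print(best["epoch"], best["validation_loss"])
```

Note that the gap between epochs 2 and 3 is small (0.002181 versus 0.002158), so the choice is sensitive to small changes in the validation set.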


Test Evaluation

Evaluation was performed on the synthetic Turkish test split.

| Metric | Value |
|---|---|
| Test examples | 10,393 |
| Test tokens | 241,020 |
| Eval mode | typed |
| Loss | 0.0028 |
| Token accuracy | 0.9996 |
| Inference tokens/sec | 3027.20 |

Detection Metrics

| Metric | Value |
|---|---|
| Detection precision | 0.9998 |
| Detection recall | 0.9996 |
| Detection F1 | 0.9997 |
| Detection F2 | 0.9996 |
| Span precision | 0.9988 |
| Span recall | 0.9978 |
| Span F1 | 0.9983 |
| Span F2 | 0.9980 |
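The F1 and F2 values follow from precision and recall through the standard F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta = 2 weights recall more heavily than precision. As a worked check, the sketch below recomputes the span-level scores from the reported precision and recall; the function name `f_beta` is just a local helper.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Span-level precision and recall as reported in this card.
span_precision, span_recall = 0.9988, 0.9978

span_f1 = round(f_beta(span_precision, span_recall, beta=1.0), 4)
span_f2 = round(f_beta(span_precision, span_recall, beta=2.0), 4)
print(span_f1, span_f2)
```

Because the reported precision and recall are themselves rounded to four decimals, recomputed scores for other rows can differ from the card in the last digit.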

Per-Class Span Metrics

| Label | Precision | Recall | F1 | F2 |
|---|---|---|---|---|
| tckn | 1.0000 | 0.9990 | 0.9995 | 0.9992 |
| secret | 0.9990 | 1.0000 | 0.9995 | 0.9998 |
| iban | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| vkn | 0.9990 | 0.9990 | 0.9990 | 0.9990 |
| account_number | 1.0000 | 0.9992 | 0.9996 | 0.9993 |
| private_address | 0.9980 | 0.9940 | 0.9960 | 0.9948 |
| private_date | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| private_phone | 0.9991 | 1.0000 | 0.9995 | 0.9998 |
| private_email | 1.0000 | 0.9902 | 0.9951 | 0.9922 |
| private_person | 0.9928 | 0.9959 | 0.9943 | 0.9953 |

These scores are measured on a synthetic test split. Real-world performance may differ, especially on noisy user text, OCR output, mixed-language documents, informal spelling, or domain-specific records.


Intended Use

This checkpoint is intended for research and development in Turkish privacy-preserving NLP.

Recommended use cases:

  • Turkish PII detection
  • Turkish privacy-oriented NER
  • Character-level privacy span detection
  • Data redaction and masking
  • Privacy-aware preprocessing pipelines
  • Fine-tuning / evaluating OpenAI Privacy Filter-style models on Turkish data
  • Academic NLP projects
  • Controlled experiments with synthetic Turkish privacy data
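For the redaction and masking use case, detected spans still have to be applied to the text. The sketch below assumes spans are available as `(start, end, label)` character offsets; that tuple layout and the `[LABEL]` placeholder style are illustrative choices, not the output format of the OPF tooling.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label) in character offsets

def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with a [LABEL] placeholder, scanning left to right."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already redacted.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label.upper()}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

sentence = "Fatura için VKN 1234567890 girildi."
value = "1234567890"
start = sentence.index(value)
print(redact(sentence, [(start, start + len(value), "vkn")]))
```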

Out-of-Scope Use

This model should not be used as:

  • A legal compliance guarantee
  • A complete anonymization system by itself
  • A guarantee that all sensitive information will be detected
  • A replacement for human privacy review
  • A production privacy safeguard without further evaluation
  • A detector for real-world high-risk documents without additional domain testing

For production or high-risk use cases, evaluate the model on in-domain Turkish data and combine it with rule-based checks, human review, logging safeguards, and privacy-by-design processes.
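One way to combine the model with rule-based checks is to take the union of the spans each detector reports, so that a miss by one source can be covered by the other. The following sketch merges untyped `(start, end)` spans; the helper name `merge_spans` and the idea of discarding labels before redaction are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

def merge_spans(*span_lists: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Union of (start, end) spans from several detectors.

    Overlapping or directly adjacent spans are merged into one span.
    """
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(span for spans in span_lists for span in spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

model_spans = [(0, 5), (10, 15)]   # e.g. spans reported by the model
rule_spans = [(3, 8), (20, 25)]    # e.g. spans reported by regex rules
print(merge_spans(model_spans, rule_spans))
```

Taking the union favors recall over precision, which matches the conservative posture recommended for high-risk use.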


Local Usage

1. Install OpenAI Privacy Filter

git clone https://github.com/openai/privacy-filter.git
cd privacy-filter
pip install -e .

2. Download this checkpoint

python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='BTX24/turkish-privacy-filter-pii', local_dir='tr_privacy_filter_pii')"

3. Run inference with OPF CLI

For CUDA:

opf --checkpoint ./tr_privacy_filter_pii --device cuda --format json "Mehmet Kaya TCKN 12345678901"

For CPU:

opf --checkpoint ./tr_privacy_filter_pii --device cpu --format json "Mehmet Kaya TCKN 12345678901"

Example Turkish inputs:

opf --checkpoint ./tr_privacy_filter_pii --device cuda --format json "Ahmet Yılmaz için telefon numarası 0532 000 00 00 olarak kaydedildi."
opf --checkpoint ./tr_privacy_filter_pii --device cuda --format json "İade için IBAN TR00 0000 0000 0000 0000 0000 00 bilgisi girildi."
opf --checkpoint ./tr_privacy_filter_pii --device cuda --format json "Doğrulama kodu OTP-482193 destek kaydına yazılmış."
opf --checkpoint ./tr_privacy_filter_pii --device cuda --format json "Fatura için VKN 1234567890 ve referans kodu REF-TR-000918 girildi."
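The same CLI calls can be driven from Python when several inputs need to be processed. This sketch uses only the flags shown above (`--checkpoint`, `--device`, `--format json`). It assumes the `opf` command is on the PATH and writes a single JSON document to stdout; the structure of that JSON is not documented in this card, so the result is returned unparsed beyond `json.loads`. The helper names `build_opf_command` and `run_opf` are hypothetical.

```python
import json
import subprocess
from typing import Any, List

def build_opf_command(text: str,
                      checkpoint: str = "./tr_privacy_filter_pii",
                      device: str = "cpu") -> List[str]:
    """Build the argument list for one opf call, mirroring the CLI examples."""
    return ["opf", "--checkpoint", checkpoint, "--device", device,
            "--format", "json", text]

def run_opf(text: str, **options: str) -> Any:
    """Run opf on one input and parse stdout as JSON (an assumption, see above)."""
    completed = subprocess.run(
        build_opf_command(text, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if opf exits with an error
    )
    return json.loads(completed.stdout)

command = build_opf_command("Mehmet Kaya TCKN 12345678901")
print(command)
```

Passing the input as a list element rather than through a shell string avoids quoting problems with Turkish characters and spaces.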

Python Download Example

from huggingface_hub import snapshot_download

checkpoint_dir = snapshot_download(
    repo_id="BTX24/turkish-privacy-filter-pii",
    local_dir="tr_privacy_filter_pii",
)

print(checkpoint_dir)

Training / Fine-tuning Notebook

This repository includes the training notebook:

privacy_filter_tr_pii_colab.ipynb

The notebook documents the Turkish fine-tuning workflow, which may include steps such as:

  • Loading the Turkish privacy PII dataset
  • Preparing the Turkish privacy label space
  • Validating character-level spans
  • Preparing train/validation/test files
  • Running OpenAI Privacy Filter fine-tuning
  • Saving model artifacts
  • Testing inference examples
  • Exporting checkpoint files for Hugging Face upload
  • Evaluating the fine-tuned checkpoint on the synthetic test split

Artifacts

Expected repository artifacts:

.
├── README.md
├── config.json
├── model.safetensors
├── finetune_summary.json
├── USAGE.txt
├── label_space.json
└── privacy_filter_tr_pii_colab.ipynb

Artifact Descriptions

| File | Description |
|---|---|
| config.json | Model/configuration file |
| model.safetensors | Fine-tuned model checkpoint weights |
| finetune_summary.json | Fine-tuning summary and metadata |
| USAGE.txt | Basic usage notes |
| label_space.json | Turkish privacy label space |
| privacy_filter_tr_pii_colab.ipynb | Training / fine-tuning notebook |
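After downloading, it can be useful to confirm that the expected files are actually present. The check below is a small sketch that uses the artifact names listed above; the helper name `missing_artifacts` and the default directory name are illustrative.

```python
from pathlib import Path
from typing import List

# File names taken from the expected repository artifacts in this card.
EXPECTED_ARTIFACTS = [
    "README.md",
    "config.json",
    "model.safetensors",
    "finetune_summary.json",
    "USAGE.txt",
    "label_space.json",
    "privacy_filter_tr_pii_colab.ipynb",
]

def missing_artifacts(checkpoint_dir: str) -> List[str]:
    """Return the expected file names that are not present in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]

missing = missing_artifacts("tr_privacy_filter_pii")
print("missing:", missing if missing else "none")
```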

Suggested Evaluation

Recommended metrics:

  • Entity-level precision
  • Entity-level recall
  • Entity-level F1
  • Typed span F1
  • Untyped span F1
  • Per-label F1
  • Boundary error analysis
  • False positive analysis
  • False negative analysis
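The difference between typed and untyped span scores can be made concrete with exact-match counting. This is a minimal single-document sketch, assuming spans are `(start, end, label)` tuples; it is not necessarily how the scores in this card were computed.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

def span_prf(gold: Iterable[Span], pred: Iterable[Span], typed: bool = True):
    """Exact-match precision, recall, and F1 over span sets.

    With typed=False the label is ignored, so only boundaries must match.
    """
    if typed:
        gold_set, pred_set = set(gold), set(pred)
    else:
        gold_set = {(start, end) for start, end, _ in gold}
        pred_set = {(start, end) for start, end, _ in pred}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = [(0, 12, "private_person"), (35, 49, "private_phone")]
pred = [(0, 12, "private_person"), (35, 49, "account_number")]  # one wrong label
print(span_prf(gold, pred, typed=True))
print(span_prf(gold, pred, typed=False))
```

A large gap between the two scores points to label confusion, while low untyped scores point to boundary errors or missed spans.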

Recommended comparison baselines:

  • Original openai/privacy-filter without Turkish fine-tuning
  • Regex-based detector for structured fields such as phone, e-mail, IBAN, TCKN-like values and VKN-like values
  • Turkish BERT-style NER model
  • This fine-tuned Turkish Privacy Filter PII checkpoint
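A regex baseline for the structured fields can be sketched in a few lines. The patterns below are deliberately simple assumptions about common Turkish formats (11-digit TCKN-like values, 10-digit VKN-like values, TR-prefixed IBANs, mobile numbers starting with 5); they are not taken from the dataset generator and will both over-match and under-match on real text.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; labels reuse names from this card's label space.
PATTERNS = {
    "private_email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "iban": re.compile(r"\bTR\d{2}(?: ?\d{4}){5} ?\d{2}\b"),
    "tckn": re.compile(r"\b[1-9]\d{10}\b"),
    "vkn": re.compile(r"\b\d{10}\b"),
    "private_phone": re.compile(r"\b0?5\d{2} ?\d{3} ?\d{2} ?\d{2}\b"),
}

def regex_spans(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) for every pattern match, sorted by position.

    A 10-digit string starting with 5 matches both "vkn" and "private_phone";
    this sketch reports both and leaves disambiguation to the caller.
    """
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label))
    return sorted(found)

invoice = "Fatura için VKN 1234567890 ve referans kodu REF-TR-000918 girildi."
refund = "İade için IBAN TR00 0000 0000 0000 0000 0000 00 bilgisi girildi."
for sample in (invoice, refund):
    print([(sample[s:e], label) for s, e, label in regex_spans(sample)])
```

Note that the reference code REF-TR-000918 is not matched by any pattern, which illustrates why free-form labels such as account_number, private_person, and private_address need a learned model rather than rules.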

Limitations

This checkpoint was fine-tuned on synthetic Turkish privacy data. Therefore:

  • It may not fully capture noisy real-world Turkish text.
  • It may overfit to synthetic templates.
  • It may miss rare PII formats not represented in the dataset.
  • It may produce false positives for numeric or code-like strings.
  • It may struggle with long documents, OCR errors, mixed-language text, informal spelling, or domain-specific formats.
  • It should be evaluated on in-domain data before production use.
  • It should not be treated as a legal anonymization guarantee.
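For the long-document limitation, one common workaround is to split the text into overlapping windows, run detection on each window, and shift the resulting spans back to document offsets. The sketch below works on characters; the window size and overlap are placeholder values, since this card does not state the model's context limit, and the `(start, end, label)` span layout is an assumption.

```python
from typing import List, Tuple

def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> List[Tuple[int, str]]:
    """Split text into (offset, chunk) windows that overlap by `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for offset in range(0, max(len(text), 1), step):
        chunks.append((offset, text[offset:offset + size]))
        if offset + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

def shift_spans(offset: int, spans: List[Tuple[int, int, str]]) -> List[Tuple[int, int, str]]:
    """Convert chunk-relative (start, end, label) spans to document offsets."""
    return [(start + offset, end + offset, label) for start, end, label in spans]

document = "a" * 2500
print([offset for offset, _ in chunk_text(document)])
print(shift_spans(900, [(5, 10, "vkn")]))
```

Spans found twice in the overlap region need deduplication afterwards, and the overlap should be at least as long as the longest expected span so that no entity is cut in every window.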

Ethical Considerations

The training dataset is synthetic and was designed to avoid intentionally collecting or distributing real personal information.

However, a model trained on synthetic data can still make mistakes. It may fail to detect sensitive information or incorrectly classify non-sensitive text as PII. For sensitive applications, this model should be used as one layer in a broader privacy-preserving pipeline, not as the only safeguard.

Recommended safeguards:

  • Human review for high-risk workflows
  • Domain-specific evaluation
  • Regex/rule-based checks for structured identifiers
  • Logging and monitoring of redaction failures
  • Conservative handling of uncertain predictions
  • Regular evaluation on newly observed data distributions
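For structured identifiers, checksum rules can complement pattern matching. The sketch below implements the publicly known TCKN check-digit rule and the generic IBAN mod-97 check. It is meant as corroborating evidence only: the synthetic "-like" values in this card are not stated to carry valid checksums, so a failed check should not be used to skip redaction.

```python
def tckn_checksum_ok(value: str) -> bool:
    """Check the two check digits of an 11-digit TCKN candidate."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()) or value[0] == "0":
        return False
    digits = [int(ch) for ch in value]
    # 10th digit: (sum of 1st,3rd,5th,7th,9th) * 7 minus (sum of 2nd,4th,6th,8th), mod 10.
    tenth = (sum(digits[0:9:2]) * 7 - sum(digits[1:8:2])) % 10
    # 11th digit: sum of the first ten digits, mod 10.
    eleventh = sum(digits[:10]) % 10
    return digits[9] == tenth and digits[10] == eleventh

def iban_checksum_ok(value: str) -> bool:
    """Generic IBAN mod-97 check; does not validate country-specific length."""
    compact = value.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35, as the check requires.
    number = int("".join(str(int(ch, 36)) for ch in rearranged))
    return number % 97 == 1

print(tckn_checksum_ok("12345678901"))                       # example value from this card
print(iban_checksum_ok("TR00 0000 0000 0000 0000 0000 00"))  # example value from this card
```

Both example values from this card fail their checksums, which is expected for placeholder data and shows why the checks are better suited to ranking or logging than to filtering.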

License

This fine-tuned checkpoint is released under the Apache License 2.0.

The base model is openai/privacy-filter.

The training dataset BTX24/turkish-privacy-pii-ner is released under Creative Commons Attribution 4.0 International (CC BY 4.0).

Please review the licenses of the base model, dataset, and this fine-tuned checkpoint before use.


Citation

If you use this model, please cite:

@misc{toktay_2026_turkish_privacy_filter_pii,
  title        = {Turkish Privacy Filter PII},
  author       = {Boran Toktay},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/BTX24/turkish-privacy-filter-pii}},
  note         = {Fine-tuned OpenAI Privacy Filter checkpoint for Turkish PII span detection}
}

If you use the training dataset, please also cite:

@dataset{toktay_2026_turkish_privacy_pii_ner,
  title        = {Turkish Privacy PII NER Dataset},
  author       = {Boran Toktay},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/BTX24/turkish-privacy-pii-ner}},
  note         = {Synthetic Turkish privacy-oriented named entity recognition dataset for PII detection}
}

Contact

For questions, issues, or suggestions, please open an issue or discussion on the model repository:

BTX24/turkish-privacy-filter-pii