gaodrew committed on
Commit
3b18682
1 Parent(s): bb9913f

Update README.md

Files changed (1)
  1. README.md +42 -28
README.md CHANGED
@@ -29,33 +29,24 @@ language:
29
  pipeline_tag: token-classification
30
  ---
31
 
32
-
33
-
34
  # piiranha-v1
35
  Piiranha is trained to detect 17 types of Personally Identifiable Information (PII) across six languages. It successfully catches 98.27% of PII tokens, with an overall classification accuracy of 99.44%.
 
36
 
37
  Supported languages: English, Spanish, French, German, Italian, Dutch
38
  Supported PII types: Account Number, Building Number, City, Credit Card Number, Date of Birth, Driver's License, Email, First Name, Last Name, ID Card, Password, Social Security Number, Street Address, Tax Number, Phone Number, Username, Zipcode.
39
 
40
- ACCOUNTNUM 0.84 0.87 0.85 3575
41
- BUILDINGNUM 0.92 0.90 0.91 3252
42
- CITY 0.95 0.97 0.96 7270
43
- CREDITCARDNUMBER 0.94 0.96 0.95 2308
44
- DATEOFBIRTH 0.93 0.85 0.89 3389
45
- DRIVERLICENSENUM 0.96 0.96 0.96 2244
46
- EMAIL 1.00 1.00 1.00 6892
47
- GIVENNAME 0.87 0.93 0.90 12150
48
- IDCARDNUM 0.89 0.94 0.91 3700
49
- PASSWORD 0.98 0.98 0.98 2387
50
- SOCIALNUM 0.93 0.94 0.93 2709
51
- STREET 0.97 0.95 0.96 3331
52
- SURNAME 0.89 0.78 0.83 8267
53
- TAXNUM 0.97 0.89 0.93 2322
54
- TELEPHONENUM 0.99 1.00 0.99 5039
55
- USERNAME 0.98 0.98 0.98 7680
56
- ZIPCODE 0.94 0.97 0.95 3191
57
-
58
- It is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base).
59
  It achieves the following results on a test set of ~73,000 sentences containing PII:
60
  - Accuracy: 99.44%
61
  - Loss: 0.0173
@@ -63,18 +54,41 @@ It achieves the following results on a test set of ~73,000 sentences containing
63
  - Recall: 93.08%
64
  - F1: 93.12%
65
 
66
- ## Model description
67
-
68
- More information needed
69
 
70
  ## Intended uses & limitations
71
 
72
- More information needed
73
-
74
  ## Training and evaluation data
75
 
76
- More information needed
77
-
78
  ## Training procedure
79
 
80
  ### Training hyperparameters
 
29
  pipeline_tag: token-classification
30
  ---
31
 
 
 
32
  # piiranha-v1
33
  Piiranha is trained to detect 17 types of Personally Identifiable Information (PII) across six languages. It successfully catches 98.27% of PII tokens, with an overall classification accuracy of 99.44%.
34
+ Piiranha is especially accurate at detecting passwords, emails, phone numbers, and usernames: each of these types has an F1-score of 0.98 or higher, and emails round to 1.00.
35
 
36
  Supported languages: English, Spanish, French, German, Italian, Dutch
37
  Supported PII types: Account Number, Building Number, City, Credit Card Number, Date of Birth, Driver's License, Email, First Name, Last Name, ID Card, Password, Social Security Number, Street Address, Tax Number, Phone Number, Username, Zipcode.
38
 
39
+ Performance on the PII vs. non-PII (binary) classification task:
40
+ **Precision: 98.48%** (98.48% of tokens classified as PII are actually PII)
41
+ **Recall: 98.27%** (correctly identifies 98.27% of PII tokens)
42
+ **Specificity: 99.84%** (correctly identifies 99.84% of non-PII tokens)
43
+
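The three binary metrics above can be expressed in terms of token-level confusion counts. The following is a minimal sketch: the `binary_metrics` helper and the counts passed to it are hypothetical, chosen only for illustration, and are not Piiranha's actual confusion matrix.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Precision, recall and specificity for a PII vs. non-PII token classifier.

    tp: PII tokens flagged as PII        fp: non-PII tokens flagged as PII
    fn: PII tokens missed                tn: non-PII tokens left alone
    """
    return {
        "precision": tp / (tp + fp),    # share of flagged tokens that really are PII
        "recall": tp / (tp + fn),       # share of PII tokens that were flagged
        "specificity": tn / (tn + fp),  # share of non-PII tokens left unflagged
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fp=10, fn=30, tn=870)
print(metrics)
```

For redaction, recall is usually the number to watch most closely, since a missed PII token (a false negative) leaks data while a false positive only over-redacts.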
44
+ <img src="https://cloud-3i4ld6u5y-hack-club-bot.vercel.app/0home.png" alt="Akash Network logo" width="400"/>
45
+
46
+ Piiranha was trained on an H100 GPU rented through the [Akash Network](https://akash.network/).
47
+
48
+ ## Model description
49
+ Piiranha is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base).
50
  It achieves the following results on a test set of ~73,000 sentences containing PII:
51
  - Accuracy: 99.44%
52
  - Loss: 0.0173
 
54
  - Recall: 93.08%
55
  - F1: 93.12%
56
 
57
+ Note that these metrics are computed over all eighteen categories (17 PII types and 1 non-PII class), so they are lower than the metrics for binary PII vs. non-PII classification.
58
+
59
+ ## Performance by PII type
60
+ The per-type metrics below are lower than the overall accuracy of 99.44% because of class imbalance: most tokens are not PII, so correctly labeled non-PII tokens dominate the accuracy figure.
61
+ However, the model is more useful for PII detection than the results below suggest. It sometimes mistakes one PII type for another, but it still recognizes the token as PII.
62
+ For instance, the model often confuses first names with last names, which is acceptable for redaction because the name is still flagged as PII.
63
+
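To illustrate why confusing one PII type with another matters less for detection, the sketch below collapses per-type labels into PII vs. non-PII before scoring. The label names, the use of `"O"` for non-PII tokens, and the tiny example sequences are assumptions made for illustration; they are not taken from the model's actual label set or test data.

```python
def to_binary(label: str) -> str:
    """Collapse any PII type into a single 'PII' class; 'O' is assumed to mean non-PII."""
    return "NON_PII" if label == "O" else "PII"


# Hypothetical gold and predicted labels for five tokens.
gold = ["GIVENNAME", "SURNAME", "O", "EMAIL", "O"]
pred = ["GIVENNAME", "GIVENNAME", "O", "EMAIL", "O"]  # surname mistaken for a given name

# Per-type scoring counts the surname/given-name confusion as an error.
multiclass_acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Binary scoring does not, because the token is still flagged as PII.
binary_acc = sum(to_binary(g) == to_binary(p) for g, p in zip(gold, pred)) / len(gold)

print(multiclass_acc, binary_acc)
```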
64
+ | Entity | Precision | Recall | F1-Score | Support |
65
+ |---------------------|-----------|--------|----------|---------|
66
+ | ACCOUNTNUM | 0.84 | 0.87 | 0.85 | 3575 |
67
+ | BUILDINGNUM | 0.92 | 0.90 | 0.91 | 3252 |
68
+ | CITY | 0.95 | 0.97 | 0.96 | 7270 |
69
+ | CREDITCARDNUMBER | 0.94 | 0.96 | 0.95 | 2308 |
70
+ | DATEOFBIRTH | 0.93 | 0.85 | 0.89 | 3389 |
71
+ | DRIVERLICENSENUM | 0.96 | 0.96 | 0.96 | 2244 |
72
+ | EMAIL | 1.00 | 1.00 | 1.00 | 6892 |
73
+ | GIVENNAME | 0.87 | 0.93 | 0.90 | 12150 |
74
+ | IDCARDNUM | 0.89 | 0.94 | 0.91 | 3700 |
75
+ | PASSWORD | 0.98 | 0.98 | 0.98 | 2387 |
76
+ | SOCIALNUM | 0.93 | 0.94 | 0.93 | 2709 |
77
+ | STREET | 0.97 | 0.95 | 0.96 | 3331 |
78
+ | SURNAME | 0.89 | 0.78 | 0.83 | 8267 |
79
+ | TAXNUM | 0.97 | 0.89 | 0.93 | 2322 |
80
+ | TELEPHONENUM | 0.99 | 1.00 | 0.99 | 5039 |
81
+ | USERNAME | 0.98 | 0.98 | 0.98 | 7680 |
82
+ | ZIPCODE | 0.94 | 0.97 | 0.95 | 3191 |
83
+ | **micro avg** | 0.93 | 0.93 | 0.93 | 79706 |
84
+ | **macro avg** | 0.94 | 0.93 | 0.93 | 79706 |
85
+ | **weighted avg** | 0.93 | 0.93 | 0.93 | 79706 |
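As a sanity check on the averaged rows, the macro and weighted F1-scores can be approximately recomputed from the per-entity rows of the table. This is a sketch under one caveat: the inputs are the rounded two-decimal values shown in the table, so the results only agree with the reported averages after rounding.

```python
# (entity, F1-score, support) copied from the per-entity rows of the table.
rows = [
    ("ACCOUNTNUM", 0.85, 3575),
    ("BUILDINGNUM", 0.91, 3252),
    ("CITY", 0.96, 7270),
    ("CREDITCARDNUMBER", 0.95, 2308),
    ("DATEOFBIRTH", 0.89, 3389),
    ("DRIVERLICENSENUM", 0.96, 2244),
    ("EMAIL", 1.00, 6892),
    ("GIVENNAME", 0.90, 12150),
    ("IDCARDNUM", 0.91, 3700),
    ("PASSWORD", 0.98, 2387),
    ("SOCIALNUM", 0.93, 2709),
    ("STREET", 0.96, 3331),
    ("SURNAME", 0.83, 8267),
    ("TAXNUM", 0.93, 2322),
    ("TELEPHONENUM", 0.99, 5039),
    ("USERNAME", 0.98, 7680),
    ("ZIPCODE", 0.95, 3191),
]

total_support = sum(support for _, _, support in rows)

# Macro average: every entity type counts equally, regardless of frequency.
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)

# Weighted average: each entity type counts in proportion to its support.
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
```

The micro average cannot be recovered this way, because it needs the raw true-positive, false-positive, and false-negative counts rather than per-entity scores.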
86
 
87
  ## Intended uses & limitations
88
 
89
+ Piiranha can be used to assist with redacting PII from texts. Use at your own risk. We do not accept responsibility for any incorrect model predictions.
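A minimal sketch of the redaction step, assuming the detected entities are already available as character spans. The `redact` helper is hypothetical, and the spans below are hand-written stand-ins for model output; their keys (`entity_group`, `start`, `end`) mirror what the Hugging Face `transformers` token-classification pipeline returns when `aggregation_strategy="simple"` is used.

```python
def redact(text: str, entities: list[dict]) -> str:
    """Replace each detected span with a [ENTITY_TYPE] placeholder."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


text = "Contact Jane at jane@example.com"

# Hand-written spans standing in for model predictions (illustrative only).
entities = [
    {"entity_group": "GIVENNAME", "start": 8, "end": 12},
    {"entity_group": "EMAIL", "start": 16, "end": 32},
]

redacted = redact(text, entities)
print(redacted)
```

Because the model can miss PII tokens, redacted output should still be reviewed before it is treated as safe to share.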
 
90
  ## Training and evaluation data
91
 
 
 
92
  ## Training procedure
93
 
94
  ### Training hyperparameters