base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: token-classification
---

# Neural Wave - Hackathon 2024 - Lugano

This repository contains the code produced by the `Molise.ai` team in the Neural Wave Hackathon 2024 competition in Lugano.

## Challenge

Here is a brief explanation of the challenge:

The challenge was proposed by **Ai4Privacy**, a company that builds global solutions to enhance **privacy protections** in the rapidly evolving world of **Artificial Intelligence**.

The goal of the challenge is to create a machine learning model capable of detecting and masking **PII** (Personally Identifiable Information) in text data across several languages and locales. The task requires working with a synthetic dataset to train models that can automatically identify and redact **17 types of PII** in natural language texts. The solution should aim for high accuracy while maintaining the **usability** of the underlying data.

The final solution could be integrated into various systems to enhance privacy protections across industries, including client support, legal, and general data anonymization tools. Success in this project will contribute to scaling privacy-conscious AI systems without compromising UX or operational performance.
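
To make the task concrete, here is a minimal, illustrative sketch of PII masking treated as token classification with the `transformers` pipeline. The checkpoint path is a placeholder for a model fine-tuned with the scripts in this repository (the base model is `distilbert/distilbert-base-multilingual-cased`), and the entity labels depend on the tag set used during training.

```python
from transformers import pipeline

# Placeholder path: point this at a checkpoint produced by the fine-tuning scripts.
MODEL_PATH = "path/to/finetuned-checkpoint"

# aggregation_strategy="simple" merges word pieces into whole entity spans.
ner = pipeline("token-classification", model=MODEL_PATH, aggregation_strategy="simple")

text = "Hi, I'm John Smith and my email is john.smith@example.com."
entities = ner(text)

# Replace each detected span with its label, working right-to-left so offsets stay valid.
masked = text
for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
    masked = masked[: ent["start"]] + f"[{ent['entity_group']}]" + masked[ent["end"]:]

print(masked)
```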

## Getting Started

Create a `.env` file: copy the `.env.example` file, rename it to `.env`, and fill in the required values.

```bash
cp .env.example .env
```

### Install the dependencies

```bash
pip install -r requirements.txt
```

### Set `PYTHONPATH` if needed

```bash
export PYTHONPATH="${PYTHONPATH}:$PWD"
```

## Inference

### Inference on the full dataset

You can run inference on the complete test dataset using the following command:

```bash
python inference.py -s ./dataset/test
```

### Inference on a small dataset

To perform inference on a small subset of the dataset, use the `--subsample` flag:

```bash
python inference.py -s ./dataset/test --subsample
```

## Run the UI

To run the UI for interacting with the models and viewing results, use Streamlit:

```bash
streamlit run ui.py
```

## Run the API

To start the API for the model, you'll need FastAPI. Run the following command:

```bash
fastapi run api.py
```
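
Once the server is running you can query it over HTTP. The snippet below is only a sketch: the route name, payload shape, and port are assumptions, so check `api.py` for the actual endpoint definitions.

```python
import requests

# Hypothetical endpoint and payload; adjust to match the routes defined in api.py.
resp = requests.post(
    "http://localhost:8000/mask",
    json={"text": "Hi, I'm John Smith and my email is john.smith@example.com."},
)
resp.raise_for_status()
print(resp.json())
```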

## Experiments

This repository supports two main types of experiments:

1. Fine-tuning models from the BERT family.
2. Fine-tuning models from the GLiNER family.

Both experiment types live in the `experiments/` folder, and each fine-tuning script accepts arguments for the model choice, dataset, output directory, and optional alternative dataset columns.

### BERT Fine-Tuning

The BERT fine-tuning script fine-tunes models from the BERT family on a specific dataset. Optionally, you can use alternative columns that are preprocessed during the data preparation phase.

```bash
python experiments/bert_finetune.py --dataset path/to/dataset --model model_name --output_dir /path/to/output [--alternative_columns]
```

#### Available BERT models

Here is a list of BERT-family models that can be used for fine-tuning. Additional models based on the BERT tokenizer may also work with minimal modifications:

- BERT classic
  + `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`
- DistilBERT
  + `distilbert-base-uncased`, `distilbert-base-cased`
- RoBERTa
  + `roberta-base`, `roberta-large`
- ALBERT
  + `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, `albert-xxlarge-v2`
- Electra
  + `google/electra-small-discriminator`, `google/electra-base-discriminator`, `google/electra-large-discriminator`
- DeBERTa
  + `microsoft/deberta-base`, `microsoft/deberta-large`
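
All of these are standard Hugging Face checkpoints, so any of them can be swapped in by name. As a minimal sketch of how such a checkpoint is loaded for token classification, note that the `num_labels` value below is an assumption (a BIO scheme over the 17 PII types, i.e. 2 × 17 + 1); the actual number depends on the tag set built during data preparation.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "distilbert-base-cased"  # any checkpoint from the list above
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assumption: BIO tagging over 17 PII types -> 2 * 17 + 1 = 35 labels.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=35)
```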

### GLiNER Fine-Tuning

The GLiNER models require an additional dataset preparation step before fine-tuning, so the process happens in two stages:

1. **Prepare the dataset for the GLiNER models.** Run the GLiNER dataset preparation script to pre-process your dataset:

   ```bash
   python experiments/gliner_prepare.py --dataset path/to/dataset
   ```

   This will create a new JSON-formatted dataset file with the same name in the specified output directory (a sketch of the prepared record format follows this list).

2. **Fine-tune the GLiNER model.** After the dataset preparation, run the GLiNER fine-tuning script:

   ```bash
   python experiments/gliner_finetune.py --dataset path/to/prepared/dataset.json --model model_name --output_dir /path/to/output [--alternative_columns]
   ```
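
The exact schema written by `gliner_prepare.py` is defined in that script; as a rough, hedged illustration, GLiNER training data is commonly a list of records pairing tokenized text with token-span annotations, along these lines:

```python
import json

# Illustrative record in a span-annotated format commonly used for GLiNER training;
# the exact keys and span convention produced by gliner_prepare.py may differ.
record = {
    "tokenized_text": ["John", "Smith", "lives", "in", "Lugano", "."],
    "ner": [[0, 1, "person"], [4, 4, "city"]],  # [start_token, end_token, label]
}

with open("prepared_dataset.json", "w") as f:
    json.dump([record], f, indent=2)
```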

#### Available GLiNER models

You can use the following GLiNER models for fine-tuning, though additional compatible models may work similarly:

- `gliner-community/gliner_xxl-v2.5`
- `gliner-community/gliner_large-v2.5`
- `gliner-community/gliner_medium-v2.5`
- `gliner-community/gliner_small-v2.5`
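
As a quick way to inspect what these models predict, here is a minimal sketch using the `gliner` package; the entity labels and threshold are illustrative, not the challenge's official tag set:

```python
from gliner import GLiNER

# Load a community checkpoint (or a fine-tuned model from your output_dir).
model = GLiNER.from_pretrained("gliner-community/gliner_small-v2.5")

text = "Maria Rossi lives at Via Nassa 5, Lugano."
labels = ["person", "address", "email", "phone number"]  # illustrative label names

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(f"{ent['text']} -> {ent['label']} ({ent['score']:.2f})")
```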

## Results

A results folder is available in the repository to store the results of the various experiments and the related metrics.

## Other Information

We also provide a solution to the issue reported in the [pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k/discussions/3) dataset repository: a method to transform the natural-language text into a token-tag format that can be used to train a Named Entity Recognition (NER) model with the Hugging Face `AutoTrain` API.
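
For reference, the token-tag layout looks like the sketch below; the BIO-style tags shown here are illustrative, and the actual column names and label set expected by AutoTrain and produced by our conversion may differ.

```python
# One example in token-tag form: one tag per token, "O" for non-PII tokens.
example = {
    "tokens": ["My", "name", "is", "John", "Smith", "."],
    "tags": ["O", "O", "O", "B-NAME", "I-NAME", "O"],  # illustrative BIO labels
}
print(list(zip(example["tokens"], example["tags"])))
```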