Token Classification · Transformers · Safetensors · distilbert · Inference Endpoints
RedHitMark committed · 01ffc0c · verified · 1 Parent(s): 4c0e288

Update README.md

Files changed (1): README.md (+158 −1)
README.md CHANGED

@@ -11,4 +11,161 @@ language:
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: token-classification
---

# Neural Wave - Hackathon 2024 - Lugano

This repository contains the code produced by the `Molise.ai` team at the Neural Wave Hackathon 2024 in Lugano.

## Challenge

Here is a brief explanation of the challenge:

The challenge was proposed by **Ai4Privacy**, a company that builds global solutions that enhance **privacy protections** in the rapidly evolving world of **Artificial Intelligence**.
The goal of the challenge is to create a machine learning model capable of detecting and masking **PII** (Personally Identifiable Information) in text data across several languages and locales. The task requires working with a synthetic dataset to train models that can automatically identify and redact **17 types of PII** in natural language texts. The solution should aim for high accuracy while maintaining the **usability** of the underlying data.
The final solution could be integrated into various systems to enhance privacy protections across industries, including client support, legal, and general data anonymization tools. Success in this project will contribute to scaling privacy-conscious AI systems without compromising UX or operational performance.

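To make the task concrete, here is a minimal sketch (not taken from this repository) of how a fine-tuned token-classification model can be used to mask detected PII spans with the `transformers` pipeline. The model path is a placeholder and the label names depend on the dataset:

```python
# Minimal sketch: masking PII spans found by a token-classification model.
# "path/to/finetuned-model" is a placeholder for a fine-tuned checkpoint.
from transformers import pipeline

pii_detector = pipeline(
    "token-classification",
    model="path/to/finetuned-model",
    aggregation_strategy="simple",  # merge word pieces into entity spans
)

def mask_pii(text: str) -> str:
    """Replace every detected PII span with a [LABEL] placeholder."""
    entities = pii_detector(text)
    # Replace from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]
    return text

print(mask_pii("My name is Jane Doe and my phone number is 555-0142."))
```
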
## Getting Started

Create a `.env` file by copying the provided `.env.example` and filling in the required values:

```bash
cp .env.example .env
```

### Install the dependencies

```bash
pip install -r requirements.txt
```

## Set `PYTHONPATH` if needed

```bash
export PYTHONPATH="${PYTHONPATH}:$PWD"
```

## Inference

### Inference on the full dataset

You can run inference on the complete test dataset using the following command:

```bash
python inference.py -s ./dataset/test
```

### Inference on a small dataset

To perform inference on a small subset of the dataset, use the `--subsample` flag:

```bash
python inference.py -s ./dataset/test --subsample
```

## Run the UI

To run the UI for interacting with the models and viewing results, use Streamlit:

```bash
streamlit run ui.py
```

## Run the API

To start the API for the model, you will need FastAPI. Run the following command:

```bash
fastapi run api.py
```

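Once the server is up (`fastapi run` listens on port 8000 by default), it can be queried over HTTP. The snippet below is only a hypothetical client call; the actual endpoint path and request schema are defined in `api.py` and may differ:

```python
# Hypothetical example: the endpoint name and payload shape are assumptions,
# not documented in this repository.
import requests

response = requests.post(
    "http://localhost:8000/mask",          # assumed endpoint path
    json={"text": "My name is Jane Doe."}, # assumed request schema
    timeout=10,
)
response.raise_for_status()
print(response.json())
```
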
## Experiments

This repository supports two main types of experiments:

1. Fine-tuning models from the BERT family.
2. Fine-tuning models from the GLiNER family.

Both experiment types are located in the `experiments/` folder, and each fine-tuning script allows you to pass specific arguments for the model choice, dataset, output directory, and optional alternative dataset columns.

### BERT Fine-Tuning

The BERT fine-tuning script fine-tunes models from the BERT family on a specific dataset. Optionally, you can use alternative columns that are preprocessed during the data preparation phase.

```bash
python experiments/bert_finetune.py --dataset path/to/dataset --model model_name --output_dir /path/to/output [--alternative_columns]
```

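For reference, the core recipe such a script typically follows is sketched below. This is not the repository's `bert_finetune.py`; it assumes a dataset with `tokens` (pre-split words) and `ner_tags` (per-word label ids) columns and uses a placeholder data path:

```python
# Sketch of standard token-classification fine-tuning with transformers.
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert/distilbert-base-multilingual-cased"
dataset = load_dataset("json", data_files="path/to/dataset.json")  # placeholder path
num_labels = 1 + max(tag for row in dataset["train"]["ner_tags"] for tag in row)

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_and_align(examples):
    # Sub-word tokenize the pre-split words and copy each word's label to its
    # first sub-token; remaining sub-tokens get -100 so the loss ignores them.
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        row, previous = [], None
        for word_id in tokenized.word_ids(batch_index=i):
            if word_id is None or word_id == previous:
                row.append(-100)
            else:
                row.append(word_labels[word_id])
            previous = word_id
        labels.append(row)
    tokenized["labels"] = labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align, batched=True)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=num_labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetune-out", learning_rate=2e-5, num_train_epochs=3),
    train_dataset=tokenized_dataset["train"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```
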
#### Available BERT models

Here is a list of available BERT models that can be used for fine-tuning. Additional models based on the BERT tokenizer may also work with minimal modifications:

- BERT classic
  + `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`
- DistilBERT
  + `distilbert-base-uncased`, `distilbert-base-cased`
- RoBERTa
  + `roberta-base`, `roberta-large`
- ALBERT
  + `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, `albert-xxlarge-v2`
- Electra
  + `google/electra-small-discriminator`, `google/electra-base-discriminator`, `google/electra-large-discriminator`
- DeBERTa
  + `microsoft/deberta-base`, `microsoft/deberta-large`

### GLiNER Fine-Tuning

The GLiNER models require an additional dataset preparation step before fine-tuning. The process happens in two stages:

1. **Prepare the dataset for GLiNER models.** Run the GLiNER dataset preparation script to pre-process your dataset:

   ```bash
   python experiments/gliner_prepare.py --dataset path/to/dataset
   ```

   This creates a new JSON-formatted dataset file with the same name in the specified output directory (a sketch of what such a record typically looks like is shown after this list).

2. **Fine-tune the GLiNER model.** After the dataset preparation, run the GLiNER fine-tuning script on the prepared JSON file:

   ```bash
   python experiments/gliner_finetune.py --dataset path/to/prepared/dataset.json --model model_name --output_dir /path/to/output [--alternative_columns]
   ```

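As a point of reference, GLiNER-style training data is usually a list of span-annotated records. The snippet below only illustrates that general shape; it is not the exact output of `gliner_prepare.py`, and the label names depend on the PII label set:

```python
# Illustrative only: a record in the span-annotated JSON format that GLiNER
# training code commonly consumes.
record = {
    "tokenized_text": ["My", "name", "is", "Jane", "Doe", "."],
    # Each span is [start_token_index, end_token_index, label] over the token list.
    "ner": [[3, 4, "person_name"]],  # label names are dataset-specific
}
```
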
#### Available GLiNER models

You can use the following GLiNER models for fine-tuning, though additional compatible models may work similarly:

- `gliner-community/gliner_xxl-v2.5`
- `gliner-community/gliner_large-v2.5`
- `gliner-community/gliner_medium-v2.5`
- `gliner-community/gliner_small-v2.5`

## Results

A results folder in the repository stores the outputs and metrics of the various experiments.

## Other Information

We also provide a solution to the issue reported in the [pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k/discussions/3) dataset repository: we created a method to transform the natural-language text into a token-tag format that can be used to train a Named Entity Recognition (NER) model with the Hugging Face `AutoTrain` API.
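The general idea is sketched below. This is a simplified illustration assuming character-offset span annotations (`start`, `end`, `label`); it is not the exact implementation used for the dataset:

```python
# Simplified sketch: turn character-level PII span annotations into per-token
# BIO tags suitable for token-classification / NER training.
import re

def to_bio(text, spans):
    """spans: list of {"start", "end", "label"} dicts with character offsets."""
    tokens, tags, prev_label = [], [], None
    for match in re.finditer(r"\S+", text):  # whitespace tokenization, for brevity
        label = next(
            (s["label"] for s in spans if s["start"] <= match.start() < s["end"]),
            None,
        )
        if label is None:
            tags.append("O")
        elif label != prev_label:
            tags.append(f"B-{label}")  # first token of an entity
        else:
            tags.append(f"I-{label}")  # continuation (adjacent same-label spans merge)
        tokens.append(match.group())
        prev_label = label
    return tokens, tags

tokens, tags = to_bio("My name is Jane Doe.", [{"start": 11, "end": 19, "label": "NAME"}])
print(list(zip(tokens, tags)))
```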