base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: token-classification
---

# Neural Wave - Hackathon 2024 - Lugano

This repository contains the code produced by the `Molise.ai` team in the Neural Wave Hackathon 2024 competition in Lugano.

## Challenge

Here is a brief explanation of the challenge:

The challenge was proposed by **Ai4Privacy**, a company that builds global solutions to enhance **privacy protections** in the rapidly evolving world of **Artificial Intelligence**.

The goal of the challenge is to create a machine learning model capable of detecting and masking **PII** (Personally Identifiable Information) in text data across several languages and locales. The task requires working with a synthetic dataset to train models that can automatically identify and redact **17 types of PII** in natural language texts. The solution should aim for high accuracy while maintaining the **usability** of the underlying data.

The final solution could be integrated into various systems to enhance privacy protections across industries, including client support, legal, and general data anonymization tools. Success in this project will contribute to scaling privacy-conscious AI systems without compromising UX or operational performance.
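
To make the task concrete, here is a minimal, illustrative sketch of PII masking treated as token classification with the `transformers` pipeline. The checkpoint path is a placeholder for a model fine-tuned with the scripts in this repository (the base model is `distilbert/distilbert-base-multilingual-cased`), and the entity labels depend on the tag set used during training.

```python
from transformers import pipeline

# Placeholder path: point this at a checkpoint produced by the fine-tuning scripts.
MODEL_PATH = "path/to/finetuned-checkpoint"

# aggregation_strategy="simple" merges word pieces into whole entity spans.
ner = pipeline("token-classification", model=MODEL_PATH, aggregation_strategy="simple")

text = "Hi, I'm John Smith and my email is john.smith@example.com."
entities = ner(text)

# Replace each detected span with its label, working right-to-left so offsets stay valid.
masked = text
for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
    masked = masked[: ent["start"]] + f"[{ent['entity_group']}]" + masked[ent["end"]:]

print(masked)
```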

## Getting Started

Create a `.env` file: copy the `.env.example` file, rename it to `.env`, and fill in the required values.

```bash
cp .env.example .env
```

### Install the dependencies

```bash
pip install -r requirements.txt
```

### Set `PYTHONPATH` if needed

```bash
export PYTHONPATH="${PYTHONPATH}:$PWD"
```

## Inference

### Inference on the full dataset

You can run inference on the complete test dataset using the following command:

```bash
python inference.py -s ./dataset/test
```

### Inference on a small dataset

To perform inference on a small subset of the dataset, use the `--subsample` flag:

```bash
python inference.py -s ./dataset/test --subsample
```

## Run the UI

To run the UI for interacting with the models and viewing results, use Streamlit:

```bash
streamlit run ui.py
```

## Run the API

To start the API for the model, you'll need FastAPI. Run the following command:

```bash
fastapi run api.py
```
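
Once the server is running you can query it over HTTP. The snippet below is only a sketch: the route name, payload shape, and port are assumptions, so check `api.py` for the actual endpoint definitions.

```python
import requests

# Hypothetical endpoint and payload; adjust to match the routes defined in api.py.
resp = requests.post(
    "http://localhost:8000/mask",
    json={"text": "Hi, I'm John Smith and my email is john.smith@example.com."},
)
resp.raise_for_status()
print(resp.json())
```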

## Experiments

This repository supports two main types of experiments:

1. Fine-tuning models from the BERT family.
2. Fine-tuning models from the GLiNER family.

Both experiment types live in the `experiments/` folder, and each fine-tuning script accepts arguments for the model choice, dataset, output directory, and optional alternative dataset columns.

### BERT Fine-Tuning

The BERT fine-tuning script fine-tunes models from the BERT family on a specific dataset. Optionally, you can use alternative columns that are preprocessed during the data preparation phase.

```bash
python experiments/bert_finetune.py --dataset path/to/dataset --model model_name --output_dir /path/to/output [--alternative_columns]
```

#### Available BERT models

Here is a list of BERT-family models that can be used for fine-tuning. Additional models based on the BERT tokenizer may also work with minimal modifications:

- BERT classic
  + `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`
- DistilBERT
  + `distilbert-base-uncased`, `distilbert-base-cased`
- RoBERTa
  + `roberta-base`, `roberta-large`
- ALBERT
  + `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, `albert-xxlarge-v2`
- Electra
  + `google/electra-small-discriminator`, `google/electra-base-discriminator`, `google/electra-large-discriminator`
- DeBERTa
  + `microsoft/deberta-base`, `microsoft/deberta-large`
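
All of these are standard Hugging Face checkpoints, so any of them can be swapped in by name. As a minimal sketch of how such a checkpoint is loaded for token classification, note that the `num_labels` value below is an assumption (a BIO scheme over the 17 PII types, i.e. 2 × 17 + 1); the actual number depends on the tag set built during data preparation.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "distilbert-base-cased"  # any checkpoint from the list above
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assumption: BIO tagging over 17 PII types -> 2 * 17 + 1 = 35 labels.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=35)
```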

### GLiNER Fine-Tuning

The GLiNER models require an additional dataset preparation step before fine-tuning, so the process happens in two stages:

1. **Prepare the dataset for the GLiNER models.** Run the GLiNER dataset preparation script to pre-process your dataset:

   ```bash
   python experiments/gliner_prepare.py --dataset path/to/dataset
   ```

   This will create a new JSON-formatted dataset file with the same name in the specified output directory (a sketch of the prepared record format follows this list).

2. **Fine-tune the GLiNER model.** After the dataset preparation, run the GLiNER fine-tuning script:

   ```bash
   python experiments/gliner_finetune.py --dataset path/to/prepared/dataset.json --model model_name --output_dir /path/to/output [--alternative_columns]
   ```
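
The exact schema written by `gliner_prepare.py` is defined in that script; as a rough, hedged illustration, GLiNER training data is commonly a list of records pairing tokenized text with token-span annotations, along these lines:

```python
import json

# Illustrative record in a span-annotated format commonly used for GLiNER training;
# the exact keys and span convention produced by gliner_prepare.py may differ.
record = {
    "tokenized_text": ["John", "Smith", "lives", "in", "Lugano", "."],
    "ner": [[0, 1, "person"], [4, 4, "city"]],  # [start_token, end_token, label]
}

with open("prepared_dataset.json", "w") as f:
    json.dump([record], f, indent=2)
```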

#### Available GLiNER models

You can use the following GLiNER models for fine-tuning, though additional compatible models may work similarly:

- `gliner-community/gliner_xxl-v2.5`
- `gliner-community/gliner_large-v2.5`
- `gliner-community/gliner_medium-v2.5`
- `gliner-community/gliner_small-v2.5`
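
As a quick way to inspect what these models predict, here is a minimal sketch using the `gliner` package; the entity labels and threshold are illustrative, not the challenge's official tag set:

```python
from gliner import GLiNER

# Load a community checkpoint (or a fine-tuned model from your output_dir).
model = GLiNER.from_pretrained("gliner-community/gliner_small-v2.5")

text = "Maria Rossi lives at Via Nassa 5, Lugano."
labels = ["person", "address", "email", "phone number"]  # illustrative label names

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(f"{ent['text']} -> {ent['label']} ({ent['score']:.2f})")
```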

## Results

A results folder is available in the repository to store the results of the various experiments and the related metrics.

## Other Information

We also provide a solution to the issue reported in the [pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k/discussions/3) dataset repository: a method to transform the natural-language text into a token-tag format that can be used to train a Named Entity Recognition (NER) model with the Hugging Face `AutoTrain` API.
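
For reference, the token-tag layout looks like the sketch below; the BIO-style tags shown here are illustrative, and the actual column names and label set expected by AutoTrain and produced by our conversion may differ.

```python
# One example in token-tag form: one tag per token, "O" for non-PII tokens.
example = {
    "tokens": ["My", "name", "is", "John", "Smith", "."],
    "tags": ["O", "O", "O", "B-NAME", "I-NAME", "O"],  # illustrative BIO labels
}
print(list(zip(example["tokens"], example["tags"])))
```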