Model Card for ai4privacy-mdeberta-v3-base-general-preprocessed

This model detects PII (Personally Identifiable Information). It was trained by the "The Last Ones" team at the NeuralWave Hackathon.

Model Details

This model was fine-tuned from microsoft/mdeberta-v3-base on the ai4privacy/pii-masking-400k dataset.
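As a rough illustration of how a fine-tuned token classifier like this one might be used, the sketch below loads the model through the transformers `pipeline` API and masks the detected spans. The repository ID is taken from this model's hub page. The helper names `load_pii_pipeline`, `redact`, and `demo` are hypothetical, and the sketch assumes the uploaded config carries the label names.

```python
from typing import Dict, List

# Repository ID as shown on the hub page for this model.
MODEL_ID = "bobocn/ai4privacy-mdeberta-v3-base-general-preprocessed"


def load_pii_pipeline(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so that redact() below works without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-tokens into entity spans with
    # "entity_group", "start" and "end" keys.
    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def redact(text: str, entities: List[Dict]) -> str:
    """Replace every detected span with its entity label in square brackets."""
    # Work from the end of the string so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


def demo() -> None:
    """Hypothetical end-to-end call; needs network access to fetch the model."""
    pii = load_pii_pipeline()
    sample = "My name is Anna Rossi and my email is anna.rossi@example.com."
    print(redact(sample, pii(sample)))
```

Calling `demo()` prints the sample sentence with each detected span replaced by its label. The exact labels depend on the model's predictions, so no output is shown here.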

Fine-tuning used the following training arguments:

  • learning_rate=3e-5,
  • per_device_train_batch_size=58,
  • per_device_eval_batch_size=58,
  • num_train_epochs=3,
  • weight_decay=0.01,
  • bf16=True,
  • seed=42

and other default hyperparameters of TrainingArguments.
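The values above could be passed to `TrainingArguments` roughly as follows. This is a minimal sketch: `output_dir` is a hypothetical path that the card does not specify, and `build_training_args` is an illustrative helper name. Because `bf16=True` requires hardware with bfloat16 support, the object is only constructed when the helper is called.

```python
# Hyperparameters listed in this card; everything else stays at the defaults.
TRAINING_KWARGS = dict(
    learning_rate=3e-5,
    per_device_train_batch_size=58,
    per_device_eval_batch_size=58,
    num_train_epochs=3,
    weight_decay=0.01,
    bf16=True,
    seed=42,
)


def build_training_args(output_dir: str = "./pii-mdeberta"):
    """Create TrainingArguments; output_dir is a hypothetical path."""
    # Imported lazily so the settings above can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```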

Training Data

ai4privacy/pii-masking-400k
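The preprocessing code in this card refers to `label_set` and `label2id` without defining them. One plausible way to derive them from the dataset's `privacy_mask` column is sketched below. The helper names and the ID ordering ("O" first, then sorted B-/I- pairs) are assumptions, not necessarily the mapping used for the released model.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_label_maps(
    privacy_masks: Iterable[List[dict]],
) -> Tuple[Set[str], Dict[str, int], Dict[int, str]]:
    """Collect entity labels and build BIO tag <-> id maps."""
    label_set = set()
    for mask in privacy_masks:
        for item in mask:
            label_set.add(item["label"])

    # Assumed ordering: "O" first, then a B-/I- pair per entity type, sorted by name.
    tags = ["O"]
    for label in sorted(label_set):
        tags.extend([f"B-{label}", f"I-{label}"])
    label2id = {tag: i for i, tag in enumerate(tags)}
    id2label = {i: tag for tag, i in label2id.items()}
    return label_set, label2id, id2label


def load_label_maps():
    """Hypothetical helper: derive the maps from the training split (needs network)."""
    from datasets import load_dataset

    train = load_dataset("ai4privacy/pii-masking-400k", split="train")
    return build_label_maps(train["privacy_mask"])
```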

Preprocessing

import re

# NOTE: label_set (set of entity label names), label2id (BIO tag -> id) and
# tokenizer are assumed to be defined elsewhere; they are not shown in this card.

def generate_sequence_labels(text, privacy_mask):
    # process spans from last to first so earlier character offsets stay valid
    privacy_mask = sorted(privacy_mask, key=lambda x: x['start'], reverse=True)

    # replace sensitive pieces of text with labels
    for item in privacy_mask:
        label = item['label']
        start = item['start']
        end = item['end']
        value = item['value']
        # count the number of words in the value
        word_count = len(value.split())

        # replace the sensitive information with one label placeholder per word
        replacement = " ".join([label] * word_count)
        text = text[:start] + replacement + text[end:]

    words = text.split()
    # assign labels to each word
    labels = []
    for word in words:
        match = re.search(r"(\w+)", word)  # first run of word characters, ignoring punctuation
        if match and match.group(1) in label_set:
            labels.append(match.group(1))
        else:
            # any other word is labeled as "O"
            labels.append("O")
    return labels
k = 0  # number of examples whose labels could not be aligned

def tokenize_and_align_labels(examples):
    global k
    words = [t.split() for t in examples["source_text"]]
    tokenized_inputs = tokenizer(words, truncation=True, is_split_into_words=True, max_length=512)
    source_labels = [
        generate_sequence_labels(text, mask)
        for text, mask in zip(examples["source_text"], examples["privacy_mask"])
    ]

    labels = []
    for i, label in enumerate(source_labels):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # map tokens to their respective word
        previous_label = None
        label_ids = [-100]  # the leading special token is ignored by the loss
        try:
            for word_idx in word_ids:
                if word_idx is None:
                    continue  # special tokens are covered by the -100 entries at both ends
                elif label[word_idx] == "O":
                    label_ids.append(label2id["O"])
                    continue  # note: previous_label is not reset here
                elif previous_label == label[word_idx]:
                    label_ids.append(label2id[f"I-{label[word_idx]}"])
                else:
                    label_ids.append(label2id[f"B-{label[word_idx]}"])
                previous_label = label[word_idx]
            # keep at most 511 entries, then add -100 for the trailing special token
            label_ids = label_ids[:511] + [-100]
            labels.append(label_ids)
        except (IndexError, KeyError):
            # word-level labels and tokenizer words disagree, or a tag is missing
            # from label2id: count the example and mask it out entirely
            k += 1
            labels.append([-100] * len(tokenized_inputs["input_ids"][i]))

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

We use these two functions to generate word-level labels for the source text and then align them with the tokenizer's tokens to produce token-level labels, so that any model and tokenizer can be trained on ai4privacy/pii-masking-400k.
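To make the two steps concrete, here is a small self-contained walkthrough on a toy sentence. It condenses the logic of the functions above and is not the training code itself. The sentence, the label names, and the hand-written `word_ids` list (standing in for tokenizer output) are all illustrative assumptions.

```python
import re

text = "Mail John Smith at js@example.com now"
LABELS = {"GIVENNAME", "SURNAME", "EMAIL"}  # illustrative label set


def span(value, label):
    """Build a privacy_mask entry with character offsets computed from the text."""
    start = text.index(value)
    return {"label": label, "start": start, "end": start + len(value), "value": value}


privacy_mask = [span("John", "GIVENNAME"), span("Smith", "SURNAME"),
                span("js@example.com", "EMAIL")]


def word_labels(text, privacy_mask, label_set):
    """Step 1: substitute each span with its label, then label every word."""
    # Replace from the last span to the first so offsets stay valid.
    for item in sorted(privacy_mask, key=lambda x: x["start"], reverse=True):
        n_words = len(item["value"].split())
        text = text[: item["start"]] + " ".join([item["label"]] * n_words) + text[item["end"]:]
    out = []
    for word in text.split():
        m = re.search(r"\w+", word)
        out.append(m.group(0) if m and m.group(0) in label_set else "O")
    return out


def bio_tags(labels, word_ids):
    """Step 2: one BIO tag per token; None marks special tokens."""
    tags, previous = [], None
    for idx in word_ids:
        if idx is None:
            tags.append(None)  # ignored by the loss (-100 in the card's code)
        elif labels[idx] == "O":
            tags.append("O")  # as in the card's code, previous is not reset here
        else:
            prefix = "I-" if previous == labels[idx] else "B-"
            tags.append(prefix + labels[idx])
            previous = labels[idx]
    return tags


labels = word_labels(text, privacy_mask, LABELS)
# Hand-written word_ids: "Smith" and the e-mail address are each assumed
# to split into two sub-tokens; None stands for the special tokens.
word_ids = [None, 0, 1, 2, 2, 3, 4, 4, 5, None]
tags = bio_tags(labels, word_ids)
print(labels)
print(tags)
```

Because the second sub-token of a word shares its word's label, it receives an `I-` tag, which is how the alignment step handles words that the tokenizer splits.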

Evaluation

The table (an image in the original model card) shows evaluation results for this model, listed there as model 2, on the validation set.

Disclaimer of Non-Affiliation

The publisher of this repository is not affiliated with Ai4Privacy or Ai Suisse SA.

@NeuralWave 2024 - The Last Ones Team.
