Oct 9, 2023

Hi,
I tried using starpii to detect personal information in code, such as names and email addresses.
My code is shown below:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starpii"
device = "cuda"
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data = "Python\nuser_name = 'wrc'\nemail='iuewfn@gmail.com'\ndata=abcdefg\n"
inputs = tokenizer.encode(data, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=100)
print(tokenizer.decode(outputs[0], clean_up_tokenization_spaces=False))

I simply called this model without any complex processing.
The output is:

Python
user_name = 'wrc'
email='iuewfn@gmail.com'
data=abcdefg
gressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgress

It looks as if the model hasn't undergone any training, or I used the wrong token.

May I ask what I did wrong?

ruochenwang changed discussion title from starpii doesn't have any meanful output to starpii doesn't have any meaningful output Oct 9, 2023

z-a-f

Dec 11, 2023

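You should use a 'ner' (token-classification) pipeline instead of loading the checkpoint as a causal LM. starpii is a token-classification model, not a generative one.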
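For reference, here is a minimal sketch of what that might look like. It assumes you have been granted access to the bigcode/starpii checkpoint and are logged in to the Hub, and that the pipeline is run with aggregation_strategy="simple" so each result carries "entity_group", "start" and "end" keys. The detect_pii and redact helpers are hypothetical names made up for this illustration; they are not part of transformers or of the model repository.

```python
def detect_pii(text, checkpoint="bigcode/starpii"):
    """Run the checkpoint as a token classifier and return the entity dicts."""
    # Imported lazily so that redact() below works without transformers installed.
    from transformers import pipeline

    ner = pipeline(
        "token-classification",         # "ner" is an alias for this task
        model=checkpoint,
        aggregation_strategy="simple",  # merge sub-word tokens into whole spans
    )
    return ner(text)


def redact(text, entities):
    """Replace every detected span with a <LABEL> placeholder (hypothetical helper)."""
    # Go right to left so the character offsets of earlier spans stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = "<" + ent["entity_group"] + ">"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text
```

With that in place, something like entities = detect_pii(data) followed by print(redact(data, entities)) should give labelled spans rather than generated text. Which labels come back and how reliable they are depends on the model, so it is worth checking the score field of each entity before trusting a detection.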
