Problems with last_hidden_state

#1
by joaogomes24 - opened

Hello

I am currently using the CLAP model from the Transformers library to compare embeddings between assembly code descriptions and text prompts. However, I'm encountering an issue where I can't access the last_hidden_state attribute from the model's output.

Is anyone else facing the same problem?

Thanks a lot.

Owner

Hi

I think this is because the model's output, as defined in clap_modeling.py, is a plain tensor, so it has no last_hidden_state attribute. If you want to use the CLAP model to compare embeddings between assembly code and text prompts, you can follow the sample code from the model card, which uses the returned embeddings directly and never accesses last_hidden_state.

import json

import torch
from transformers import AutoModel, AutoTokenizer

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

asm_tokenizer       = AutoTokenizer.from_pretrained("hustcw/clap-asm", trust_remote_code=True)
text_tokenizer      = AutoTokenizer.from_pretrained("hustcw/clap-text", trust_remote_code=True)
asm_encoder         = AutoModel.from_pretrained("hustcw/clap-asm", trust_remote_code=True).to(device)
text_encoder        = AutoModel.from_pretrained("hustcw/clap-text", trust_remote_code=True).to(device)

bubble_output       = "./CaseStudy/bubblesort.json"

# load bubblesort.json
with open(bubble_output) as fp:
    asm = json.load(fp)

prompts = [
    "This is a function related to bubble sort ",
    "This is a function related to selection sort",
    "This is a function related to insertion sort",
    "This is a function related to merge sort",
    "This is a function related to quick sort",
    "This is a function related to radix sort",
    "This is a function related to shell sort",
    "This is a function related to counting sort",
    "This is a function related to bucket sort",
    "This is a function related to heap sort",
]

with torch.no_grad():
    asm_input = asm_tokenizer([asm], padding=True, pad_to_multiple_of=8, return_tensors="pt", verbose=False)
    asm_input = asm_input.to(device)
    asm_embedding = asm_encoder(**asm_input)

with torch.no_grad():
    text_input = text_tokenizer(prompts, padding=True, truncation=True, return_tensors='pt')
    text_input = text_input.to(device)
    text_embeddings = text_encoder(**text_input)

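# Dot product between the assembly embedding and every prompt embedding
logits = torch.einsum("nc,ck->nk", [asm_embedding, text_embeddings.T])
# Temperature-scaled softmax turns the similarities into per-prompt probabilities
preds = torch.softmax(logits / 0.07, dim=1).squeeze(0).tolist()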

print("bubblesort zeroshot:")
for i in range(len(prompts)):
    print(f"Probability: {round(preds[i]*100, 3)}%, Text: {prompts[i]}")
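
For anyone who wants to check the scoring step without downloading the models, here is a minimal sketch that uses random stand-in tensors instead of real encoder outputs. The helper names (as_embedding, zero_shot_probs), the embedding size of 64, and the assumption that embeddings are unit-normalized are all illustrative and not taken from clap_modeling.py. It also shows why last_hidden_state is not available on a plain tensor.

```python
import torch
import torch.nn.functional as F


def as_embedding(output):
    """Return a (batch, dim) tensor from an encoder output.

    Assumption: the encoder returns either a plain tensor (as described in
    this thread) or an object exposing `pooler_output`.
    """
    if isinstance(output, torch.Tensor):
        return output
    pooled = getattr(output, "pooler_output", None)
    if pooled is None:
        raise TypeError(f"unsupported output type: {type(output).__name__}")
    return pooled


def zero_shot_probs(asm_embedding, text_embeddings, temperature=0.07):
    """Softmax over temperature-scaled dot products, one probability per prompt."""
    logits = asm_embedding @ text_embeddings.T  # shape (1, num_prompts)
    return torch.softmax(logits / temperature, dim=1).squeeze(0)


# Stand-in data: random unit vectors, NOT real CLAP embeddings.
torch.manual_seed(0)
text_embeddings = F.normalize(torch.randn(10, 64), dim=1)
asm_embedding = text_embeddings[:1].clone()  # pretend the function matches prompt 0

# A plain tensor carries no `last_hidden_state`, hence the original error.
has_hidden_state = hasattr(asm_embedding, "last_hidden_state")

probs = zero_shot_probs(as_embedding(asm_embedding), as_embedding(text_embeddings))
best_prompt = probs.argmax().item()
print(f"has last_hidden_state: {has_hidden_state}, best prompt index: {best_prompt}")
```

With real encoders, asm_embedding and text_embeddings would simply be the tensors returned by asm_encoder and text_encoder in the script above.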

I am grateful for your quick answer, which proved to be of major help. I had assumed that the last_hidden_state would be the most effective means of obtaining the most accurate results.

I am also developing AI tools for correcting assembly code.
I would be grateful if you could spare some time to discuss some ideas with me.

Owner

Sure, we can use this channel for discussion, or you can email me :)
