Problems with last_hidden_state
Hello
I am currently using the CLAP model with the Transformers library to compare embeddings between assembly code descriptions and text prompts. However, I can't access the last_hidden_state attribute on the model's output.
Is anyone else facing the same problem?
Thanks a lot.
Hi
I think this is because the model output defined in clap_modeling.py is a plain tensor rather than an output object, so it has no last_hidden_state attribute. If you want to use the CLAP model to compare embeddings between assembly code descriptions and text prompts, you can follow the sample code from the model card, which does not need last_hidden_state:
import torch.multiprocessing
import torch
import json
from transformers import AutoModel, AutoTokenizer
device = torch.device("cuda")
asm_tokenizer = AutoTokenizer.from_pretrained("hustcw/clap-asm", trust_remote_code=True)
text_tokenizer = AutoTokenizer.from_pretrained("hustcw/clap-text", trust_remote_code=True)
asm_encoder = AutoModel.from_pretrained("hustcw/clap-asm", trust_remote_code=True).to(device)
text_encoder = AutoModel.from_pretrained("hustcw/clap-text", trust_remote_code=True).to(device)
bubble_output = "./CaseStudy/bubblesort.json"
# load bubblesort.json
with open(bubble_output) as fp:
    asm = json.load(fp)
prompts = [
"This is a function related to bubble sort ",
"This is a function related to selection sort",
"This is a function related to insertion sort",
"This is a function related to merge sort",
"This is a function related to quick sort",
"This is a function related to radix sort",
"This is a function related to shell sort",
"This is a function related to counting sort",
"This is a function related to bucket sort",
"This is a function related to heap sort",
]
# Encode the assembly function; the encoder returns the embedding tensor directly
with torch.no_grad():
    asm_input = asm_tokenizer([asm], padding=True, pad_to_multiple_of=8, return_tensors="pt", verbose=False)
    asm_input = asm_input.to(device)
    asm_embedding = asm_encoder(**asm_input)
# Encode the text prompts the same way
with torch.no_grad():
    text_input = text_tokenizer(prompts, padding=True, truncation=True, return_tensors='pt')
    text_input = text_input.to(device)
    text_embeddings = text_encoder(**text_input)
# Dot product between the assembly embedding and every prompt embedding
logits = torch.einsum("nc,ck->nk", [asm_embedding, text_embeddings.T])
# Temperature-scaled softmax turns the similarities into probabilities
preds = torch.softmax(logits / 0.07, dim=1).squeeze(0).tolist()
print("bubblesort zeroshot:")
for i in range(len(prompts)):
    print(f"Probability: {round(preds[i]*100, 3)}%, Text: {prompts[i]}")
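To see the scoring step in isolation, without downloading the models, here is a minimal sketch that runs the same dot-product-and-softmax logic on stand-in tensors. The helper name zero_shot_probs, the embedding size of 16, and the random unit-norm vectors are all assumptions made for illustration; real embeddings would come from asm_encoder and text_encoder as in the sample code.

```python
import torch


def zero_shot_probs(asm_embedding, text_embeddings, temperature=0.07):
    """Softmax over temperature-scaled dot products (hypothetical helper).

    asm_embedding:   tensor of shape (1, C)
    text_embeddings: tensor of shape (K, C)
    returns:         tensor of shape (K,) that sums to 1
    """
    # The encoders return plain tensors, so there is no
    # last_hidden_state attribute to read from.
    assert isinstance(asm_embedding, torch.Tensor)
    assert isinstance(text_embeddings, torch.Tensor)
    # Same computation as einsum("nc,ck->nk", ...) in the sample code.
    logits = asm_embedding @ text_embeddings.T
    return torch.softmax(logits / temperature, dim=1).squeeze(0)


# Stand-in data (assumption): random unit vectors, not real CLAP outputs.
torch.manual_seed(0)
dim = 16
text_embeddings = torch.nn.functional.normalize(torch.randn(3, dim), dim=1)
# Pretend the assembly embedding coincides with the second prompt.
asm_embedding = text_embeddings[1:2].clone()

probs = zero_shot_probs(asm_embedding, text_embeddings)
best = int(torch.argmax(probs))
print(best, probs.tolist())
```

Because the stand-in assembly vector is a copy of the second prompt vector, that prompt gets the highest probability; with real embeddings the ranking depends on what the encoders produce.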
Thank you for the quick answer; it was a great help. I had assumed that last_hidden_state would be the best way to get the most accurate results.
I am also currently developing AI tools for correcting assembly code.
I would be grateful if you could spare some time to discuss some ideas with me.
Sure, we can discuss it in this channel, or you can email me :)