Update modeling_nvembed.py
#49
by lukelv, opened
Update _do_encode() so the function can return either a FloatTensor or a NumPy array.
Hello Author,
I was glad to work with your model while training and encoding a list of sentences. I used _do_encode() for this and noticed that each batch of vectors is converted to a NumPy array at every step of the for loop. That works, but I needed the vectors returned as tensors, so I modified the function like this:
```python
@torch.no_grad()
def _do_encode(self,
               prompts: List[str],
               batch_size: int = 1,
               instruction: str = "",
               max_length: int = 4096,
               num_workers: int = 32,
               **kwargs
               ) -> Union[np.ndarray, torch.FloatTensor]:
    dataset: Dataset = Dataset.from_dict({'input_texts': prompts})
    dataset.set_transform(partial(input_transform_func,
                                  self.tokenizer,
                                  always_add_eos=True,
                                  max_length=max_length,
                                  instruction=instruction))
    data_collator = DataCollatorWithPadding(self.tokenizer)
    data_loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        drop_last=False,
        num_workers=num_workers,
        collate_fn=data_collator,
        pin_memory=True)

    if self.padding_side == "right" and self.is_mask_instruction and len(instruction) > 0:
        instruction_lens = len(self.tokenizer.tokenize(instruction))
    else:
        instruction_lens = 0

    encoded_embeds = []
    device = next(self.embedding_model.parameters()).device
    for batch_dict in tqdm(data_loader, desc='encoding', mininterval=10):
        features = self.prepare_kwargs_from_batch(batch_dict, instruction_lens, device=device)
        # Keep per-batch outputs as tensors; no per-step NumPy conversion.
        embeds = self(**features)["sentence_embeddings"].squeeze(1)
        encoded_embeds.append(embeds)
    # Concatenate once after the loop; convert to NumPy only if requested.
    encoded_embeds = torch.cat(encoded_embeds, dim=0)
    if kwargs.get("return_numpy"):
        encoded_embeds = encoded_embeds.cpu().detach().numpy()
    return encoded_embeds
```
It can now return two types of data. Moreover, I noticed that the function encodes faster than the previous version, because it converts the tensors to a NumPy array only once, after the for loop finishes, instead of once per batch. You can consider this.
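The accumulate-then-convert pattern above can be sketched in isolation. This is a minimal, self-contained illustration (the `encode_batches` helper and the random input batches are hypothetical, standing in for the model's per-batch `sentence_embeddings`), not the actual `_do_encode()` method:

```python
import torch
import numpy as np
from typing import List, Union

def encode_batches(batches: List[torch.Tensor],
                   return_numpy: bool = False) -> Union[np.ndarray, torch.Tensor]:
    """Accumulate per-batch embeddings as tensors, concatenate once,
    and convert to NumPy only when the caller asks for it."""
    encoded = []
    for batch in batches:
        # In the real _do_encode() this would be self(**features)["sentence_embeddings"]
        encoded.append(batch)
    out = torch.cat(encoded, dim=0)       # single concatenation after the loop
    if return_numpy:
        out = out.cpu().detach().numpy()  # single tensor-to-NumPy conversion
    return out

# Two batches of 4-dim embeddings: shapes (2, 4) and (3, 4) concatenate to (5, 4).
batches = [torch.randn(2, 4), torch.randn(3, 4)]
print(encode_batches(batches).shape)                      # torch.Size([5, 4])
print(type(encode_batches(batches, return_numpy=True)))   # <class 'numpy.ndarray'>
```

Deferring the NumPy conversion avoids one device-to-host copy per batch, which is where the speedup in the patched loop comes from.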
nada5 changed pull request status to merged