In actual testing, it is less than 10% faster than fp16

#1
by swulling - opened

Load the model in fp16:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
model.to("cuda")
model.eval()
model.half()  # cast weights to fp16

pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512).to("cuda")
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
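For comparison with the ONNX numbers below, here is a minimal timing sketch for this fp16 setup, continuing from the snippet above. The 1024-pair batch mirrors the benchmark further down, and the explicit synchronize call is an addition of mine to make sure the GPU work is finished before the clock stops.

import time

# Repeat one pair 1024 times to mirror the ONNX benchmark below (assumption).
pairs = [['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']] * 1024

t0 = time.time()
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512).to("cuda")
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
t1 = time.time()
print(f"fp16 seconds: {t1 - t0}")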

If you use ONNX Runtime on GPU with O4 optimization, it is faster than CTranslate2.
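For reference, one way to produce an O4-optimized ONNX model directory with optimum's Python API is sketched below (assuming a recent optimum version; the output folder name simply matches the one used in the benchmark further down).

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

# Export the PyTorch checkpoint to ONNX on the fly.
onnx_model = ORTModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-large", export=True)

# O4 applies GPU-oriented graph fusions plus fp16 conversion.
optimizer = ORTOptimizer.from_pretrained(onnx_model)
optimizer.optimize(save_dir="./onnxO4_bge_reranker_large", optimization_config=AutoOptimizationConfig.O4())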

Hi @swulling, thanks for your comment. I will do some extensive testing today to benchmark against ONNX O4.

@swulling

Below is a quick benchmark (on an A10 GPU).

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
import time
import torch

device_mapping="cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("./onnxO4_bge_reranker_large")
model = ORTModelForSequenceClassification.from_pretrained("./onnxO4_bge_reranker_large").to(device_mapping)

pairs = [['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]*1024
t0 = time.time()
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512).to(device_mapping)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
t1 = time.time()
print(f"Seconds: {t1-t0}")

# Seconds: 1.3976035118103027
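As a methodology note, the timed block above includes tokenization and any first-call warm-up cost, and a single 1024-pair batch padded to 512 tokens can be memory-hungry. Below is a sketch of the same measurement with a warm-up pass and mini-batches (the batch size of 64 is an arbitrary assumption), reusing pairs, tokenizer, model and device_mapping from above.

batch_size = 64  # assumed mini-batch size

with torch.no_grad():
    # Warm-up: the first forward pass is typically slower than steady state.
    warmup = tokenizer(pairs[:batch_size], padding=True, truncation=True, return_tensors='pt', max_length=512).to(device_mapping)
    _ = model(**warmup).logits

    t0 = time.time()
    all_scores = []
    for i in range(0, len(pairs), batch_size):
        inputs = tokenizer(pairs[i:i + batch_size], padding=True, truncation=True, return_tensors='pt', max_length=512).to(device_mapping)
        all_scores.append(model(**inputs).logits.view(-1).float())
    scores = torch.cat(all_scores)
    t1 = time.time()

print(f"Seconds (batched, after warm-up): {t1 - t0}")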

I tried to convert the model weights using both O3 and O4 (--device cuda). I ran into some issues, but in both cases the average time for a batch of 1024 was 1.39 seconds, vs. 0.8 s for CTranslate2 and 0.9 s for fp16. It seems like fp16 is definitely a good competitor! Have you tried converting the weights to ONNX O4 and benchmarking it too?


@hooman650

https://colab.research.google.com/drive/1HP9GQKdzYa6H9SJnAZoxJWq920gxwd2k?usp=sharing

bge-reranker-base: ONNX O4 is 2x faster than fp16.
bge-reranker-large: same result.

(Benchmark screenshot: infoflow 2023-11-21 16-30-38.png)
