In actual testing, it is less than 10% faster than using fp16.
Load:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the reranker, move it to the GPU, and cast the weights to fp16
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
model.to("cuda")
model.eval()
model.half()

# Score a couple of query/passage pairs
pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512).to("cuda")
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
print(scores)
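If you prefer scores in the [0, 1] range rather than raw logits, you can optionally pass them through a sigmoid (the raw logits are already fine for ranking; this is just a convenience):

# Optional: squash the raw logits into [0, 1]
probs = torch.sigmoid(scores)
print(probs)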
If you use the ONNX GPU runtime with O4 optimization, it is faster than ctranslate2.
Below is a quick benchmark (on an A10 GPU).
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
import time
import torch

device_mapping = "cuda" if torch.cuda.is_available() else "cpu"

# Load the O4-optimized ONNX export of bge-reranker-large
tokenizer = AutoTokenizer.from_pretrained("./onnxO4_bge_reranker_large")
model = ORTModelForSequenceClassification.from_pretrained("./onnxO4_bge_reranker_large").to(device_mapping)

# Batch of 1024 identical query/passage pairs
pairs = [['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']] * 1024

t0 = time.time()
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512).to(device_mapping)
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
t1 = time.time()
print(f"Seconds: {t1 - t0}")
# Seconds: 1.3976035118103027
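Note that the single timed call above also includes tokenization and first-call warm-up (session/kernel initialization). For a steadier number you could warm up once and average a few runs; a rough sketch reusing the model, tokenizer, pairs, and device_mapping defined above (the bench helper is just illustrative):

import time

def bench(n_runs=5):
    # Tokenize once; the timing below covers model inference only
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512).to(device_mapping)
    with torch.no_grad():
        model(**inputs)  # warm-up call
        times = []
        for _ in range(n_runs):
            t0 = time.time()
            model(**inputs)
            times.append(time.time() - t0)
    return sum(times) / len(times)

print(f"Average seconds per batch of 1024: {bench():.3f}")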
I tried to convert the model weights using both O3 and O4 (--device cuda). I encountered some issues, but in both cases the average time for a batch of 1024 was 1.39 seconds, vs 0.8 seconds for ctranslate2 and 0.9 seconds for fp16. It seems like fp16 is definitely a good competitor! Have you tried converting the weights to ONNX O4 and benchmarking too?
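For reference, instead of the optimum-cli export with --optimize O4 --device cuda, the same conversion can be done from Python with optimum's ORTOptimizer; a minimal sketch (assuming a recent optimum version, and writing to the directory name used in the benchmark above):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

# Export the PyTorch checkpoint to ONNX
ort_model = ORTModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-large", export=True)

# Apply O4 graph optimizations + fp16 (intended for CUDA inference)
optimizer = ORTOptimizer.from_pretrained(ort_model)
optimizer.optimize(save_dir="./onnxO4_bge_reranker_large", optimization_config=AutoOptimizationConfig.O4())

# Save the tokenizer next to the optimized model so both load from the same directory
AutoTokenizer.from_pretrained("BAAI/bge-reranker-large").save_pretrained("./onnxO4_bge_reranker_large")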
https://colab.research.google.com/drive/1HP9GQKdzYa6H9SJnAZoxJWq920gxwd2k?usp=sharing
bge-reranker-base: ONNX O4 is 2x faster than fp16.
bge-reranker-large: same result.