Why is the embedding far faster on the CPU than on the GPU?
#10 by ViGeng
Hello guys!
I am playing around with ViT inference speed. I have measured the time spent in the embedding and the encoder separately on CPU and GPU. The results, to my surprise, are:
| batch_size | embedding device | encoder device | img_processor time (ms) | embedding time (ms) | encoder time (ms) | total time (ms) |
|---|---|---|---|---|---|---|
| 1 | CPU | CPU | 4 | 1 | 71 | 73 |
| 1 | CPU | GPU | 3 | 1 | 164 | 166 |
| 1 | GPU | GPU | 4 | 330 | 5 | 349 |
| 16 | GPU | GPU | 47 | 319 | 7 | 326 |
| 16 | CPU | CPU | 54 | 8 | 961 | 970 |
GPU model = RTX 3090 Ti
CPU model = Intel i9-12900KF
Pretrained model weights = google/vit-base-patch16-224-in21k
I can understand that the GPU is faster than the CPU for the encoder. But:
- Why is the CPU faster than the GPU for the embedding, given that both the embedding and the encoder are neural-network layers built on matrix multiplications? (The sketch after this list shows how small the embedding stage is compared with the encoder.)
- When I use the CPU for the embedding and the GPU for the encoder, I save time on the embedding but lose some time on the encoder, which I also cannot explain.
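For context, here is a small sketch (using the submodule names of ViTModel in transformers) that compares the parameter count of the embedding stage, i.e. the patch projection plus position embeddings, with that of the 12-layer encoder:

```python
from transformers import ViTModel

model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# The embedding stage is a single Conv2d patch projection plus learned
# position embeddings; the encoder holds the 12 transformer layers.
print(f"embeddings params: {n_params(model.embeddings):,}")
print(f"encoder params:    {n_params(model.encoder):,}")
```

The embedding stage is only a small fraction of the encoder's size, so its compute cost is tiny on either device.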
I have attached my test class below; you may need to add some .to(device) calls to both the class and ViTModel to specify where each part runs:
```python
import time

from PIL import Image
from transformers import ViTImageProcessor, ViTModel


class ObjectDetector:
    def __init__(self, cuda_device='cuda:0'):
        self.device = cuda_device
        self.img_processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')
        # NOTE: add .to(self.device) here (and move the inputs below) to run on the GPU
        self.model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k').eval()
        # self.model.embeddings = self.model.embeddings.to('cpu').eval()

    # load a single image or a list of images
    def extract(self, image):
        before_time = time.time()
        inputs = self.img_processor(images=image, return_tensors="pt")
        after_img_processor = time.time()
        outputs = self.model(**inputs)
        after_model = time.time()
        print(f"Time taken for image processor: {after_img_processor - before_time}")
        print(f"Time taken for model: {after_model - after_img_processor}")
        return outputs


def main():
    detector = ObjectDetector()
    images = [Image.open(f'/home/rowan/source/edge-apps/datasets/batch/{i}.jpg') for i in range(16)]
    outputs = detector.extract(images)
    print(outputs.keys())


if __name__ == "__main__":
    main()
```
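Roughly, the mixed CPU/GPU placement from the table can be done like this (a simplified sketch rather than my exact code; example.jpg is a placeholder path): move only model.embeddings to the CPU, keep model.encoder and model.layernorm on the GPU, call the submodules yourself, and transfer the hidden states in between.

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')
model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k').eval()

# Patch/position embeddings on the CPU; encoder and final layernorm on the GPU.
model.embeddings.to('cpu')
model.encoder.to('cuda:0')
model.layernorm.to('cuda:0')

inputs = processor(images=Image.open('example.jpg'), return_tensors='pt')  # placeholder path

with torch.no_grad():
    hidden = model.embeddings(inputs['pixel_values'])    # runs on the CPU
    hidden = hidden.to('cuda:0')                         # single host-to-device copy
    encoded = model.encoder(hidden).last_hidden_state    # runs on the GPU
    encoded = model.layernorm(encoded)

print(encoded.shape)  # (batch, 197, 768) for 224x224 inputs
```

Calling the submodules directly skips the pooler, but it also makes it easy to time the embedding and encoder stages separately.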
Any comments or discussion would be appreciated!
Okay, I finally found the answer:
- The first batch of inference runs slowly, but the following batches behave as expected.
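This looks like the usual GPU warm-up effect: the first CUDA call pays one-off initialization costs, and since GPU kernels launch asynchronously, a plain time.time() around the first forward pass mostly measures that overhead. A fairer measurement (a minimal sketch; example.jpg is again a placeholder path) runs an untimed warm-up pass and synchronizes before reading the clock:

```python
import time

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

device = 'cuda:0'
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')
model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k').eval().to(device)

inputs = processor(images=Image.open('example.jpg'), return_tensors='pt').to(device)  # placeholder path

with torch.no_grad():
    model(**inputs)              # untimed warm-up pass (pays the one-off CUDA setup cost)

    torch.cuda.synchronize()
    start = time.time()
    model(**inputs)
    torch.cuda.synchronize()     # wait for the asynchronous GPU kernels to finish
    print(f"model forward: {(time.time() - start) * 1000:.1f} ms")
```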