How to achieve faster inference speed?
I ran the sample inference code on a 4090 and found that it takes about 3.14-4.85 seconds to complete inference. Is there any way to speed up inference?
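For context, a minimal sketch of how the generation step can be timed (assuming `model`, `processor`, and `inputs` are already set up as in the sample inference code; the generation arguments are the ones from that example):

```python
import time
import torch

# Time just the generate() call from the sample inference code.
# `model`, `processor`, and `inputs` come from the sample setup (assumption).
torch.cuda.synchronize()              # make sure prior GPU work has finished
start = time.perf_counter()

generate_ids = model.generate(
    **inputs,
    max_new_tokens=500,
    eos_token_id=processor.tokenizer.eos_token_id,
)

torch.cuda.synchronize()              # wait for generation to actually complete
print(f"inference took {time.perf_counter() - start:.2f} s")
```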
I'm AFK now, so I'll need to time it later to compare. Purely anecdotally, it seemed quite snappy on my 3090.
Also, that timing isn't bad compared to the free trial speeds on Azure AI (granted, that's using shared resources, so you probably wouldn't expect it to be particularly fast): https://ai.azure.com/explore/models/Phi-3-vision-128k-instruct/version/1/registry/azureml
It would be interesting to analyse the impact of downsampling images, to see if there is a sweet spot for improved speed at an acceptable loss in reading/observation accuracy (assuming this does improve speed at all).
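Something like this rough sketch could be a starting point for that experiment (the 672 px cap is an arbitrary assumption to sweep over, and `prompt` / `processor` are assumed to come from the sample inference code):

```python
from PIL import Image

MAX_SIDE = 672  # assumed cap; worth sweeping (e.g. 448 / 672 / 1344)

# Downsample the image before handing it to the processor, then compare
# latency and answer quality against the full-resolution run.
image = Image.open("example.jpg")
scale = MAX_SIDE / max(image.size)
if scale < 1.0:
    new_size = (round(image.width * scale), round(image.height * scale))
    image = image.resize(new_size, Image.LANCZOS)

# `processor` and `prompt` as in the sample code; fewer image tokens should
# mean a shorter prefill, which is where any speedup would come from.
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
```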
I changed "torch_dtype": "bfloat16" to "float16" in the config, which gave a marginal improvement in speed. For example, on an RTX 4090 the average went from 1.36 s to 1.25 s; on an RTX 3090, from 2 s to 1.8 s.
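For reference, a minimal sketch of making the same change at load time instead of editing config.json (the loading arguments are assumed to match the sample inference code):

```python
import torch
from transformers import AutoModelForCausalLM

# Override the dtype at load time rather than in config.json.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct",
    torch_dtype=torch.float16,                 # instead of the default bfloat16
    device_map="cuda",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",  # as in the sample; use "eager" if flash-attn isn't installed
)
```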
It also depends on the OS. On Windows, on a 4080S, I got about 3 s; the same scripts (a slightly modified Python example from Hugging Face) took 1.66 s on Linux. I have no idea why - disclaimer: these were different machines. But I saw the same behavior on my home GTX 1080: from 30 s on Windows down to about 22 s on WSL on the same machine.
Sorry, maybe I did not explain it correctly: my results are for my own images with my custom prompt, based on the Python example with just the prompt and image changed. So timings will depend on what image and prompt you provide. The numbers I gave are only meant to show the relative difference when using float16, and between the different GPUs I tested.
I did not try any dedicated cloud GPUs. I found a GPU-sharing service that let me test on different consumer-grade GPUs, since I was wondering what the minimum budget would be for my scenario.
I do want to try ONNX (I've seen your post), but it's not a priority for me right now: for the moment I've given up on Phi-3 Vision and I'm using GPT-4o, which is more cost-effective for my scenario, though the response times are worse.