I am using TensorRT 3.0 on a Tesla V100, and I want to convert a Caffe 1 model and run inference with it.
However, my tests show a large performance gap compared with the numbers NVIDIA publishes, so I would like to ask a few questions.
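For reference, this is roughly how I convert the model (a minimal sketch following the TensorRT 3.0 Python samples; the file paths, output blob name, and the batch/workspace sizes are placeholders):

```python
import tensorrt as trt

G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.ERROR)

# Parse the Caffe definition and weights and build an engine.
engine = trt.utils.caffe_to_trt_engine(
    G_LOGGER,
    "deploy.prototxt",         # network definition (placeholder path)
    "model.caffemodel",        # trained weights (placeholder path)
    8,                         # max batch size the engine will accept
    1 << 25,                   # max workspace size in bytes
    ["prob"],                  # output blob names (placeholder)
    trt.infer.DataType.FLOAT)  # FP32 for now

context = engine.create_execution_context()
```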
First, do the published per-batch-size benchmark numbers include all the time it takes to prepare the input data and copy the results back, or only the inference itself?
Second, measured over 5,000 images, transferring the input data to the GPU and the inference itself are fast, but memcpy_dtoh_async, which copies the results back to the host, takes most of the time. Is there a way to improve this?
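This is roughly how I time the three stages (a minimal sketch, assuming PyCUDA and an already-built execution context; `context`, `bindings`, `batch_size`, and the `h_`/`d_` buffers are placeholders from my setup). Without the synchronization between stages, a host timer around memcpy_dtoh_async would also absorb the wait for the still-running inference kernel, so maybe my measurement itself is misleading:

```python
import time

import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda

stream = cuda.Stream()

t0 = time.time()
cuda.memcpy_htod_async(d_input, h_input, stream)  # host -> device input copy
stream.synchronize()
t1 = time.time()

context.enqueue(batch_size, bindings, stream.handle, None)  # run inference
stream.synchronize()
t2 = time.time()

cuda.memcpy_dtoh_async(h_output, d_output, stream)  # device -> host result copy
stream.synchronize()
t3 = time.time()

print("htod %.3f ms, infer %.3f ms, dtoh %.3f ms"
      % ((t1 - t0) * 1e3, (t2 - t1) * 1e3, (t3 - t2) * 1e3))
```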
Third, to form a batch I simply appended images to a Python list until it reached the batch size. If that approach is wrong, what is the correct way to run inference with different batch sizes in TensorRT 3.0?
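Concretely, my batching looks roughly like this (a minimal sketch, assuming the TensorRT 3.0 Python API with PyCUDA; `images`, `output_size`, and `context` are placeholders from my pipeline):

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

batch_size = 8  # must not exceed the max_batch_size the engine was built with

# Stack the preprocessed (C, H, W) images into one contiguous float32 array;
# the engine binding is a single flat buffer, not a Python list of images.
batch = np.ascontiguousarray(np.stack(images[:batch_size]), dtype=np.float32)

# Page-locked host buffers so the async copies can actually run asynchronously.
h_input = cuda.pagelocked_empty(batch.size, dtype=np.float32)
np.copyto(h_input, batch.ravel())
h_output = cuda.pagelocked_empty(batch_size * output_size, dtype=np.float32)

d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
bindings = [int(d_input), int(d_output)]
stream = cuda.Stream()

cuda.memcpy_htod_async(d_input, h_input, stream)
context.enqueue(batch_size, bindings, stream.handle, None)  # actual batch size
cuda.memcpy_dtoh_async(h_output, d_output, stream)
stream.synchronize()

results = h_output.reshape(batch_size, output_size)
```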
Thank you for reading.