I'm having some trouble understanding how batch inference works and how I can use it.
I'm using TensorRT to run inference with a U-Net shaped network model.
I want to perform the inference on a very big image. Therefore it is cut into patches with a typical size of 1024x1024 or 512x512 (depending on the U-Net used).
My patches are located inside a std::vector, i.e. a vector of pointers to the individual patch buffers. The image data is row-major.
For now I iterate through the vector. In each iteration I copy one patch buffer to my CUDA device, perform inference with batch size = 1, and copy the result back to host memory.
Inference works fine, but, as expected, performance is poor.
Now I want to improve performance by increasing the batch size.
But I can't figure out how to copy the patch buffers into device memory.
My idea is to allocate device memory of size
patchWidth * patchHeight * patchChannels * sizeof(datatype) * batchSize
and then copy n patch buffers into this contiguous memory from host to device (with batchSize = n).
How does the execution context access this contiguous memory? Will it determine the image/patch dimensions from the network's input dimensions?
Or do I have to allocate multiple input/output CUDA buffers for larger batch sizes (i.e. n input buffers for batchSize = n)?
Thanks in advance
TensorRT Version: 5
GPU Type: RTX 2070
CUDA Version: 10
Operating System + Version: Windows 7