How to organize memory for inference with batch size > 1


I have some trouble understanding how batch inference works and how I can use it.


I’m using TensorRT to run inference on a U-Net-shaped network.
I want to run inference on a very large image, so it is cut into patches with a typical size of 1024 x 1024 or 512 x 512 (depending on the U-Net used).
My patches are stored in a std::vector, i.e. the vector holds a pointer to each patch buffer. Image data is row-major.

For now I iterate through the vector. In each iteration I copy the patch buffer to my CUDA device, run inference with batch size = 1, and copy the result back to host memory.
Inference works fine but, as expected, performance is poor.

Now I want to improve performance by increasing the batch size.
But I can’t figure out how to copy the patch buffers into device memory.

My idea is to allocate device memory of size
patchWidth * patchHeight * patchChannels * sizeof(datatype) * batchSize
and then copy n buffers from host to device into this contiguous memory (with batchSize = n).
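
A minimal host-side sketch of this packing idea, assuming float data and that every patch is already in the layout the network input expects. The name packBatch and its parameters are hypothetical, not part of TensorRT or CUDA. On the device side the same offsets would apply: either one copy of the whole staging buffer, or one copy per patch to the device pointer plus packed * elemsPerPatch.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copies up to batchSize patches, starting at index
// `first`, back to back into one contiguous staging buffer.
// Returns the number of patches actually packed (smaller for the last batch).
// Assumption: every patch holds exactly elemsPerPatch floats.
std::size_t packBatch(const std::vector<const float*>& patches,
                      std::size_t first,
                      std::size_t batchSize,
                      std::size_t elemsPerPatch,
                      std::vector<float>& staging)
{
    // Room for a full batch: batchSize * width * height * channels elements.
    staging.resize(batchSize * elemsPerPatch);

    std::size_t packed = 0;
    for (; packed < batchSize && first + packed < patches.size(); ++packed) {
        // Patch number `packed` of this batch starts at packed * elemsPerPatch.
        std::memcpy(staging.data() + packed * elemsPerPatch,
                    patches[first + packed],
                    elemsPerPatch * sizeof(float));
    }
    return packed;
}
```

If the last batch is only partly filled, the unused slots at the end of the staging buffer still hold old or zero data, so only the first `packed` results would be meaningful.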

How does the execution context access this contiguous memory? Will it determine the image/patch dimensions from the network input dimensions?
Or do I have to allocate multiple input/output CUDA buffers for higher batch sizes (i.e. n input buffers for batchSize = n)?

Thanks in advance


TensorRT Version: 5
GPU Type: RTX 2070
CUDA Version: 10
Operating System + Version: Windows 7

TRT internally uses bindings[0] , bindings[1] , ... to access that memory, so each binding can be one huge contiguous buffer and TRT is fine with that.

In general, the calling sequence is:
cudaMalloc → bindings[0] , bindings[1] , …
cudaMemcpy → to bindings[0] , …
enqueue(batchSize, bindings, stream, ...)
Each binding includes the batch dimension, i.e. it has to be large enough to hold batchSize samples.
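
To make the sizes concrete, here is a small host-only sketch of the arithmetic behind that sequence. BindingInfo, sampleBytes, bindingBytes and sampleOffset are hypothetical names for illustration. In a real application the per-sample dimensions would come from the engine (for example via ICudaEngine::getBindingDimensions), and the sketch assumes the default linear layout, where the samples of a batch lie back to back.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one binding: the per-sample dimensions
// (e.g. C, H, W, without the batch dimension) and the element size in bytes.
struct BindingInfo {
    std::vector<std::size_t> dims;
    std::size_t elemSize;
};

// Bytes occupied by a single sample of this binding.
std::size_t sampleBytes(const BindingInfo& b)
{
    std::size_t n = b.elemSize;
    for (std::size_t d : b.dims) n *= d;
    return n;
}

// Bytes to pass to cudaMalloc for this binding: one slot per batch sample.
std::size_t bindingBytes(const BindingInfo& b, std::size_t batchSize)
{
    return batchSize * sampleBytes(b);
}

// Byte offset of sample i inside the binding (assuming a linear layout).
std::size_t sampleOffset(const BindingInfo& b, std::size_t i)
{
    return i * sampleBytes(b);
}

// Rough order of calls, as listed in the sequence above:
//   cudaMalloc(&bindings[k], bindingBytes(info[k], batchSize));  // once per binding
//   cudaMemcpy(..., cudaMemcpyHostToDevice);                      // fill the input binding
//   context->enqueue(batchSize, bindings, stream, nullptr);
//   cudaMemcpy(..., cudaMemcpyDeviceToHost);                      // read the output binding
```

So one input buffer and one output buffer are enough for any batch size up to the engine's maximum. Patch i is read from sampleOffset(input, i) and its result should appear at sampleOffset(output, i).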