Multiple batches in C++ API


So far I have been working with a neural network using a single batch size in the C++ API. Additionally, due to the way I am manipulating data in the rest of my program, I already have data in the GPU and am therefore not performing any host to device copies. My program structure is as follows:

int inputIndex = mEngine->getBindingIndex("Input");
int outputIndex = mEngine->getBindingIndex("Output");

assert(inputIndex == 0 || inputIndex == 1);
assert(outputIndex == 0 || outputIndex == 1);

// Double check that the layers above are really input and output
assert(mEngine->bindingIsInput(inputIndex) == true);
assert(mEngine->bindingIsInput(outputIndex) == false);

void* buf[2];
buf[inputIndex] = gpuTensorInput; // gpuTensorInput is a void*
buf[outputIndex] = tensorOutput; // tensorOutput is a void*
bool status = context->executeV2(buf);

Right now, I am feeding in a gpu pointer for the input and another gpu pointer for the output, for a single batch. buf feeds these pointers into the executeV2 method to run inference on the network. This is working fine.

Now my question is: what if I would like to increase the batch size? I will always do fixed-size inference (no need for dynamic input shapes), so my plan was to alter the input size in the ONNX model I import from (really just the first dimension). So correct me if I am wrong, but if the original input was (1,3,224,224) and I wanted a batch size of say 32, I would resize the dimensions of the input layer to be (32,3,224,224), and I would set the max batch size by doing something like:

params.batchSize = 32;

But I am unsure how to feed in pointers for these multiple batches as I do above. Would I have to feed in 32 pointers for the input/output tensors? Or, when batching, would I first have to put all the input data into a contiguous CUDA buffer of size 32*3*224*224 and give the pointer to its first element to the execute method (as opposed to the executeV2 method)?


TensorRT Version: 7.1
CUDA Version: 11.0
CUDNN Version: 8.0.2
Operating System + Version: Windows 10

Hi @solarflarefx,
I am checking on this and will get back to you shortly.
Thank you for your patience.


Hi @solarflarefx,

You should go with the second option.
You have to prepare input/output buffers of size batch * channel * height * width.
Please refer to the below example for the same


@AakankshaS Thanks for the reply.

Perhaps I have an atypical case here, but what about the situation when the first input and second input overlap in memory?

Again in my situation I already have the full input inside a large block of GPU memory. I have been feeding the input to the engine and getting an output as I have described in my original post (by just giving the pointer of the starting location of the input of the particular iteration).

In my situation, in terms of GPU memory, there is memory space overlap between say the first input and the second input. I am trying to illustrate this below:

Feeding the gpu input pointer for inference was not a problem for a batch size of 1, as I just fed the pointer of the starting memory block to the TensorRT engine. However, unless I am mistaken it seems that the TensorRT API expects inputs to be in unique non-overlapping memory spaces when using multibatching:

Am I wrong here? Or is there another way to feed multiple batches to the TensorRT engine? Something like feeding in a pointer for each input/output in the batch?

Hi @solarflarefx,
I am afraid we do not support input overlap. You should provide non-overlapping inputs if you want to use TRT's multiple-batch inference.

Thanks for your reply. So I tried multibatching: I fed in the batches, ran inference fewer times, and got the same result as with a single batch (I am looping over the whole input).

What doesn’t seem to make sense is that if I set the batch size to 2, the inference time per call doubles. I used Nsight to confirm this. I am not sure what I am doing wrong, but I will note that since my data is already on the GPU, I am not doing the following lines from the example you have shown:

// Memcpy from host input buffers to device input buffers

bool status = context->executeV2(buffers.getDeviceBindings().data());
if (!status)
    return false;

// Memcpy from device output buffers to host output buffers

Instead, I first allocate data to GPU memory and feed in the pointer as follows:

void* buf[2];
buf[inputIndex] = gpuTensorInput;
buf[outputIndex] = gpuTensorOutput;
bool status = context->executeV2(buf);

I do this within a loop, advancing gpuTensorInput and gpuTensorOutput each iteration. In the end I get the correct answer, and the same answer whether I use a batch size of 1 or 2.

In order to create a batch size of 2, I created a different .onnx file with fixed input size to reflect the batch size of 2. I then imported the .onnx model and created an engine making sure to set the following:

params.batchSize = 2;

In the end the code performs half as many inferences, but they are twice as slow. I am not sure what I am doing wrong here.

I have a feeling it has to do with passing in a pointer instead of using the bindings structure with host to device copies. But I am trying to avoid this overhead since my input is already in the GPU.

Hi @solarflarefx,
please reference this