How should batching be handled in TensorRT custom plugin implementations? (Does TensorRT create separate CUDA streams for each batch?)

I have written a custom TensorRT plugin that calls a custom CUDA kernel operating on one batch element at a time (each kernel launch computes the output for a single batch index). Inside my plugin's enqueue function, I invoke the kernel batchSize times in a for loop, as in the pseudo-code below:

int CustomPlugin::enqueue(int batchSize, const void *const *inputs, void **outputs, void *workspace, cudaStream_t stream) {
    size_t inputOffset = … ;
    size_t outputOffset = … ;
    for (int i = 0; i < batchSize; i++) {
        // Run the kernel separately for each batch index, on the stream TensorRT provides.
        launchCustomCudaKernel(inputs[0] + i * inputOffset, outputs[0] + i * outputOffset, stream);
    }
    return 0;
}

The pseudo-code above launches the CUDA kernel once per batch index, which does not seem optimal. So, if I want my custom plugin to run optimally, do I need to re-implement my CUDA kernel so that it handles multiple batches internally (i.e. a single launch processes the whole batch)? Or does TensorRT internally create a separate CUDA stream for each batch index, so that each batch element runs on its own stream, in which case batchSize would always equal 1?


For optimal performance, I would recommend re-implementing your custom plugin's CUDA kernel so that a single launch handles all batch elements. TensorRT does not create a separate stream per batch index: enqueue is called once with the full batchSize, and all work should be issued on the single stream it passes in.
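As a minimal sketch of what that re-implementation could look like, the batch index can be folded into the launch grid so one launch covers the whole batch. The kernel name `customKernelBatched`, the member `mElementsPerBatch`, and the elementwise computation are all hypothetical placeholders, not part of your plugin:

```cuda
// Hypothetical batched kernel: each thread handles one element of one
// batch item; the flattened thread index spans elementsPerBatch * batchSize.
__global__ void customKernelBatched(const float *input, float *output,
                                    int elementsPerBatch, int batchSize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = elementsPerBatch * batchSize;
    if (idx < total) {
        // int b = idx / elementsPerBatch;  // recover the batch index if needed
        output[idx] = input[idx];  // placeholder for the real per-element computation
    }
}

// Inside enqueue: a single launch on the stream TensorRT provides,
// assuming one float input/output tensor of mElementsPerBatch elements each.
int CustomPlugin::enqueue(int batchSize, const void *const *inputs,
                          void **outputs, void *workspace, cudaStream_t stream) {
    const float *in = static_cast<const float *>(inputs[0]);
    float *out = static_cast<float *>(outputs[0]);
    int total = mElementsPerBatch * batchSize;  // mElementsPerBatch: assumed plugin member
    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    customKernelBatched<<<blocks, threads, 0, stream>>>(in, out, mElementsPerBatch, batchSize);
    return cudaPeekAtLastError() == cudaSuccess ? 0 : -1;
}
```

This avoids the per-batch launch overhead and exposes the full batch's parallelism to the GPU in one launch, instead of serializing batchSize small launches on the same stream.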