Batch execution of TRT model


I use YOLOv8 for detection. The model is converted to TRT and runs successfully on the Jetson Nano GPU.
Next, I would like to run inference on a batch of 3 images. Is there a tutorial on how to modify my code, or at least which functions I should use for batch inference?

Simple inference of YOLOv8 on the Jetson Nano:
void YOLOv8::infer()
{
    // Upload the preprocessed NCHW blob to the input binding.
    CHECK(cudaMemcpyAsync(this->device_ptrs[0], nchw.ptr(), nchw.total() * nchw.elemSize(),
                          cudaMemcpyHostToDevice, this->stream));
    // Run inference asynchronously on the stream.
    this->context->enqueueV2(this->device_ptrs.data(), this->stream, nullptr);
    // Copy every output binding back to host memory.
    for (int i = 0; i < this->num_outputs; i++) {
        size_t osize = this->output_bindings[i].size * this->output_bindings[i].dsize;
        CHECK(cudaMemcpyAsync(this->host_ptrs[i], this->device_ptrs[i + this->num_inputs], osize,
                              cudaMemcpyDeviceToHost, this->stream));
    }
    cudaStreamSynchronize(this->stream);
}




The TensorRT environment supported by the Jetson Nano is JetPack 4.6.4 (production release): TRT, CUDA 10.2.300, cuDNN, L4T 32.7.4, Ubuntu 18.04, OpenCV 4.1.1.

Relevant Files

The source code above was customized from this public repo: GitHub - triple-Mu/YOLOv8-TensorRT: YOLOv8 using TensorRT accelerate !

Hi @fadoughou ,
The C++ and Python APIs are designed for batch input. The IExecutionContext::execute (IExecutionContext.execute in Python) and IExecutionContext::enqueue (IExecutionContext.execute_async in Python) methods take an explicit batch size parameter. The maximum batch size should also be set on the builder when building the optimized network, with IBuilder::setMaxBatchSize (Builder.max_batch_size in Python).

When calling IExecutionContext::execute or enqueue, the bindings passed as the bindings parameter are organized per tensor, not per instance. In other words, the data for one input instance is not grouped together into one contiguous region of memory. Instead, each tensor binding is an array of instance data for that tensor.

Another consideration is that building the optimized network optimizes for the given maximum batch size. The final result is tuned for the maximum batch size but will still work correctly for any smaller batch size. It is possible to run multiple build operations to create multiple optimized engines for different batch sizes, then choose which engine to use based on the actual batch size at runtime.
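Putting the pieces above together, the implicit-batch flow looks roughly like this. This is a non-runnable sketch against the TensorRT 8.x C++ API; `builder`, `network`, `config`, `context`, `device_ptrs`, and `stream` are assumed to exist as in the repo code, and error handling is omitted:

```cpp
// Build time: declare the largest batch the engine must support.
// The engine is tuned for this size but accepts any smaller batch.
builder->setMaxBatchSize(3);
nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

// Run time: bindings are organized per tensor, so device_ptrs[0] must hold
// batchSize contiguous input instances, and each output binding must be
// sized for batchSize instances as well.
const int batchSize = 3;
context->execute(batchSize, device_ptrs.data());  // synchronous
// or, asynchronously on a stream:
// context->enqueue(batchSize, device_ptrs.data(), stream, nullptr);
```

Note that the infer() snippet in the question uses enqueueV2, which belongs to the explicit-batch path: there the batch dimension comes from the network's input shape (e.g. 3x3x640x640) rather than from a batch size parameter, so the two approaches should not be mixed in one engine.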