Run inference on a batch of images & parallel inference using CUDA on Python threads


Hello,

I have an Nvidia Xavier, and I've managed to convert SSD MobileNet V2 to .trt and run inference following the steps in the link below:

I have two inquiries:
-Is it possible to run inference on a batch of images all at once, and how can this be done in Python? The code in the above link only does this for a single image at a time.
-Is it possible to run parallel inference using CUDA on Python threads (I tried to do this but got a broken pipe error)? I want to run multiple threads or processes, each doing an inference.
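For context on the first question, the host-side half of batch inference is stacking N preprocessed images into one contiguous NCHW array before the device copy. A minimal sketch of that step follows; the 300x300 input size and the /255 normalization are assumptions for illustration, not taken from the SSD pipeline in the link:

```python
import numpy as np

def make_batch(images, height=300, width=300):
    """Stack preprocessed HWC uint8 images into one contiguous NCHW
    float32 batch (the layout TensorRT engines typically expect).
    300x300 matches SSD MobileNet V2's usual input, but treat it as
    an assumption here."""
    batch = np.zeros((len(images), 3, height, width), dtype=np.float32)
    for n, img in enumerate(images):
        assert img.shape == (height, width, 3)
        # HWC -> CHW, scale to [0, 1]
        batch[n] = img.astype(np.float32).transpose(2, 0, 1) / 255.0
    return np.ascontiguousarray(batch)
```

The resulting array is what would be handed to a single host-to-device copy for the whole batch.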



TensorRT Version:
GPU Type:
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

The below link might be useful for you.
For multi-threading/streaming, we suggest you use DeepStream or Triton.
For more details, we recommend raising the query on the DeepStream or Triton forum.
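As a side note on the threading question: PyCUDA contexts are bound to the thread that created them, so a common workaround for broken-pipe/context errors is to have one dedicated worker thread own the engine while other threads submit work through a queue. The sketch below uses a plain callable as a stand-in for real TensorRT inference, so the `InferenceWorker` name and structure are illustrative only:

```python
import queue
import threading

class InferenceWorker:
    """One thread owns the engine; other threads submit work via a queue.
    The engine here is any callable -- a stub stands in for real TRT
    inference, whose CUDA context would be pushed once inside _loop."""

    def __init__(self, engine_fn):
        self.engine_fn = engine_fn
        self.jobs = queue.Queue()
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def _loop(self):
        # In real code: push the CUDA context here, once, on this thread.
        while True:
            batch, result_q = self.jobs.get()
            if batch is None:  # shutdown sentinel
                break
            result_q.put(self.engine_fn(batch))

    def infer(self, batch):
        # Safe to call from any thread; blocks until the result is ready.
        result_q = queue.Queue()
        self.jobs.put((batch, result_q))
        return result_q.get()

    def stop(self):
        self.jobs.put((None, None))
        self.thread.join()
```

Callers on any thread can then invoke `worker.infer(batch)` concurrently without ever touching the CUDA context themselves.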


Thank you for the links, I will look into them. I just want to know if my inquiries are possible.



Please refer to the following link on dynamic-shape inputs to enable a dynamic batch size.
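One practical detail with dynamic shapes: the engine reports the batch dimension as -1, so device buffers have to be sized for the largest batch you intend to run. A small hedged helper for that byte-size arithmetic (pure host-side; the shapes used below are made up):

```python
import numpy as np

def buffer_bytes(binding_shape, np_dtype, max_batch):
    """Byte size of a device buffer for a binding whose shape may
    contain a dynamic (-1) batch dimension, sized for max_batch."""
    size = np.dtype(np_dtype).itemsize
    for s in binding_shape:
        size *= max_batch if s == -1 else s
    return size
```

The returned size is what you would pass to the device allocation call for that binding.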

Thank you.

Thanks for the reply.
I still didn't get a clear response to my inquiry, and I don't see how the above link can help me. I tried to do batch inference using a CUDA stream, but I only got results for the first image; the rest of the images come out as zeros. I'm using TensorRT 8.0 on the Xavier. Is batch inference possible using Python? I'm using the following code; can you please check it and let me know how I can achieve batch inference?

class TensorRTInfer:
    """Implements inference for the Model TensorRT engine."""

    def __init__(self, engine, batch_size):
        """
        :param engine: The deserialized TensorRT engine.
        :param batch_size: The number of images per inference batch.
        """
        # Load TRT engine
        self.cfx = cuda.Device(0).make_context()
        self.stream = cuda.Stream()
        self.engine = engine
        self.batch_size = batch_size
        self.context = self.engine.create_execution_context()

        # Setup I/O bindings
        self.inputs1 = []
        self.outputs1 = []
        self.allocations1 = []

        for i in range(self.engine.num_bindings):
            name = self.engine.get_binding_name(i)
            dtype = self.engine.get_binding_dtype(i)
            shape = self.engine.get_binding_shape(i)
            size = np.dtype(trt.nptype(dtype)).itemsize * batch_size
            for s in shape:
                size *= s
            allocation1 = cuda.mem_alloc(size)

            binding1 = {
                'index': i,
                'name': name,
                'dtype': np.dtype(trt.nptype(dtype)),
                'shape': list(shape),
                'allocation': allocation1,
            }
            self.allocations1.append(allocation1)
            if self.engine.binding_is_input(i):
                self.inputs1.append(binding1)
            else:
                self.outputs1.append(binding1)

        # Pre-allocate host output buffers sized for the whole batch
        self.outputs2 = []
        for shape, dtype in self.output_spec():
            shape[0] = shape[0] * batch_size
            self.outputs2.append(np.zeros(shape, dtype))
        print("done building..")

    def input_spec(self):
        """
        Get the specs for the input tensor of the network. Useful to prepare memory allocations.
        :return: Two items, the shape of the input tensor and its (numpy) datatype.
        """
        return self.inputs1[0]['shape'], self.inputs1[0]['dtype']

    def output_spec(self):
        """
        Get the specs for the output tensors of the network. Useful to prepare memory allocations.
        :return: A list with two items per element, the shape and (numpy) datatype of each output tensor.
        """
        specs = []
        for o in self.outputs1:
            specs.append((o['shape'], o['dtype']))
        return specs

    def h_to_d(self, batch):
        """Copy the preprocessed batch from host to device memory."""
        self.batch = batch
        cuda.memcpy_htod_async(self.inputs1[0]['allocation'],
                               np.ascontiguousarray(batch), self.stream)

    def infer_this(self):
        """Run inference asynchronously on this instance's stream."""
        self.cfx.push()
        self.context.execute_async_v2([int(a) for a in self.allocations1],
                                      self.stream.handle)
        self.cfx.pop()

    def d_to_h(self):
        """Copy the inference results back from device to host memory."""
        for o in range(len(self.outputs2)):
            cuda.memcpy_dtoh_async(self.outputs2[o],
                                   self.outputs1[o]['allocation'], self.stream)
        self.stream.synchronize()
        return self.outputs2

    def destroy(self):
        """Release the CUDA context created in __init__."""
        self.cfx.pop()

Any update, please?



It looks like your code is not handling batch inference properly.
The previous link I shared shows how to give a batch size (greater than 1) dynamically.
Please refer to the sample below to run inference on a batch of images.
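As a generic illustration (not the sample referred to above), the driver loop for batched inference usually walks the dataset in fixed-size chunks and pads the final chunk so every execution sees a full batch. A small host-side sketch of that iteration, with the actual GPU calls left out since they need a live engine:

```python
import numpy as np

def iter_batches(images, batch_size):
    """Yield (batch_array, valid_count) pairs, padding the final chunk
    by repeating its last image so every inference call sees a full
    batch; valid_count tells the caller how many results to keep."""
    for start in range(0, len(images), batch_size):
        chunk = list(images[start:start + batch_size])
        valid = len(chunk)
        if valid < batch_size:
            chunk += [chunk[-1]] * (batch_size - valid)
        yield np.stack(chunk), valid
```

Each yielded array would go through one host-to-device copy, one execute call, and one device-to-host copy, with only the first `valid` results kept from the padded final batch.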

Thank you.