Inference on multiple images with TensorRT

Description

Single-image inference works, but when looping over multiple images the output of the first image is repeated for every image that follows.

Environment

TensorRT Version: 7.0
GPU Type: RTX 2080
Nvidia Driver Version: 440.64.00
CUDA Version: 10.2
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorrt:20.03-py3

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

I can successfully run inference on a single image, but as soon as I loop through a list of images, the output of the first image is repeated for all the images that follow. Below is the relevant code:

# Imports used by the snippets below; LoadImages and opt come from the
# surrounding YOLOv5 detection script (not shown). EXPLICIT_BATCH and GiB are
# defined as in the standard TensorRT Python samples.
import os
import time

import numpy as np
import torch
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context on the default device

EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

def GiB(val):
    return val * 1 << 30

# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]


TRT_LOGGER = trt.Logger()


def get_engine(engine_path):
    # If a serialized engine exists, use it instead of building an engine.
    print("Reading engine from file {}".format(engine_path))
    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


def build_engine(onnx_path, using_half, engine_file="yolov5_1_fp32_common.engine"):
    if os.path.exists(engine_file):
        return get_engine(engine_file)
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(EXPLICIT_BATCH) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_batch_size = 1 # always 1 for explicit batch
        config = builder.create_builder_config()
        config.max_workspace_size = GiB(1)
        if using_half:
            config.set_flag(trt.BuilderFlag.FP16)
        # Load the Onnx model and parse it in order to populate the TensorRT network.
        with open(onnx_path, 'rb') as model:
            if not parser.parse(model.read()):
                print ('ERROR: Failed to parse the ONNX file.')
                for error in range(parser.num_errors):
                    print (parser.get_error(error))
                return None
        return builder.build_engine(network, config)

def detect_yolo(engine, context, buffers, image_src, image_size):
    IN_IMAGE_H, IN_IMAGE_W = 640, 640
    dataset = LoadImages(image_src, img_size=image_size)

    for path, img, im0s, vid_cap in dataset:
        print(path)
        input_img = img.astype(np.float)
        input_img /= 255.0
        input_img = np.expand_dims(input_img, axis=0)

        img = torch.from_numpy(input_img).float().numpy()
        # print(img.shape)
        trt_output = detect(engine, context, buffers, img)


def detect(engine, context, buffers, img_in):
    ta = time.time()
    print("Shape of the network input: ", img_in.shape)
    # print(img_in)
    inputs, outputs, bindings, stream = buffers
    print('Length of inputs: ', len(inputs))
    inputs[0].host = img_in
    print(outputs[-1])
    trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)

    print('Len of outputs: ', len(trt_outputs))
    num_classes = 80
    # print(trt_outputs)
    trt_output = trt_outputs[0].reshape(1, -1, 5 + num_classes)

    tb = time.time()

    print(trt_output.shape)

    print('-----------------------------------')
    print('    TRT inference time: %f' % (tb - ta))
    print('-----------------------------------')
    return trt_output


def main():
    """
    """
    using_half = False
    with build_engine(opt.onnx, using_half) as engine, engine.create_execution_context() as context:
        buffers = allocate_buffers(engine)

        detect_yolo(engine, context, buffers, opt.source, opt.img_size)

Thanks in advance. I want to run inference on multiple images successfully. I think this has something to do with context.enqueue, but I am not quite sure.
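One thing that may be worth checking here (an assumption, not a confirmed fix from this thread): the line inputs[0].host = img_in replaces the pagelocked buffer created in allocate_buffers with an ordinary numpy array, so the async host-to-device copies no longer go through the pinned memory that was allocated once in main. A sketch of the alternative, copying each new image into the existing pinned buffer and reusing the names from the code above:

# Sketch (assumption, not a confirmed fix): copy each preprocessed image into
# the pinned host buffer allocated by allocate_buffers() instead of rebinding
# inputs[0].host to a new array.
def detect(engine, context, buffers, img_in):
    inputs, outputs, bindings, stream = buffers
    # img_in must be float32 and match the size of the input binding.
    np.copyto(inputs[0].host, img_in.ravel())
    trt_outputs = do_inference(context, bindings=bindings, inputs=inputs,
                               outputs=outputs, stream=stream)
    num_classes = 80
    return trt_outputs[0].reshape(1, -1, 5 + num_classes)

This keeps the buffers allocated once in main() valid across the whole loop in detect_yolo().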

Hi @y14uc339,
TRT >= 7 requires EXPLICIT_BATCH for ONNX; for a fixed-shape model, the batch size is fixed.
You may refer to the link below for the same.
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes
However, please share your model file to reproduce the issue so we can help better.
Thanks!
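For reference, a sketch of the dynamic-shape route that the linked documentation describes, under the assumption that the exported ONNX input is named "images" and has shape [-1, 3, 640, 640] with a dynamic batch dimension (the actual tensor name depends on how the model was exported):

# Sketch: build an engine with a dynamic batch dimension via an optimization
# profile (assumes an ONNX input named "images" of shape [-1, 3, 640, 640]).
def build_dynamic_engine(onnx_path, max_batch=4):
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(EXPLICIT_BATCH) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        config = builder.create_builder_config()
        config.max_workspace_size = GiB(1)
        with open(onnx_path, 'rb') as model:
            if not parser.parse(model.read()):
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                return None
        profile = builder.create_optimization_profile()
        # min / opt / max shapes for the batch dimension.
        profile.set_shape("images", (1, 3, 640, 640),
                          (1, 3, 640, 640), (max_batch, 3, 640, 640))
        config.add_optimization_profile(profile)
        return builder.build_engine(network, config)

At inference time the input shape would still have to be set on the context, e.g. context.set_binding_shape(0, (n, 3, 640, 640)), before executing; trtexec can build an equivalent engine with the --minShapes/--optShapes/--maxShapes options.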

@AakankshaS
trtexec --onnx=onnx_file --explicitBatch --saveEngine=save file
I used the above command to create the engine.
I think I used explicitBatch, but when I run inference on multiple images in a loop, the output of the first image is copied to all those that follow. I created the ONNX with batch size 1.

I can share the engine file if you want?

Hi @y14uc339, kindly share the engine file.
Thanks!

Hi @AakankshaS, Please find the engine file:

@AakankshaS Also, I see this in the terminal after the first image is inferenced, for all images that follow:

[TensorRT] WARNING: Explicit batch network detected and batch size specified, use enqueue without batch size instead.
[TensorRT] ERROR: ../rtExt/cuda/slice.cu (141) - Cuda Error in launchNaiveSliceImpl: 400 (invalid resource handle)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception
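The warning above points at the execute_async(batch_size=...) call in do_inference: with an explicit batch network, the v2 execution API, which takes no batch size, is expected instead. A minimal sketch of that variant, assuming the same buffer layout as above:

# Sketch: explicit-batch execution without a batch_size argument, using
# execute_async_v2 instead of execute_async.
def do_inference_v2(context, bindings, inputs, outputs, stream):
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    stream.synchronize()
    return [out.host for out in outputs]

Whether this change also accounts for the slice.cu error is not confirmed in this thread.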

@AakankshaS Is it a bug or something? I cannot really use dynamic shapes with ONNX because onnx-simplifier doesn't support dynamic shapes! But inference on multiple images shouldn't be a problem.

Hi,
Apologies for the late response.
Can you please share your ONNX model?

Thanks!