Batch Inference Wrong in Python API

I have a TF model saved as a .pb file, converted it to ONNX with input shape (1, 112, 112, 3), and then used onnx2trt to generate a model.trt engine file. When I use the Python API to run inference with the engine, it works fine when I pass only a single image. But when I set batch_size > 1 (e.g. 5), the outputs for every sample except the first are all zeros.
By the way, if I comment out the line #inputs[0].host = images, which means no input is passed at all, there is still output for the first sample.

import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit # must be imported after pycuda.driver above

trt_engine_path = 'model.trt'

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def allocate_buffers(engine):
    """
    Allocates all buffers required for the specified engine
    """
    inputs = []
    outputs = []
    bindings = []
    # Iterate over binding names in engine
    for binding in engine:
        # Get binding (tensor/buffer) size
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        # Get binding (tensor/buffer) data type (numpy-equivalent)
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate page-locked memory (i.e., pinned memory) buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        # Allocate linear piece of device memory
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings
        bindings.append(int(device_mem))
        # Append to the inputs/outputs list
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    # Create a stream (to eventually copy inputs/outputs and run inference)
    stream = cuda.Stream()
    return inputs, outputs, bindings, stream

def infer(context, bindings, inputs, outputs, stream, batch_size=1):
    """
    Infer outputs on the IExecutionContext for the specified inputs
    """
    # Transfer input data to the GPU
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference
    flag = context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    if flag:
        print('executed successfully.')
    # Transfer predictions back from the GPU
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return the host outputs
    return [out.host for out in outputs]


# Read the serialized ICudaEngine
with open(trt_engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
    # Deserialize ICudaEngine
    engine = runtime.deserialize_cuda_engine(f.read())

print('engine.has_implicit_batch_dimension:',engine.has_implicit_batch_dimension)

# Now just as with the onnx2trt samples...
# Create an IExecutionContext (context for executing inference)
with engine.create_execution_context() as context:
    # Allocate memory for inputs/outputs
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    # Set host input to the image
    images = np.random.rand(5, 112, 112, 3).astype(np.float32)
    #inputs[0].host = images
    # Inference
    trt_outputs = infer(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream, batch_size=5)
    # Prediction
    #print(len(trt_outputs))
    print(trt_outputs[0].shape)
    #print(trt_outputs[0][:200])
    rt = np.reshape(trt_outputs[0], (32,-1))
    print(rt.shape)
    print(rt[:,:5])

Hi,
Can you provide the following information so we can better help?
Provide details on the platforms you are using:
o Linux distro and version
o GPU type
o Nvidia driver version
o CUDA version
o CUDNN version
o Python version [if using python]
o Tensorflow version
o TensorRT version
o If Jetson, OS, hw versions

Also, if possible please share the script and model file to reproduce the issue.

Thanks

Linux version: Ubuntu 16.04.5
GPU type: 1080Ti
Nvidia driver version: 410.93
CUDA version: 10.0
CUDNN version: 7.6.3
Python version: 3.6
TensorFlow version: 1.9
TensorRT version: 7.0

model.trt file link: https://pan.baidu.com/s/1CM0sIrXxN9cCipYT5eaIuA
extract code:5qz0

The script is above; uncomment the line #inputs[0].host = images if using the data.

Thank you very much. This issue has bothered me for so long. Thanks again!
model.trt.zip (10.7 MB)

I am not able to access the model. Could you please send it as a zip attachment in the forum?

Thanks

Hi, I’ve uploaded it again. Please check it. Thanks for your prompt reply.
model.trt.zip (10.7 MB)

Hi,

The model you shared is a .trt engine file; in order to reproduce the issue we will need the ONNX model to regenerate the engine. Could you please share the ONNX model file and the onnx2trt conversion command?

Also, an engine can be reused for a new input if the engine's batch size is greater than or equal to the batch size of the new input.
You can also make use of optimization profiles if you need to support multiple input dimensions; a sketch of the builder-side changes follows the link below. Please refer to the link below for more details:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-700/tensorrt-developer-guide/index.html#opt_profiles
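For illustration, a minimal sketch of adding an optimization profile with the TensorRT 7 Python API. It assumes the ONNX model's batch dimension is dynamic and that builder, network and config come from the standard ONNX parsing flow; the shape values are only examples:

profile = builder.create_optimization_profile()
input_name = network.get_input(0).name  # the (batch, 112, 112, 3) input
# min / opt / max shapes the engine should support
profile.set_shape(input_name, (1, 112, 112, 3), (4, 112, 112, 3), (8, 112, 112, 3))
config.add_optimization_profile(profile)
engine = builder.build_engine(network, config)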

Thanks

With the ONNX file model.onnx, I simply use the conversion command with default parameters:

onnx2trt model.onnx -o model.trt

model.onnx.zip (10.8 MB)

Hi,

Since the engine is built from an ONNX model with a fixed input shape, I do not think you can change the batch size any more.
Also, in TRT 7 the ONNX parser supports full-dimensions mode only. Your network definition must be created with the explicitBatch flag set; see the sketch below.
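For example, a rough sketch of creating an explicit-batch network and parsing the ONNX model with the Python API (the file name is a placeholder):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network(EXPLICIT_BATCH) as network, \
     trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open('model.onnx', 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))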

Thanks

Sorry, I might not fully get your point. I'm confused about what full-dimensions mode means. Do you mean I should generate the ONNX model with an input shape of (-1, 112, 112, 3), i.e. with the first dimension set to -1? I've also looked into the onnx2trt source for version 7.0, and it does create the network with the explicitBatch flag set.

Is the key point how I create the ONNX file? Is setting the shape to (1, 112, 112, 3) not correct?

Thanks again!

Hi,

I’ve tried some C++ samples provided by TensorRT, and I found something strange.
When I tried sampleOnnxMNIST, which uses the ONNX parser to construct the network and engine, the same problem happened during batch inference.
When I tried sampleMNISTAPI, which uses the TensorRT Network API to construct the network and engine, batch inference works fine. The difference is that its input is defined as Dims3, without a batch dimension.

So the problem is indeed as you mentioned: I should not use an ONNX model with a fixed input shape. But here comes another issue: how do I create an ONNX model whose input has only 3 dimensions? A model trained in another framework has a 4-dimensional input, and so does the converted ONNX model.

Must I use the Network API to reconstruct the network if I want to use batch inference?

Please tell me more about the details.

Really appreciate your help and thanks a lot!!

Hi,

Please refer to the link below in case it helps:
https://github.com/onnx/onnx/issues/654#issuecomment-410538671
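If the goal is to make the batch dimension of an already exported ONNX model dynamic, one possible approach is to rewrite the input's first dimension with the onnx Python package (a sketch, not necessarily what the linked comment does; the dimension name 'batch' and the file names are arbitrary):

import onnx

model = onnx.load('model.onnx')
# Replace the fixed batch size with a symbolic dimension on the first input
model.graph.input[0].type.tensor_type.shape.dim[0].dim_param = 'batch'
onnx.save(model, 'model_dynamic_batch.onnx')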

Thanks

Hi,
I think I found the problem.
Since I use TensorRT 7.0, the OnnxParser does not accept an ONNX model with input shape (None, h, w, c) when the network is created with builder.create_network() without the explicit batch flag.
With TensorRT 6.0, I can parse an ONNX model whose batch size is None without passing explicit_batch to builder.create_network().

So, may I confirm that TensorRT 7.0 only supports full-dimensions mode with explicit batch?

Thanks~

ONNX parser with dynamic shapes support:
The ONNX parser supports full-dimensions mode only. Your network definition must be created with the explicitBatch flag set.

Please refer to the link below for more details:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-700/tensorrt-release-notes/tensorrt-7.html#rel_7-0-0
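As a rough sketch, assuming the engine is rebuilt from an explicit-batch network with an optimization profile, inference then sets the actual input shape on the context and uses execute_async_v2 instead of the batch_size argument (the host/device buffers must be sized for the largest shape in the profile):

context.set_binding_shape(0, (5, 112, 112, 3))  # actual shape of this batch
[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
stream.synchronize()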

Thanks

I ran into the same error. Has anyone found a solution to this problem?

It’s a problem with TensorRT 7.0. Please use 6.0.