Batch Inference Wrong in Python API

I have a TF model saved as a .pb file, converted it to ONNX with input shape (1, 112, 112, 3), and then used onnx2trt to generate a model.trt engine file. When I use the Python API to run inference with the engine, it works fine when I pass only a single image. But when I set batch_size > 1 (e.g. 5), the outputs for every sample except the first are all zeros.
By the way, if I comment out the line #inputs[0].host = images, which means no input is passed at all, there is still output for the first sample.

import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit # must be imported after pycuda.driver above

trt_engine_path = 'model.trt'

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def allocate_buffers(engine):
    """
    Allocates all buffers required for the specified engine
    """
    inputs = []
    outputs = []
    bindings = []
    # Iterate over binding names in engine
    for binding in engine:
        # Get binding (tensor/buffer) size
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        # Get binding (tensor/buffer) data type (numpy-equivalent)
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate page-locked memory (i.e., pinned memory) buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        # Allocate linear piece of device memory
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings
        bindings.append(int(device_mem))
        # Append to the inputs/outputs list
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    # Create a stream (to eventually copy inputs/outputs and run inference)
    stream = cuda.Stream()
    return inputs, outputs, bindings, stream

def infer(context, bindings, inputs, outputs, stream, batch_size=1):
    """
    Infer outputs on the IExecutionContext for the specified inputs
    """
    # Transfer input data to the GPU
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference
    flag = context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    if flag:
        print('executed successfully.')
    # Transfer predictions back from the GPU
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return the host outputs
    return [out.host for out in outputs]


# Read the serialized ICudaEngine
with open(trt_engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
    # Deserialize ICudaEngine
    engine = runtime.deserialize_cuda_engine(f.read())

print('engine.has_implicit_batch_dimension:',engine.has_implicit_batch_dimension)

# Now just as with the onnx2trt samples...
# Create an IExecutionContext (context for executing inference)
with engine.create_execution_context() as context:
    # Allocate memory for inputs/outputs
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    # Set host input to the image
    images = np.random.rand(5, 112, 112, 3).astype(np.float32)
    #inputs[0].host = images
    # Inference
    trt_outputs = infer(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream, batch_size=5)
    # Prediction
    #print(len(trt_outputs))
    print(trt_outputs[0].shape)
    #print(trt_outputs[0][:200])
    rt = np.reshape(trt_outputs[0], (32,-1))
    print(rt.shape)
    print(rt[:,:5])

Hi,
Can you provide the following information so we can better help?
Provide details on the platforms you are using:
o Linux distro and version
o GPU type
o Nvidia driver version
o CUDA version
o CUDNN version
o Python version [if using python]
o Tensorflow version
o TensorRT version
o If Jetson, OS, hw versions

Also, if possible please share the script and model file to reproduce the issue.

Thanks

Linux version: Ubuntu 16.04.5
GPU type: 1080Ti
Nvidia driver version: 410.93
CUDA version: 10.0
CUDNN version: 7.6.3
Python version: 3.6
TensorFlow version: 1.9
TensorRT version: 7.0

model.trt file link: https://pan.baidu.com/s/1CM0sIrXxN9cCipYT5eaIuA
extract code:5qz0

The script is above; uncomment the line #inputs[0].host = images if using the data.

Thank you very much. This issue has bothered me for so long. Thanks again!
model.trt.zip (10.7 MB)

I am not able to access the model. Could you please send it as a zip attachment in the forum?

Thanks

Hi, I’ve uploaded it again. Please check it. Thanks for your prompt reply.
model.trt.zip (10.7 MB)

Hi,

The model you shared is a .trt engine file; in order to reproduce the issue we will need the ONNX model to regenerate the engine. Could you please share the ONNX model file and the onnx2trt conversion command?

Also, an engine can be reused for a new input if the engine's batch size is greater than or equal to the batch size of the new input.
You can also make use of optimization profiles if you need to support multiple input dimensions; a sketch of the builder-side changes follows the link below. Please refer to the link below for more details:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-700/tensorrt-developer-guide/index.html#opt_profiles
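For illustration, a minimal sketch of adding an optimization profile with the TensorRT 7 Python API. It assumes the ONNX model's batch dimension is dynamic and that builder, network and config come from the standard ONNX parsing flow; the shape values are only examples:

profile = builder.create_optimization_profile()
input_name = network.get_input(0).name  # the (batch, 112, 112, 3) input
# min / opt / max shapes the engine should support
profile.set_shape(input_name, (1, 112, 112, 3), (4, 112, 112, 3), (8, 112, 112, 3))
config.add_optimization_profile(profile)
engine = builder.build_engine(network, config)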

Thanks

With the ONNX file model.onnx, I simply use the conversion command with default parameters:

onnx2trt model.onnx -o model.trt

model.onnx.zip (10.8 MB)

Hi,

Since the engine is built from an ONNX model with a fixed input shape, I do not think you can change the batch size any more.
Also, in TRT 7 the ONNX parser supports full-dimensions mode only. Your network definition must be created with the explicitBatch flag set; see the sketch below.
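For example, a rough sketch of creating an explicit-batch network and parsing the ONNX model with the Python API (the file name is a placeholder):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network(EXPLICIT_BATCH) as network, \
     trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open('model.onnx', 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))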

Thanks

Sorry, I might not fully get your point. I'm confused about what full-dimensions mode means. Do you mean I should generate the ONNX model with an input shape of (-1, 112, 112, 3), i.e. with the first dimension set to -1? I've also looked into the onnx2trt source for version 7.0, and it does create the network with the explicitBatch flag set.

Is the key point how I create the ONNX file? Is setting the shape to (1, 112, 112, 3) not correct?

Thanks again!

Hi,

I’ve tried some C++ samples provided by TensorRT, and I found something strange.
When I tried sampleOnnxMNIST, which uses the ONNX parser to construct the network and engine, the same problem happened during batch inference.
When I tried sampleMNISTAPI, which uses the TensorRT Network API to construct the network and engine, batch inference works fine. The difference is that its input is defined as Dims3, without a batch dimension.

So the problem is indeed as you mentioned: I should not use an ONNX model with a fixed input shape. But here comes another issue: how do I create an ONNX model whose input has only 3 dimensions? A model trained in another framework has a 4-dimensional input, and so does the converted ONNX model.

Must I use the Network API to reconstruct the network if I want to use batch inference?

Please tell me more about the details.

Really appreciate your help and thanks a lot!!

Hi,

Please refer to the link below in case it helps:
https://github.com/onnx/onnx/issues/654#issuecomment-410538671
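If the goal is to make the batch dimension of an already exported ONNX model dynamic, one possible approach is to rewrite the input's first dimension with the onnx Python package (a sketch, not necessarily what the linked comment does; the dimension name 'batch' and the file names are arbitrary):

import onnx

model = onnx.load('model.onnx')
# Replace the fixed batch size with a symbolic dimension on the first input
model.graph.input[0].type.tensor_type.shape.dim[0].dim_param = 'batch'
onnx.save(model, 'model_dynamic_batch.onnx')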

Thanks

Hi,
I think I found the problem.
Since I use TensorRT 7.0, the OnnxParser does not accept an ONNX model with input shape (None, h, w, c) when the network is created with builder.create_network() without the explicit batch flag.
With TensorRT 6.0, I can parse an ONNX model whose batch size is None without passing explicit_batch to builder.create_network().

So, may I confirm that TensorRT 7.0 only supports full-dimensions mode with explicit batch?

Thanks~

ONNX parser with dynamic shapes support:
The ONNX parser supports full-dimensions mode only. Your network definition must be created with the explicitBatch flag set.

Please refer to the link below for more details:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-700/tensorrt-release-notes/tensorrt-7.html#rel_7-0-0
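As a rough sketch, assuming the engine is rebuilt from an explicit-batch network with an optimization profile, inference then sets the actual input shape on the context and uses execute_async_v2 instead of the batch_size argument (the host/device buffers must be sized for the largest shape in the profile):

context.set_binding_shape(0, (5, 112, 112, 3))  # actual shape of this batch
[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
stream.synchronize()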

Thanks

I ran into the same error. Has anyone found a solution to this problem?

It’s a problem with TensorRT 7.0. Please use 6.0.