Question about Python tutorial

Description

I was trying to extend the example at TensorRT/tutorial-runtime.ipynb at master · NVIDIA/TensorRT · GitHub with batch support. I believe I did everything right, but only the first item of each batch was assigned meaningful values; the remaining items in the output stayed 0.

Environment

TensorRT Version: 8.0.1.6
GPU Type: Tesla T4
Nvidia Driver Version: 450.119.03
CUDA Version: 11.4 (from the container)
CUDNN Version:
Operating System + Version: EC2 instance
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Container nvcr.io/nvidia/tensorrt:21.08-py3

Steps To Reproduce

When invoking trtexec to convert the ONNX model, I set shapes to allow a range of batch sizes.

trtexec --onnx=fcn-resnet101.onnx --explicitBatch --fp16 --workspace=5200 --minShapes=input:1x3x1026x1282 --optShapes=input:2x3x1026x1282 --maxShapes=input:4x3x1026x1282 --buildOnly --saveEngine=fcn-resnet101.trt
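
For reference, the same dynamic-shape profile can also be set up through the TensorRT Python builder API. This is only a sketch of an equivalent build (the engine above was actually produced with trtexec):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("fcn-resnet101.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.max_workspace_size = 5200 << 20  # 5200 MiB, matching --workspace=5200
config.set_flag(trt.BuilderFlag.FP16)

# One optimization profile covering batch sizes 1 (min) to 4 (max)
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 1026, 1282), (2, 3, 1026, 1282), (4, 3, 1026, 1282))
config.add_optimization_profile(profile)

with open("fcn-resnet101.trt", "wb") as f:
    f.write(builder.build_serialized_network(network, config))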

I then stacked a batch of identical images together:

import numpy as np

BATCH_SIZE = 4
batch = np.stack([input_image] * BATCH_SIZE)

# In [21]: batch.shape
# Out[21]: (4, 3, 1026, 1282)

After creating the execution context, I set the input binding shape:

import tensorrt as trt

model_path = "fcn-resnet101.trt"
print("Reading engine from file {}".format(model_path))
with open(model_path, "rb") as f, trt.Runtime(trt.Logger(trt.Logger.WARNING)) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
context.set_binding_shape(engine.get_binding_index("input"), (BATCH_SIZE, 3, image_height, image_width))

The call succeeds and returns True.
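
As a sanity check (my addition, not part of the original tutorial), you can print the shape the context now reports for every binding; once the input shape is set, the output binding's shape should also be fully specified, batch dimension included:

# Print each binding's name and the shape the context reports for it.
# After set_binding_shape, the output shape should be concrete and its
# first dimension should equal BATCH_SIZE.
for i in range(engine.num_bindings):
    print(engine.get_binding_name(i), tuple(context.get_binding_shape(i)))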

Then I run inference and evaluate the resulting array:

import pycuda.autoinit  # initializes a CUDA context
import pycuda.driver as cuda

bindings = []
for binding in engine:
    binding_idx = engine.get_binding_index(binding)
    # The context's binding shape is fully specified at this point,
    # so this volume already includes the batch dimension.
    size = trt.volume(context.get_binding_shape(binding_idx))
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    if engine.binding_is_input(binding):
        input_buffer = np.ascontiguousarray(batch)
        input_memory = cuda.mem_alloc(batch.nbytes)
        bindings.append(int(input_memory))
    else:
        output_buffer = np.empty([BATCH_SIZE, size], dtype)
        output_memory = cuda.mem_alloc(output_buffer.nbytes)
        bindings.append(int(output_memory))

stream = cuda.Stream()
# Transfer input data to the GPU.
cuda.memcpy_htod_async(input_memory, input_buffer, stream)
# Run inference
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
# Transfer prediction output from the GPU.
cuda.memcpy_dtoh_async(output_buffer, output_memory, stream)
# Synchronize the stream
stream.synchronize()

However, it seems only the first item was assigned correctly; the last three stayed 0.

In [24]: output_buffer[0][output_buffer[0] > 1]
Out[24]: array([15, 15, 15, ..., 15, 15, 15], dtype=int32)

In [25]: output_buffer[1][output_buffer[1] > 1]
Out[25]: array([], dtype=int32)

In [26]: output_buffer[2][output_buffer[2] > 1]
Out[26]: array([], dtype=int32)

In [27]: output_buffer[3][output_buffer[3] > 1]
Out[27]: array([], dtype=int32)

Hi @tzl,

Please refer to the following doc on working with dynamic shape inputs and optimization profiles.
Developer Guide :: NVIDIA Deep Learning TensorRT Documentation

Thank you.

Thank you. I solved the problem.

output_buffer = np.empty([BATCH_SIZE, size], dtype)

should have been

output_buffer = np.empty([BATCH_SIZE, size // BATCH_SIZE], dtype)
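
For what it's worth, a way to avoid the manual division (my sketch, same idea) is to allocate the host buffer directly from the shape the context reports, since trt.volume of a fully specified binding shape already includes the batch dimension:

# Size the host output buffer straight from the context's reported shape;
# no division by BATCH_SIZE is needed because the shape already carries
# the batch dimension.
output_shape = context.get_binding_shape(binding_idx)
output_buffer = np.empty(tuple(output_shape), dtype)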