How to run inference with 2 different-shape outputs

Description

I used the BERT demo from the TensorRT GitHub repo; here is the link:

trt6.0

My BERT model has 2 outputs with different shapes. I called network.mark_output() on each of them one by one to make them outputs of the engine (see the snippet after the build log below), and the engine builds successfully:
[TensorRT] INFO: Detected 3 inputs and 2 output network tensors.
[TensorRT] INFO: Detected 3 inputs and 2 output network tensors.
[TensorRT] INFO: Detected 3 inputs and 2 output network tensors.
[TensorRT] INFO: Saving Engine to bert_slot_384.engine
[TensorRT] INFO: Done.
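Roughly, the marking step in my builder script looks like this (the tensor handles below are placeholders; the real ones come from the BERT network definition in the demo):

# The two differently shaped output tensors of the network (placeholder names).
network.mark_output(intent_logits)
network.mark_output(slot_logits)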

But a problem happens when I use inference.py to run inference. I didn't change any of the CUDA/memory code, and I get the error message below. With only one output, inference works fine.

"""
[TensorRT] ERROR: engine.cpp (165) - Cuda Error in ~ExecutionContext: 700 (an illegal memory access was encountered)
[TensorRT] ERROR: INTERNAL_ERROR: std::exception
[TensorRT] ERROR: Parameter check failed at: …/rtSafe/safeContext.cpp::terminateCommonContext::165, condition: cudaEventDestroy(context.start) failure.
[TensorRT] ERROR: Parameter check failed at: …/rtSafe/safeContext.cpp::terminateCommonContext::170, condition: cudaEventDestroy(context.stop) failure.
[TensorRT] ERROR: …/rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 700 (an illegal memory access was encountered)
terminate called after throwing an instance of 'nvinfer1::CudaError'
  what(): std::exception
Aborted (core dumped)
"""

Environment

TensorRT Version: 6.0
GPU Type: 2080TI
Nvidia Driver Version: 418.39
CUDA Version: 10.1

Hi, please share your model and script so that we can help you better.

Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
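For example, to load and time the serialized engine from your build log (you may also need to preload the BERT plugin libraries before running it):

trtexec --loadEngine=bert_slot_384.engine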

Thanks!

# Import necessary plugins for BERT TensorRT
ctypes.CDLL("libnvinfer_plugin.so", mode=ctypes.RTLD_GLOBAL)
ctypes.CDLL("/workspace/TensorRT/demo/BERT/build/libcommon.so", mode=ctypes.RTLD_GLOBAL)
ctypes.CDLL("/workspace/TensorRT/demo/BERT/build/libbert_plugins.so", mode=ctypes.RTLD_GLOBAL)

# The first context created will use the 0th profile. A new context must be created
# for each additional profile needed. Here, we only use batch size 1, thus we only need the first profile.
with open(args.bert_engine, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime, \
    runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:

    # We always use batch size 1.
    input_shape = (1, max_seq_length)
    input_nbytes = trt.volume(input_shape) * trt.int32.itemsize

    # Allocate device memory for inputs.
    d_inputs = [cuda.mem_alloc(input_nbytes) for binding in range(3)]
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()

    # Specify input shapes. These must be within the min/max bounds of the active profile (0th profile in this case)
    # Note that input shapes can be specified on a per-inference basis, but in this case, we only have a single shape.
    for binding in range(3):
        context.set_binding_shape(binding, input_shape)
    assert context.all_binding_shapes_specified

    # Allocate output buffer by querying the size from the context. This may be different for different input shapes.
    h_output = cuda.pagelocked_empty(tuple(context.get_binding_shape(3)), dtype=np.float32)
    d_output = cuda.mem_alloc(h_output.nbytes)

    def inference(features, doc_tokens, label):
        print("\nRunning Inference...")
        eval_start_time = time.time()

        # Copy inputs
        cuda.memcpy_htod_async(d_inputs[0], features["input_ids"], stream)
        cuda.memcpy_htod_async(d_inputs[1], features["segment_ids"], stream)
        cuda.memcpy_htod_async(d_inputs[2], features["input_mask"], stream)

        # Run inference
        context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)],
                                 stream_handle=stream.handle)
        # Transfer predictions back from GPU
        
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        
        # Synchronize the stream
        stream.synchronize()
        predict_intents(h_output,label)
        
        
        eval_time_elapsed = time.time() - eval_start_time
        
        print("------------------------")
        print("Running inference in {:.3f} Sentences/Sec".format(1.0/eval_time_elapsed))
        print("------------------------")

Here is the code where I get the output from the engine, just like the one in
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

Thanks.

I have provided the code from my .py file above. I think the code in inference.py was designed for one output, so it probably needs to be changed to handle 2 different-shape outputs. Do you have any experience or an example? Thanks a lot.

Hi @501967143,

We don't have a ready example to provide.
Since you already created one output buffer, please create another one that mimics d_output in the shared code, roughly like the sketch below.
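Untested sketch, continuing your shared code; the binding indices (3 and 4) and the float32 dtype are assumptions based on your 3-input/2-output engine, so adjust them to your actual bindings:

# Allocate one host/device buffer pair per output binding, querying each
# shape from the context after the input shapes have been set.
h_output_0 = cuda.pagelocked_empty(tuple(context.get_binding_shape(3)), dtype=np.float32)
d_output_0 = cuda.mem_alloc(h_output_0.nbytes)
h_output_1 = cuda.pagelocked_empty(tuple(context.get_binding_shape(4)), dtype=np.float32)
d_output_1 = cuda.mem_alloc(h_output_1.nbytes)

# In inference(), after copying the three inputs, pass a device pointer for
# every binding: the three inputs first, then both outputs, in binding order.
context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output_0), int(d_output_1)],
                         stream_handle=stream.handle)

# Copy both outputs back before synchronizing the stream.
cuda.memcpy_dtoh_async(h_output_0, d_output_0, stream)
cuda.memcpy_dtoh_async(h_output_1, d_output_1, stream)
stream.synchronize()

Passing only four binding pointers to a five-binding engine is a likely cause of the illegal memory access you are seeing.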

Thank you.

I have created an open-source, well-documented project that demonstrates how to run inference with single/multiple-input, single/multiple-output models, with batching support, in C++. It can be found here.