I have a BERT model with two outputs of different shapes. I call network.mark_output() on each output tensor so both become engine outputs, and the engine builds successfully:
[TensorRT] INFO: Detected 3 inputs and 2 output network tensors.
[TensorRT] INFO: Detected 3 inputs and 2 output network tensors.
[TensorRT] INFO: Detected 3 inputs and 2 output network tensors.
[TensorRT] INFO: Saving Engine to bert_slot_384.engine
[TensorRT] INFO: Done.
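For context, this is roughly how the two outputs get marked at build time (the tensor names below are placeholders, not the real ones from my network):

# Build-time sketch only - slot_logits / intent_logits are placeholder names
# for my two output tensors, which have different shapes.
network.mark_output(slot_logits)    # first output, e.g. shape (1, max_seq_length, num_slots)
network.mark_output(intent_logits)  # second output, e.g. a different shape such as (1, num_intents)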
The problem happens when I run inference.py. I didn't change any of the CUDA/memory handling code, and it fails with the error message below. If the engine has only one output, inference works fine.
“”"
[TensorRT] ERROR: engine.cpp (165) - Cuda Error in ~ExecutionContext: 700 (an illegal memory access was
encountered)
[TensorRT] ERROR: INTERNAL_ERROR: std::exception
[TensorRT] ERROR: Parameter check failed at: …/rtSafe/safeContext.cpp::terminateCommonContext::165, condition:
cudaEventDestroy(context.start) failure.
[TensorRT] ERROR: Parameter check failed at: …/rtSafe/safeContext.cpp::terminateCommonContext::170, condition:
cudaEventDestroy(context.stop) failure.
[TensorRT] ERROR: …/rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 700 (an illegal memory access was
encountered)
terminate called after throwing an instance of ‘nvinfer1::CudaError’
what(): std::exception
Aborted (core dumped)
“”"
import ctypes
import time

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

# Import necessary plugins for BERT TensorRT
ctypes.CDLL("libnvinfer_plugin.so", mode=ctypes.RTLD_GLOBAL)
ctypes.CDLL("/workspace/TensorRT/demo/BERT/build/libcommon.so", mode=ctypes.RTLD_GLOBAL)
ctypes.CDLL("/workspace/TensorRT/demo/BERT/build/libbert_plugins.so", mode=ctypes.RTLD_GLOBAL)

# The first context created will use the 0th profile. A new context must be created
# for each additional profile needed. Here, we only use batch size 1, thus we only need the first profile.
with open(args.bert_engine, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:

    # We always use batch size 1.
    input_shape = (1, max_seq_length)
    input_nbytes = trt.volume(input_shape) * trt.int32.itemsize

    # Allocate device memory for inputs.
    d_inputs = [cuda.mem_alloc(input_nbytes) for binding in range(3)]

    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()

    # Specify input shapes. These must be within the min/max bounds of the active profile (0th profile in this case).
    # Note that input shapes can be specified on a per-inference basis, but in this case, we only have a single shape.
    for binding in range(3):
        context.set_binding_shape(binding, input_shape)
    assert context.all_binding_shapes_specified

    # Allocate output buffer by querying the size from the context. This may be different for different input shapes.
    h_output = cuda.pagelocked_empty(tuple(context.get_binding_shape(3)), dtype=np.float32)
    d_output = cuda.mem_alloc(h_output.nbytes)

    def inference(features, doc_tokens, label):
        print("\nRunning Inference...")
        eval_start_time = time.time()

        # Copy inputs
        cuda.memcpy_htod_async(d_inputs[0], features["input_ids"], stream)
        cuda.memcpy_htod_async(d_inputs[1], features["segment_ids"], stream)
        cuda.memcpy_htod_async(d_inputs[2], features["input_mask"], stream)

        # Run inference
        context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)],
                                 stream_handle=stream.handle)

        # Transfer predictions back from GPU
        cuda.memcpy_dtoh_async(h_output, d_output, stream)

        # Synchronize the stream
        stream.synchronize()

        predict_intents(h_output, label)

        eval_time_elapsed = time.time() - eval_start_time
        print("------------------------")
        print("Running inference in {:.3f} Sentences/Sec".format(1.0 / eval_time_elapsed))
        print("------------------------")
That is the relevant code from my .py file. I think this inference.py code was designed for a single output, so it probably needs to be changed to handle two outputs with different shapes. Do you have any experience or an example? Thanks a lot.
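Below is a rough sketch of what I imagine the change might look like: allocating one buffer per output binding and passing all of them to execute_async_v2. I'm assuming the two output bindings are indices 3 and 4 (after the three inputs) and that both outputs are FP32; is this the right direction?

# Sketch only - assumes bindings 0-2 are the inputs, 3-4 are the two outputs, both FP32.
# Allocate one host/device buffer per output binding, using the shape the context reports.
h_outputs = []
d_outputs = []
for binding in range(3, engine.num_bindings):
    shape = tuple(context.get_binding_shape(binding))
    h_out = cuda.pagelocked_empty(shape, dtype=np.float32)
    h_outputs.append(h_out)
    d_outputs.append(cuda.mem_alloc(h_out.nbytes))

# Pass a device pointer for every binding (3 inputs + 2 outputs).
bindings = [int(d_inp) for d_inp in d_inputs] + [int(d_out) for d_out in d_outputs]
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)

# Copy both outputs back before synchronizing.
for h_out, d_out in zip(h_outputs, d_outputs):
    cuda.memcpy_dtoh_async(h_out, d_out, stream)
stream.synchronize()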
I have created an open-source, well-documented project which demonstrates how you can run inference with single/multiple-input, single/multiple-output models with batching support in C++. It can be found here.