I have a BERT model with two outputs of different shapes. I call network.mark_output() on each output tensor so both become engine outputs, and the engine builds successfully:
[TensorRT] INFO: Detected 3 inputs and 2 output network tensors.
[TensorRT] INFO: Detected 3 inputs and 2 output network tensors.
[TensorRT] INFO: Detected 3 inputs and 2 output network tensors.
[TensorRT] INFO: Saving Engine to bert_slot_384.engine
[TensorRT] INFO: Done.
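For context, this is roughly how the two outputs get marked at build time (the tensor names below are placeholders, not the real ones from my network):

# Build-time sketch only - slot_logits / intent_logits are placeholder names
# for my two output tensors, which have different shapes.
network.mark_output(slot_logits)    # first output, e.g. shape (1, max_seq_length, num_slots)
network.mark_output(intent_logits)  # second output, e.g. a different shape such as (1, num_intents)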
The problem happens when I run inference.py. I didn't change any of the CUDA/memory handling code, and it fails with the error message below. If the engine has only one output, inference works fine.
“”"
[TensorRT] ERROR: engine.cpp (165) - Cuda Error in ~ExecutionContext: 700 (an illegal memory access was
encountered)
[TensorRT] ERROR: INTERNAL_ERROR: std::exception
[TensorRT] ERROR: Parameter check failed at: …/rtSafe/safeContext.cpp::terminateCommonContext::165, condition:
cudaEventDestroy(context.start) failure.
[TensorRT] ERROR: Parameter check failed at: …/rtSafe/safeContext.cpp::terminateCommonContext::170, condition:
cudaEventDestroy(context.stop) failure.
[TensorRT] ERROR: …/rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 700 (an illegal memory access was
encountered)
terminate called after throwing an instance of ‘nvinfer1::CudaError’
what(): std::exception
Aborted (core dumped)
“”"
import ctypes
import time

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

# Import necessary plugins for BERT TensorRT
ctypes.CDLL("libnvinfer_plugin.so", mode=ctypes.RTLD_GLOBAL)
ctypes.CDLL("/workspace/TensorRT/demo/BERT/build/libcommon.so", mode=ctypes.RTLD_GLOBAL)
ctypes.CDLL("/workspace/TensorRT/demo/BERT/build/libbert_plugins.so", mode=ctypes.RTLD_GLOBAL)

# The first context created will use the 0th profile. A new context must be created
# for each additional profile needed. Here, we only use batch size 1, thus we only need the first profile.
with open(args.bert_engine, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:

    # We always use batch size 1.
    input_shape = (1, max_seq_length)
    input_nbytes = trt.volume(input_shape) * trt.int32.itemsize

    # Allocate device memory for inputs.
    d_inputs = [cuda.mem_alloc(input_nbytes) for binding in range(3)]

    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()

    # Specify input shapes. These must be within the min/max bounds of the active profile (0th profile in this case).
    # Note that input shapes can be specified on a per-inference basis, but in this case, we only have a single shape.
    for binding in range(3):
        context.set_binding_shape(binding, input_shape)
    assert context.all_binding_shapes_specified

    # Allocate output buffer by querying the size from the context. This may be different for different input shapes.
    h_output = cuda.pagelocked_empty(tuple(context.get_binding_shape(3)), dtype=np.float32)
    d_output = cuda.mem_alloc(h_output.nbytes)

    def inference(features, doc_tokens, label):
        print("\nRunning Inference...")
        eval_start_time = time.time()

        # Copy inputs
        cuda.memcpy_htod_async(d_inputs[0], features["input_ids"], stream)
        cuda.memcpy_htod_async(d_inputs[1], features["segment_ids"], stream)
        cuda.memcpy_htod_async(d_inputs[2], features["input_mask"], stream)

        # Run inference
        context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)],
                                 stream_handle=stream.handle)

        # Transfer predictions back from GPU
        cuda.memcpy_dtoh_async(h_output, d_output, stream)

        # Synchronize the stream
        stream.synchronize()

        predict_intents(h_output, label)

        eval_time_elapsed = time.time() - eval_start_time
        print("------------------------")
        print("Running inference in {:.3f} Sentences/Sec".format(1.0 / eval_time_elapsed))
        print("------------------------")
That is the relevant code from my .py file. I think this inference.py code was designed for a single output, so it probably needs to be changed to handle two outputs with different shapes. Do you have any experience or an example? Thanks a lot.
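Below is a rough sketch of what I imagine the change might look like: allocating one buffer per output binding and passing all of them to execute_async_v2. I'm assuming the two output bindings are indices 3 and 4 (after the three inputs) and that both outputs are FP32; is this the right direction?

# Sketch only - assumes bindings 0-2 are the inputs, 3-4 are the two outputs, both FP32.
# Allocate one host/device buffer per output binding, using the shape the context reports.
h_outputs = []
d_outputs = []
for binding in range(3, engine.num_bindings):
    shape = tuple(context.get_binding_shape(binding))
    h_out = cuda.pagelocked_empty(shape, dtype=np.float32)
    h_outputs.append(h_out)
    d_outputs.append(cuda.mem_alloc(h_out.nbytes))

# Pass a device pointer for every binding (3 inputs + 2 outputs).
bindings = [int(d_inp) for d_inp in d_inputs] + [int(d_out) for d_out in d_outputs]
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)

# Copy both outputs back before synchronizing.
for h_out, d_out in zip(h_outputs, d_outputs):
    cuda.memcpy_dtoh_async(h_out, d_out, stream)
stream.synchronize()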
I have created an open-source, well-documented project which demonstrates how you can run inference with single/multiple-input, single/multiple-output models with batching support in C++. It can be found here.