ONNX model and TensorRT engine work differently


I am trying to convert an ONNX BERT model to a TensorRT engine.

The ONNX model works well, but the converted TensorRT engine behaves like an untrained model.

I suspect that either the export process or the TensorRT inference code causes this problem.

My export and inference code is below.

Is there something wrong with my code?


TensorRT Version:
Nvidia Driver Version: 470.141.03
CUDA Version: 11.4
CUDNN Version:
Operating System + Version: Ubuntu 20.04.3 LTS
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
explicit_batch_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

with trt.Builder(TRT_LOGGER) as builder, builder.create_network(explicit_batch_flag) as network,\
    builder.create_builder_config() as builder_config:
    parser = trt.OnnxParser(network, TRT_LOGGER)
    onnx_path = "./bert.onnx"
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            print(f"Failed to load ONNX file: {onnx_path}")
            # Print every parser error so the failure can be diagnosed
            for error in range(parser.num_errors):
                print(parser.get_error(error))
    inputs = [network.get_input(i) for i in range(network.num_inputs)]
    profile = builder.create_optimization_profile()
    min_shape = (1, 128)
    opt_shape = (8, 128)
    max_shape = (16, 128)
    profile.set_shape(inputs[0].name, min_shape, opt_shape, max_shape)
    profile.set_shape(inputs[1].name, min_shape, opt_shape, max_shape)
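    profile.set_shape(inputs[2].name, min_shape, opt_shape, max_shape)
    # Register the profile; a network with dynamic input shapes needs one to build
    builder_config.add_optimization_profile(profile)
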
    engine = builder.build_engine(network, builder_config)
    serialized_engine = engine.serialize()
    with open("./bert.engine", "wb") as fout:
        fout.write(serialized_engine)

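import time

import numpy as np
import pycuda.autoinit  # creates the CUDA context
import pycuda.driver as cuda

with open("./bert.engine", "rb") as f, \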
    trt.Runtime(TRT_LOGGER) as runtime, \
    runtime.deserialize_cuda_engine(f.read()) as engine, \
    engine.create_execution_context() as context:
    input_shape = (1, max_seq_length)
    input_nbytes = trt.volume(input_shape) * trt.int32.itemsize
    d_inputs = [cuda.mem_alloc(input_nbytes) for binding in range(3)]
    stream = cuda.Stream()

    for binding in range(3):
        context.set_binding_shape(binding, input_shape)
    assert context.all_binding_shapes_specified

    h_output1 = cuda.pagelocked_empty(tuple(context.get_binding_shape(3)), dtype=np.float32)
    h_output2 = cuda.pagelocked_empty(tuple(context.get_binding_shape(4)), dtype=np.float32)
    d_output1 = cuda.mem_alloc(h_output1.nbytes)
    d_output2 = cuda.mem_alloc(h_output2.nbytes)

    eval_time_elapsed = 0.0
    for feature_index, feature in enumerate(features):
        input_ids = cuda.register_host_memory(np.ascontiguousarray(feature.input_ids.ravel()))
        segment_ids = cuda.register_host_memory(np.ascontiguousarray(feature.segment_ids.ravel()))
        input_mask = cuda.register_host_memory(np.ascontiguousarray(feature.input_mask.ravel()))

        eval_start_time = time.time()
        cuda.memcpy_htod_async(d_inputs[0], input_ids, stream)
        cuda.memcpy_htod_async(d_inputs[1], segment_ids, stream)
        cuda.memcpy_htod_async(d_inputs[2], input_mask, stream)

        context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output1), int(d_output2)], stream_handle=stream.handle)
        eval_time_elapsed += (time.time() - eval_start_time)

        cuda.memcpy_dtoh_async(h_output1, d_output1, stream)
        cuda.memcpy_dtoh_async(h_output2, d_output2, stream)
        # Wait for the async copies to finish before reading h_output1/h_output2
        stream.synchronize()
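
To narrow down whether the export or the inference side is at fault, it can help to run the same feature through both the ONNX model and the TensorRT engine and compare the logits numerically. The following is only a minimal sketch: the helper names `max_abs_diff` and `outputs_match`, the tolerance, and the sample values are hypothetical and not taken from the code above.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def outputs_match(reference, candidate, atol=1e-3):
    """True if every element of candidate is within atol of reference."""
    return max_abs_diff(reference, candidate) <= atol


# Hypothetical logits for one feature; real values would come from the
# ONNX runtime and from h_output1 / h_output2 after stream.synchronize().
onnx_logits = [0.12, -1.50, 3.25]
trt_logits_good = [0.1201, -1.4999, 3.2502]
trt_logits_bad = [0.90, 0.10, -0.40]

print(outputs_match(onnx_logits, trt_logits_good))
print(outputs_match(onnx_logits, trt_logits_bad))
```

If the two outputs disagree by far more than rounding error, the engine is probably receiving different data than the ONNX model, which points at the input handling rather than at precision.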


Could you please share the ONNX model and a script that reproduces the issue, so that we can debug it?

Thank you.

I tested the above code with several models. One of them was downloaded from inference/language/bert at master · mlcommons/inference · GitHub

(With that model, the *_shape values should be changed to (*, 384).)
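
Since the sequence length is the only part of the profile shapes that changes between models, it could be factored out. This is a small sketch; the helper name `profile_shapes` is hypothetical, and the batch sizes 1, 8, and 16 are the ones used in the export code above.

```python
def profile_shapes(seq_len, min_batch=1, opt_batch=8, max_batch=16):
    """Return (min, opt, max) input shapes for a given sequence length."""
    if not (min_batch <= opt_batch <= max_batch):
        raise ValueError("expected min_batch <= opt_batch <= max_batch")
    return (
        (min_batch, seq_len),
        (opt_batch, seq_len),
        (max_batch, seq_len),
    )


# Sequence length 384 for the mlcommons model, 128 for the original one.
min_shape, opt_shape, max_shape = profile_shapes(384)
print(min_shape, opt_shape, max_shape)
```

The three tuples can then be passed to `profile.set_shape(...)` exactly as in the export code.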

Thank you.

I am also sharing the build log and the inference log.

build log

[02/01/2023-08:14:50] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +413, GPU +132, now: CPU 1357, GPU 767 (MiB)
[02/01/2023-08:14:51] [TRT] [W] onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
Input 'input_ids' with shape (-1, 128) and dtype DataType.INT32
Input 'input_mask' with shape (-1, 128) and dtype DataType.INT32
Input 'segment_ids' with shape (-1, 128) and dtype DataType.INT32
Output 'output_start_logits' with shape (-1, 128) and dtype DataType.FLOAT
Output 'output_end_logits' with shape (-1, 128) and dtype DataType.FLOAT
/tmp/ipykernel_190824/3262529163.py:37: DeprecationWarning: Use build_serialized_network instead.
  engine = builder.build_engine(network, builder_config)
[02/01/2023-08:14:55] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +634, GPU +266, now: CPU 2414, GPU 1033 (MiB)
[02/01/2023-08:14:56] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +358, GPU +260, now: CPU 2772, GPU 1293 (MiB)
[02/01/2023-08:14:56] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.0.5
[02/01/2023-08:14:56] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[02/01/2023-08:16:15] [TRT] [I] Detected 3 inputs and 2 output network tensors.
[02/01/2023-08:16:17] [TRT] [I] Total Host Persistent Memory: 160
[02/01/2023-08:16:17] [TRT] [I] Total Device Persistent Memory: 0
[02/01/2023-08:16:17] [TRT] [I] Total Scratch Memory: 81821696
[02/01/2023-08:16:17] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 0 MiB
[02/01/2023-08:16:17] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.004787ms to assign 2 blocks to 2 nodes requiring 81822208 bytes.
[02/01/2023-08:16:17] [TRT] [I] Total Activation Memory: 81822208
[02/01/2023-08:16:17] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[02/01/2023-08:16:17] [TRT] [I] build engine in 84.318 Sec
[02/01/2023-08:16:17] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[02/01/2023-08:16:18] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.

inference log

[02/01/2023-08:16:19] [TRT] [I] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.

[02/01/2023-08:16:19] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 3605, GPU 1787 (MiB)
[02/01/2023-08:16:19] [TRT] [I] Loaded engine size: 417 MiB
[02/01/2023-08:16:20] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)

Running Inference...
[02/01/2023-08:16:20] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)

Oh, I solved the problem.

The order of the inputs in my TensorRT inference code differs from the order in the ONNX model, and I had not taken that into account.

Sorry for my carelessness.
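
For anyone who runs into the same thing: the build log lists the inputs as input_ids, input_mask, segment_ids, while the inference code copies segment_ids into the second buffer and input_mask into the third. One way to avoid such a mismatch is to order the host arrays by binding name instead of by a hard-coded position. The sketch below assumes the names are read from the engine (for example with `engine.get_binding_name(i)` for the input bindings in the TensorRT 8.x Python API); the helper `order_inputs_by_binding` and the sample data are hypothetical.

```python
def order_inputs_by_binding(binding_names, host_inputs):
    """Return the host arrays in the same order as the engine's input bindings."""
    missing = [name for name in binding_names if name not in host_inputs]
    if missing:
        raise KeyError(f"no host data for bindings: {missing}")
    return [host_inputs[name] for name in binding_names]


# Input order as reported in the build log. With a real engine these names
# would be queried from it rather than written out by hand (assumption).
binding_names = ["input_ids", "input_mask", "segment_ids"]

# Hypothetical per-feature data, keyed by name so the order cannot drift.
host_inputs = {
    "input_ids": [101, 2054, 102],
    "segment_ids": [0, 0, 0],
    "input_mask": [1, 1, 1],
}

ordered = order_inputs_by_binding(binding_names, host_inputs)
print(ordered)
```

Each entry of `ordered` would then be copied into `d_inputs[i]` at the same index, so the device buffers line up with the bindings regardless of how the ONNX exporter ordered the inputs.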
