I can't get correct results from my TensorRT model


I tried to convert a GPT model from PyTorch to ONNX and then to TensorRT. The conversion to a TensorRT engine succeeded, but I can't get the results I expect during the inference phase. I can guarantee that the ONNX model is correct. The two warnings below appeared while converting the ONNX model to the TensorRT engine, and I don't know whether they affect the engine conversion.

[05/29/2022-19:08:00] [TRT] [W] onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[05/29/2022-19:08:01] [TRT] [W] ShapedWeights.cpp:173: Weights transformer.h.8.attn.c_attn.weight has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
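
For context on the first warning: ONNX exporters commonly store integer constants as INT64, while the TensorRT ONNX parser converts them to INT32 and clamps anything out of range. This is often harmless when the clamped values are sentinels (for example a slice end of INT64_MAX), but whether that holds for this model is an assumption that would need checking. A minimal plain-Python illustration of the clamping, where clamp_to_int32 is a hypothetical helper and not a TensorRT function:

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1

def clamp_to_int32(value):
    """Clamp an integer into the INT32 range (illustration only)."""
    return max(INT32_MIN, min(INT32_MAX, value))

# A sentinel such as INT64_MAX is clamped to INT32_MAX, which still
# means "up to the end" for any realistic tensor size.
print(clamp_to_int32(2 ** 63 - 1))
# Ordinary values such as a sequence length pass through unchanged.
print(clamp_to_int32(300))
```

If real weights (rather than sentinels) were outside the INT32 range, the clamped engine could produce different results from the ONNX model.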

The code that converts the ONNX model to a TensorRT engine:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(logger)

network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

parser = trt.OnnxParser(network, logger)

success = parser.parse_from_file('model.onnx')
# for idx in range(parser.num_errors):
#     print(parser.get_error(idx))

if not success:
    pass # Error handling code here

config = builder.create_builder_config()
#config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 20) # 1 MiB
config.max_workspace_size = 1 << 31

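profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 1), (1, 20), (1, 300))
profile.set_shape("token_type_ids", (1, 1), (1, 20), (1, 300))
# Register the profile with the config; otherwise the shapes above are never used
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("sample4.engine", "wb") as f:
    f.write(serialized_engine)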

The main inference code. input_ids and token_type_ids are the two inputs to the model.

context.active_optimization_profile = 0
origin_inputshape = context.get_binding_shape(0)
origin_inputshape[0],origin_inputshape[1] = input_ids.shape

inputs, outputs, bindings, stream = common.allocate_buffers(engine)
inputs[1].host = input_ids
inputs[0].host = token_type_ids

logits, *_ = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
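
With a dynamic-shape engine, modifying the object returned by get_binding_shape does not by itself tell the context what the actual input shape is; the shape normally has to be set on the context. A hedged sketch, assuming the TensorRT 8.x binding API (get_binding_index, set_binding_shape, all_binding_shapes_specified) and the input names used in the build script; bind_input_shapes is a hypothetical helper:

```python
def bind_input_shapes(engine, context, arrays_by_name):
    """Tell the execution context the actual shape of each dynamic input.

    engine and context are assumed to be a TensorRT 8.x ICudaEngine and
    IExecutionContext; arrays_by_name maps input names to NumPy arrays.
    """
    for name, array in arrays_by_name.items():
        index = engine.get_binding_index(name)
        if index < 0:
            raise KeyError(f"engine has no binding named {name!r}")
        context.set_binding_shape(index, tuple(array.shape))
    # Every dynamic input needs a concrete shape before inference can run.
    if not context.all_binding_shapes_specified:
        raise RuntimeError("some input shapes are still unspecified")

# Usage (input names taken from the build script):
# bind_input_shapes(engine, context,
#                   {"input_ids": input_ids, "token_type_ids": token_type_ids})
```

Looking bindings up by name also avoids depending on the order of inputs[0] and inputs[1], which follows the engine's binding order rather than the order in the script.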

The model I want to convert is OpenAIGPTLMHeadModel. I can only post one link, but you can check it on Hugging Face.


TensorRT Version:
GPU Type: RTX 3060
Nvidia Driver Version: 497.38
CUDA Version: 11.5.1
CUDNN Version:
Operating System + Version: Windows11
Python Version (if applicable): 3.8.13
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.11
Baremetal or Container (if container which image + tag):

Relevant Files

github link to my code
RuntensorRT is the inference phase.

Please share the ONNX model and the script, if you have not already, so that we can assist you better.
Meanwhile, you can try a few things:

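  1. Validate your model with the snippet below:


import sys
import onnx
filename = yourONNXmodel  # placeholder: path to your ONNX file
model = onnx.load(filename)
# Raises an exception if the model is structurally invalid
onnx.checker.check_model(model)

  2. Try running your model with the trtexec command.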

If you are still facing issues, please share the trtexec --verbose log for further debugging.

1. Validation results

2. I ran trtexec with './trtexec --onnx=D:\Subject\dialogue\CDial-GPT\model.onnx --saveEngine=D:\Subject\dialogue\CDial-GPT\sample.engine --fp16 --workspace=10000 --minShapes=input_ids:1x1,token_type_ids:1x1 --optShapes=input_ids:1x300,token_type_ids:1x300 --maxShapes=input_ids:1x300,token_type_ids:1x300 --device=0 --verbose --exportTimes=trace.json'
Here are all the logs I could get:
logs.txt (683.1 KB)

The ONNX file is too big to upload, so I am uploading the model to Google Drive. Can I have your email so I can share it with you, or is there a more convenient way?

Here is trace.json:
trace.json (169.2 KB)

I have replied below my question; please check.

The result is no longer 0, but the dimension is still wrong; the correct dimension is 3.

The problem seems to be in allocate_buffers(engine). I had previously hard-coded the size, because the size computed by trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size is a negative number. With the computed size, host_mem = cuda.pagelocked_empty(size, dtype) fails with 'pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory'. How can I solve this problem?

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
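
The negative size is consistent with how dynamic dimensions are reported: on an engine built with dynamic shapes, engine.get_binding_shape returns -1 for each dimension that is only known at runtime, so the product of the dimensions can come out negative. A minimal plain-Python sketch, where volume is a hypothetical stand-in for trt.volume and the shapes are illustrative rather than taken from the actual engine:

```python
from functools import reduce
from operator import mul

def volume(shape):
    """Product of all dimensions (stand-in for trt.volume)."""
    return reduce(mul, shape, 1)

static_shape = (1, 300)   # fully specified shape
dynamic_shape = (1, -1)   # -1 marks a dimension unknown until runtime

print(volume(static_shape))   # 300
print(volume(dynamic_shape))  # -1, not a usable buffer size
```

A negative element count passed to a page-locked allocation is not a meaningful request, which would fit the out-of-memory error reported here.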


The above error is related to dimensions; you may not be handling the dynamic shape correctly.
Please check that you are using context.get_binding_shape correctly for an engine with dynamic shapes.
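
One way to follow this advice is to size the buffers from the shapes the execution context reports once the input shapes have been set on it with set_binding_shape, instead of from the engine. A hedged sketch, assuming the TensorRT 8.x binding API (num_bindings, get_binding_name, and get_binding_shape on the context); buffer_sizes is a hypothetical helper:

```python
def buffer_sizes(engine, context):
    """Return {binding_name: element_count} from context-resolved shapes.

    Assumes a TensorRT 8.x engine/context pair and that
    context.set_binding_shape was already called for every dynamic input.
    """
    sizes = {}
    for index in range(engine.num_bindings):
        shape = tuple(context.get_binding_shape(index))
        # A remaining -1 means the context still lacks an input shape.
        if any(dim < 0 for dim in shape):
            raise ValueError(f"binding {index} still has a dynamic shape: {shape}")
        count = 1
        for dim in shape:
            count *= dim
        sizes[engine.get_binding_name(index)] = count
    return sizes
```

Each count could then be passed to cuda.pagelocked_empty(count, dtype) in place of the trt.volume(...) * engine.max_batch_size expression. Allocating for the maximum profile shape once and reusing the buffers is another option.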

If you still face this issue, please share a repro script and the model so that we can try it on our end.

Thank you.