Description
I converted a GPT model from PyTorch to ONNX and then to TensorRT. The TensorRT engine builds successfully, but inference does not produce the results I expect, and I have verified that the ONNX model itself is correct. The two warnings below appeared while converting the ONNX model to the TensorRT engine; I don't know whether they affect the conversion.
[05/29/2022-19:08:00] [TRT] [W] onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[05/29/2022-19:08:01] [TRT] [W] ShapedWeights.cpp:173: Weights transformer.h.8.attn.c_attn.weight has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
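One way to tell whether these warnings actually change the numerics is to run the same inputs through ONNX Runtime and through the TensorRT engine and compare the logits. A minimal comparison helper (pure NumPy; the function name and tolerances are my own):

```python
import numpy as np

def compare_outputs(ref, test, rtol=1e-3, atol=1e-3):
    """Return (max absolute difference, allclose?) between two output arrays,
    e.g. logits from ONNX Runtime (ref) vs. the TensorRT engine (test)."""
    ref = np.asarray(ref, dtype=np.float32)
    test = np.asarray(test, dtype=np.float32)
    max_abs = float(np.max(np.abs(ref - test)))
    return max_abs, bool(np.allclose(ref, test, rtol=rtol, atol=atol))
```

If the maximum difference is large, the problem is in the engine build or the inference code rather than in the ONNX export.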
The code that converts the ONNX model to a TensorRT engine:
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
success = parser.parse_from_file('model.onnx')
if not success:
    for idx in range(parser.num_errors):
        print(parser.get_error(idx))
    raise RuntimeError("Failed to parse model.onnx")

config = builder.create_builder_config()
# config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 20)  # 1 MiB
config.max_workspace_size = 1 << 31  # 2 GiB

# Dynamic-shape profile: min / opt / max for each input
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 1), (1, 20), (1, 300))
profile.set_shape("token_type_ids", (1, 1), (1, 20), (1, 300))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("sample4.engine", "wb") as f:
    f.write(serialized_engine)
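One thing worth noting: `build_serialized_network` returns `None` when the build fails, and the snippet above writes the result to disk without checking it. A small guard (the function name is my own) makes that failure visible instead of producing an empty or invalid engine file:

```python
def require_engine(serialized):
    """Fail fast if build_serialized_network returned None (build failure)."""
    if serialized is None:
        raise RuntimeError(
            "Engine build failed - rebuild with trt.Logger.VERBOSE for details")
    return serialized
```

Usage would be `f.write(require_engine(serialized_engine))` in place of the plain `f.write`.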
The main inference code; `input_ids` and `token_type_ids` are the model's two inputs:
context.active_optimization_profile = 0

# Resize the dynamic input bindings to the actual batch/sequence shape
origin_inputshape = context.get_binding_shape(0)
origin_inputshape[0], origin_inputshape[1] = input_ids.shape
context.set_binding_shape(0, origin_inputshape)
context.set_binding_shape(1, origin_inputshape)

inputs, outputs, bindings, stream = common.allocate_buffers(engine)
inputs[1].host = input_ids
inputs[0].host = token_type_ids
logits, *_ = common.do_inference_v2(context, bindings=bindings, inputs=inputs,
                                    outputs=outputs, stream=stream)
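`common.allocate_buffers` fills `inputs` in binding order, so the `inputs[1].host = input_ids` / `inputs[0].host = token_type_ids` assignments silently depend on which binding index each ONNX input received. Before trusting those indices, it is worth dumping them; a small helper (the function name is my own, and it only assumes the TRT 8.x binding API):

```python
def describe_bindings(engine):
    """Return (index, name, "input"/"output") for every binding.

    Works with a tensorrt.ICudaEngine under the TRT 8.x binding API; any
    object exposing num_bindings, get_binding_name and binding_is_input
    will do, which also makes the helper easy to unit-test.
    """
    return [(i,
             engine.get_binding_name(i),
             "input" if engine.binding_is_input(i) else "output")
            for i in range(engine.num_bindings)]
```

With a real engine, `print(describe_bindings(engine))` shows whether `input_ids` is binding 0 or 1; assigning the two host buffers in the wrong order is a common cause of wrong outputs with two-input models.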
The model I want to convert is OpenAIGPTLMHeadModel. I can only include one link, but you can check it on Hugging Face.
Environment
TensorRT Version: 8.2.5.1
GPU Type: RTX 3060
Nvidia Driver Version: 497.38
CUDA Version: 11.5.1
CUDNN Version: 8.2.1.32
Operating System + Version: Windows 11
Python Version (if applicable): 3.8.13
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.11
Baremetal or Container (if container which image + tag):
Relevant Files
github link to my code
RuntensorRT is the inference-phase script.