ONNX model TRT engine generation reports different results on two PCs

Description

Same model, but different PCs with different GPUs and SW SDKs (NVIDIA libraries and Python packages; as far as I understand, those differences shouldn't matter).
On PC#1 I successfully generate the TRT engine, but on PC#2 I cannot.

Environment

PC#1:
TensorRT Version: 8.4.0.6
GPU Type: Quadro RTX 3000
Nvidia Driver Version: R516.01 (r515_95-3) / 31.0.15.1601 (4-24-2022)
CUDA Version: 11.7
CUDNN Version: 8.1.1
Operating System + Version: Windows 10
Python Version (if applicable): 3.6.8
TensorFlow Version (if applicable): NA
PyTorch Version (if applicable): NA
Baremetal or Container (if container which image + tag): Baremetal

PC#2:
TensorRT Version: 8.4.0.6
GPU Type: GeForce 3090
Nvidia Driver Version: R511.65(r511_37-13) / 30.0.15.1165 (1-28-2022)
CUDA Version: 11.4
CUDNN Version: 8.1.1
Operating System + Version: Windows 10
Python Version (if applicable): 3.6.8
TensorFlow Version (if applicable): NA
PyTorch Version (if applicable): NA
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

Attached are:
Model onnx
TRT engine generation report on the PC#1
Polygraphy report on the PC#1
TRT engine generation report on the PC#2
Polygraphy report on the PC#2

model_rand_weights_folded.onnx (2.8 MB)
trt_engine_3090_report.txt (630.8 KB)
Polygraphy_3090_report.txt (2.5 KB)
Polygraphy_3000_report.txt (2.0 KB)
trt_engine_3000_report.txt (574.5 KB)

Steps To Reproduce

Any basic TRT Python logic (based on the TRT SDK Python samples) that loads the ONNX model and, after a successful parse, uses the builder/network services to generate the engine file.
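A minimal sketch of that flow, following the standard TRT 8.x Python API (file names are placeholders, and the `tensorrt` import is deferred so the snippet can be read without the SDK installed):

```python
def build_engine(onnx_path, engine_path):
    """Parse an ONNX file and serialize a TensorRT engine (TRT 8.x API)."""
    import tensorrt as trt  # deferred: requires the TensorRT SDK and a GPU

    logger = trt.Logger(trt.Logger.VERBOSE)
    builder = trt.Builder(logger)
    # Explicit-batch network, as required by the ONNX parser.
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            # Surface all parser errors (e.g. the Pad_76 / Reshape_66 failures).
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parse failed")

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1 GiB tactic scratch space

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("Engine build failed")
    with open(engine_path, "wb") as f:
        f.write(bytearray(serialized))
```

Called as, e.g., `build_engine("model_rand_weights_folded.onnx", "model.engine")`.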

Some extra details:

The error is:
1: [convolutionRunner.cpp::nvinfer1::rt::task::CaskConvolutionRunner::onShapeChange::153] Error Code 1: Cask ( Failed to update runtime arguments.)

It is easy to see that different CUDA kernels are tested and selected for each specific GPU, for example:

3000 - Conv_220 Set Tactic Name: sm70_xmma_fprop_implicit_gemm_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize64x128x8_stage1_warpsize1x4x1_g1_ffma_aligna4_alignc4 Tactic: -2431551186657551688

3090 - Conv_220 Set Tactic Name: ampere_scudnn_winograd_128x128_ldg1_ldg4_relu_tile442t_nt_v1 Tactic: -6664441261382767776

The folded model was generated on the PC#1 and copied to PC#2 for TRT engine generation process.

If I try to fold the model on PC#2, the Polygraphy problem (described in the topic linked below) is emphasized and I get a different folded model.

Hi,
Request you to share the ONNX model and the script if not shared already so that we can assist you better.
Alongside, you can try a few things:

  1. Validate your model with the below snippet.

check_model.py

import sys
import onnx

filename = sys.argv[1]  # path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)
  2. Try running your model with the trtexec command.

In case you are still facing the issue, request you to share the trtexec --verbose log for further debugging.
Thanks!

Hello,
The ONNX model is attached.
I already generated the TRT engine using my Python code, so I think all the trtexec-style verbose output can be found in the attached report files.

Thanks,

Hi,

Were you able to build the TensorRT engine using only trtexec, without running the Polygraphy tool? Could you please share the original ONNX model with us?

Thank you.

Hello,
The original ONNX model cannot be built into a TRT engine.
I'm getting the following error:

> TRT - ERROR
[shuffleNode.cpp::nvinfer1::builder::ShuffleNode::symbolicExecute::392] Error Code 4: Internal Error (Reshape_66: IShuffleLayer applied to shape tensor must have 0 or 1 reshape dimensions: dimensions were [-1,2])
TRT - ERROR
ModelImporter.cpp:773: While parsing node number 76 [Pad -> "143"]:
TRT - ERROR
ModelImporter.cpp:774: --- Begin node ---
TRT - ERROR
ModelImporter.cpp:775: input: "120"
input: "142"
output: "143"
name: "Pad_76"
op_type: "Pad"
attribute {
  name: "mode"
  s: "reflect"
  type: STRING
}
TRT - ERROR
ModelImporter.cpp:776: --- End node ---
TRT - ERROR
ModelImporter.cpp:779: ERROR: ModelImporter.cpp:180 In function parseGraph:
[6] Invalid Node - Pad_76
[shuffleNode.cpp::nvinfer1::builder::ShuffleNode::symbolicExecute::392] Error Code 4: Internal Error (Reshape_66: IShuffleLayer applied to shape tensor must have 0 or 1 reshape dimensions: dimensions were [-1,2])

Attached is the original onnx model:
model.onnx (3.1 MB)

In order to avoid this error, I first have to run Polygraphy on it; only then does it successfully build into a TRT engine, but only on PC#1.
When I use the same setup (same onnx* package versions) on PC#2 and copy the folded model from PC#1 to PC#2, I am not able to build the TRT engine on PC#2.

If I try to fold the original model on PC#2, the problem described here is raised:
https://forums.developer.nvidia.com/t/two-machines-with-very-similar-sw-stack-but-different-gpus-generate-different-folded-model-using-the-polygraphy-tool-on-the-same-model-onnx-input/217850

That’s why I copy the folded model from PC#1 to PC#2.

Please advise,

Thanks

Hi,

Hope this issue has been solved as part of the Polygraphy topic linked above.

Thank you.

No, we have solved the other problem, but not the problem mentioned here.
I created the ONNX model, folded it using Polygraphy, and created the TRT engine on my computer. After that I took the same code to PC#2, created the ONNX model, folded it, and got the same problem when I tried to create the TRT engine.

I checked the two ONNX models before and after the Polygraphy process and there was no difference.
We have not solved the problem yet.
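For reference, one way to confirm the folded models really are byte-identical is to hash them with a small stdlib-only script (paths are placeholders; equal digests mean identical files, so a mismatch would point at the folding step rather than the TRT build):

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


# Example (hypothetical file names):
# same = sha256_of("folded_pc1.onnx") == sha256_of("folded_pc2.onnx")
```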