Model onnx trt engine generation process report different results compared between two PCs

orong13 · June 16, 2022, 6:05am

Description

The same ONNX model is built on two PCs that differ in GPU and software stack (NVIDIA libraries and Python packages; as far as I understand, these differences should not matter).
On PC#1 the TRT engine is generated successfully; on PC#2 it is not.

Environment

PC#1:
TensorRT Version: 8.4.0.6
GPU Type: Quadro RTX 3000
Nvidia Driver Version: R516.01 (r515_95-3) / 31.0.15.1601 (4-24-2022)
CUDA Version: 11.7
CUDNN Version: 8.1.1
Operating System + Version: Windows 10
Python Version (if applicable): 3.6.8
TensorFlow Version (if applicable): NA
PyTorch Version (if applicable): NA
Baremetal or Container (if container which image + tag): Baremetal

PC#2:
TensorRT Version: 8.4.0.6
GPU Type: GeForce 3090
Nvidia Driver Version: R511.65(r511_37-13) / 30.0.15.1165 (1-28-2022)
CUDA Version: 11.4
CUDNN Version: 8.1.1
Operating System + Version: Windows 10
Python Version (if applicable): 3.6.8
TensorFlow Version (if applicable): NA
PyTorch Version (if applicable): NA
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

Attached are:
The ONNX model
TRT engine generation report from PC#1
Polygraphy report from PC#1
TRT engine generation report from PC#2
Polygraphy report from PC#2

model_rand_weights_folded.onnx (2.8 MB)
trt_engine_3090_report.txt (630.8 KB)
Polygraphy_3090_report.txt (2.5 KB)
Polygraphy_3000_report.txt (2.0 KB)
trt_engine_3000_report.txt (574.5 KB)

Steps To Reproduce

Any basic TRT Python logic (based on the TRT SDK Python samples) that loads the ONNX model, parses it successfully, and then uses the builder/network services to generate the engine file.
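
For readers who want something concrete, the flow described above might look roughly like the sketch below. It is not the poster's actual script (that was not shared); the function name build_engine, the file paths, and the 1 GiB workspace limit are illustrative assumptions. It uses set_memory_pool_limit, which is available from TensorRT 8.4; older releases use config.max_workspace_size instead.

```python
from pathlib import Path


def build_engine(onnx_path, engine_path, workspace_bytes=1 << 30):
    """Parse an ONNX file and write a serialized TensorRT engine.

    Hypothetical helper; names and defaults are assumptions, not from the thread.
    """
    onnx_file = Path(onnx_path)
    if not onnx_file.is_file():
        raise FileNotFoundError("ONNX model not found: {}".format(onnx_file))

    # Imported lazily so the path check above works without TensorRT installed.
    import tensorrt as trt

    # VERBOSE logging prints the per-layer tactic selection during the build.
    logger = trt.Logger(trt.Logger.VERBOSE)
    builder = trt.Builder(logger)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)

    # Parse the model and surface every parser error if parsing fails.
    if not parser.parse(onnx_file.read_bytes()):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    # Assumed workspace limit of 1 GiB; tune for the target GPU.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_bytes)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("Engine build failed; check the verbose log.")

    with open(engine_path, "wb") as f:
        f.write(serialized)
```

A call such as build_engine("model_rand_weights_folded.onnx", "model.engine") would then be run separately on each PC, since an engine is built for the GPU it is generated on.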

Some extra details:

The error is:
1: [convolutionRunner.cpp::nvinfer1::rt::task::CaskConvolutionRunner::onShapeChange::153] Error Code 1: Cask ( Failed to update runtime arguments.)

It is easy to see that different CUDA kernels are tested and selected for each GPU, for example:

3000 - Conv_220 Set Tactic Name: sm70_xmma_fprop_implicit_gemm_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize64x128x8_stage1_warpsize1x4x1_g1_ffma_aligna4_alignc4 Tactic: -2431551186657551688

3090 - Conv_220 Set Tactic Name: ampere_scudnn_winograd_128x128_ldg1_ldg4_relu_tile442t_nt_v1 Tactic: -6664441261382767776
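
To compare the tactic choices across the two reports systematically, a small script along these lines could help. It is only a sketch: the line pattern is inferred from the two "Set Tactic Name" lines quoted above, real verbose logs may format these lines differently (for example, fused layer names containing spaces), and the function names are made up for illustration.

```python
import re

# Pattern inferred from the two log lines quoted in the post (an assumption).
# \s+ between the parts tolerates a line wrap after "Set Tactic Name:".
TACTIC_RE = re.compile(
    r"(\S+)\s+Set Tactic Name:\s+(\S+)\s+Tactic:\s+(-?\d+)"
)


def parse_tactics(log_text):
    """Map layer name -> (tactic name, tactic id) for every matching line."""
    return {
        layer: (name, int(tactic))
        for layer, name, tactic in TACTIC_RE.findall(log_text)
    }


def diff_tactics(log_a, log_b):
    """Return layers present in both logs whose selected tactic differs."""
    a = parse_tactics(log_a)
    b = parse_tactics(log_b)
    return {
        layer: (a[layer], b[layer])
        for layer in sorted(a.keys() & b.keys())
        if a[layer] != b[layer]
    }
```

Reading trt_engine_3000_report.txt and trt_engine_3090_report.txt as text and passing both to diff_tactics would list every layer where the two GPUs ended up with different kernels. Different tactics on different GPU architectures are expected in themselves; the diff only helps narrow down which layer and kernel are involved in the failure.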

The folded model was generated on PC#1 and copied to PC#2 for the TRT engine generation.

If I try to fold the model on PC#2, the issue described in my other topic (Polygraphy problem) appears and I get a different folded model.

NVES · June 16, 2022, 6:37am

Hi,
Please share the ONNX model and the script, if not shared already, so that we can assist you better.
In the meantime, you can try a few things:

1) Validate your model with the snippet below.

check_model.py

import onnx

filename = "yourONNXmodel"  # replace with the path to your ONNX model
model = onnx.load(filename)
# Raises an exception if the model is malformed.
onnx.checker.check_model(model)

2) Try running your model with the trtexec command.

If you are still facing the issue, please share the trtexec --verbose log for further debugging.
Thanks!

orong13 · June 17, 2022, 9:40pm

Hello,
Onnx model is attached.
I already generated the TRT engine using my Python code, so I think everything a trtexec verbose report would show can be found in the attached files.

Thanks,

spolisetty · June 20, 2022, 4:09pm

Hi,

Were you able to build the TensorRT engine using only trtexec, without running the Polygraphy tool? Could you please share the original ONNX model with us?

Thank you.

orong13 · June 21, 2022, 11:45am

Hello,
The original ONNX model cannot be built into a TRT engine.
I’m getting the following error:

TRT - ERROR
[shuffleNode.cpp::nvinfer1::builder::ShuffleNode::symbolicExecute::392] Error Code 4: Internal Error (Reshape_66: IShuffleLayer applied to shape tensor must have 0 or 1 reshape dimensions: dimensions were [-1,2])
TRT - ERROR
ModelImporter.cpp:773: While parsing node number 76 [Pad -> "143"]:
TRT - ERROR
ModelImporter.cpp:774: --- Begin node ---
TRT - ERROR
ModelImporter.cpp:775: input: "120"
input: "142"
output: "143"
name: "Pad_76"
op_type: "Pad"
attribute {
  name: "mode"
  s: "reflect"
  type: STRING
}
TRT - ERROR
ModelImporter.cpp:776: --- End node ---
TRT - ERROR
ModelImporter.cpp:779: ERROR: ModelImporter.cpp:180 In function parseGraph:
[6] Invalid Node - Pad_76
[shuffleNode.cpp::nvinfer1::builder::ShuffleNode::symbolicExecute::392] Error Code 4: Internal Error (Reshape_66: IShuffleLayer applied to shape tensor must have 0 or 1 reshape dimensions: dimensions were [-1,2])

Attached is the original onnx model:
model.onnx (3.1 MB)

To avoid this error, I first have to run Polygraphy on the model; only then does it build into a TRT engine, and only on PC#1.
When I use the same setup (same onnx* package versions) on PC#2 and copy the folded model from PC#1 to PC#2, I am not able to build the TRT engine on PC#2.

If I try to fold the original model on PC#2, the problem described here is raised:
https://forums.developer.nvidia.com/t/two-machines-with-very-similar-sw-stack-but-different-gpus-generate-different-folded-model-using-the-polygraphy-tool-on-the-same-model-onnx-input/217850

That’s why I copy the folded model from PC#1 to PC#2.

Please advise,

Thanks

spolisetty · June 24, 2022, 11:03am

Hi,

Hope this issue has been solved as part of

Thank you.

daniel60030 · June 26, 2022, 11:36am

No, we solved the other problem but not the one described here.
I created the ONNX model, folded it using Polygraphy, and created the TRT engine on my computer. I then took the same code to PC#2, created the ONNX model, folded it, and got the same problem when I tried to create the TRT engine.

I checked the two ONNX models before and after the Polygraphy process, and there was no difference.
We have not solved the problem yet.
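
One simple way to back up a "no difference" check between model files copied across machines is to compare file hashes. The sketch below is an assumption about how such a check could be done, not the method actually used in this thread; the function names are illustrative. Note that byte-identical files are a stricter condition than equivalent graphs, so two models can differ in bytes and still describe the same network.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path_a, path_b):
    """True if both files have identical contents (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Running sha256_of on the folded model on each PC and comparing the printed digests confirms whether both machines are really feeding the same bytes to the TensorRT builder.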

spolisetty · July 5, 2022, 5:47am

Hi,

Sorry for the delayed response.
Could you please confirm at which stage you are facing the error on PC#2? Is it while building the TensorRT engine?
Could you please upgrade TensorRT to the latest 8.4 GA version and try again?
https://developer.nvidia.com/nvidia-tensorrt-8x-download

If you still face this issue, please share the trtexec --verbose logs, the Polygraphy logs, and the folded model with us for better debugging.

Thank you.

daniel60030 · July 6, 2022, 6:56am

Hi, I upgraded the TensorRT version as you suggested and the problem has been solved.
Thank you for your support.

system · July 20, 2022, 6:57am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
