InstanceNormalization produces NaN/Inf on TensorRT with my model

Description

Hi NVidia Team,

I converted a .onnx model to a TensorRT engine.
The converted engine generates NaN/Inf values.
The .onnx model itself does not generate these.

After inspecting the outputs, I found that they are produced when running the InstanceNormalization op.
Is this a weight issue?

I uploaded a .onnx file that contains only the part that executes InstanceNormalization.

Thank you.

Environment

GPU Type: GeForce RTX 2060 SUPER
Nvidia Driver Version: 535.98
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/pytorch:23.05-py3

Relevant Files

Steps To Reproduce

build environment

docker compose up

onnx test

python test-onnx.py

trtexec test

trtexec --onnx=repro.onnx --fp16 --dumpOutput
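For context, the ONNX-side check is essentially the following (a simplified sketch; test-onnx.py in the repo is the authoritative version, and the input handling here is my assumption):

import numpy as np
import onnxruntime as ort

# Run the repro model with ONNX Runtime and check every output for NaN/Inf.
sess = ort.InferenceSession("repro.onnx", providers=["CPUExecutionProvider"])
inp = sess.get_inputs()[0]
# Replace any dynamic dimensions with 1 for this illustration.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.randn(*shape).astype(np.float32)
for out in sess.run(None, {inp.name: x}):
    print("NaN:", np.isnan(out).any(), "Inf:", np.isinf(out).any())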

Hi,
We request you to share the ONNX model and the script, if not already shared, so that we can assist you better.
Meanwhile, you can try a few things:

  1. Validate your model with the below snippet.

check_model.py

import sys
import onnx

# Usage: python check_model.py your_model.onnx
filename = sys.argv[1]
model = onnx.load(filename)
onnx.checker.check_model(model)
  2. Try running your model with the trtexec command.

In case you are still facing the issue, we request you to share the trtexec --verbose log for further debugging.
Thanks!

Hello.

I shared the .onnx in the Relevant Files repo:

The test scripts are also contained in this repo.
Please check these.

Thank you.

I checked repro.onnx. It produces an error:

import onnx
import onnx.checker

model = onnx.load("repro.onnx")
print(model.ir_version)  # prints 9
onnx.checker.check_model(model)  # onnx.onnx_cpp2py_export.checker.ValidationError: Your model ir_version is higher than the checker's.

Certainly, TensorRT itself supports only up to opset 7.
On the other hand, onnx_tensorrt is stated to support opsets up to 17, so if the model is parsed by onnx_tensorrt, there should be no problem.
The onnx I imported is the one included in the container.
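For what it's worth, upgrading the onnx package gets past the checker (a sketch; my assumption is that onnx 1.14 or newer understands IR version 9):

# First: pip install --upgrade onnx
import onnx

model = onnx.load("repro.onnx")
onnx.checker.check_model(model)  # passes with a new enough checker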


I also tried running trtexec with --verbose, but I don't see anything suspicious except for the output itself.
verbose.log (2.0 MB)


I hope the issue is resolved.
Thank you.

Hello.

I tried to implement repro.onnx's InstanceNormalization with other ops (see the sketch below).

  • The converted onnx produces results close to those of repro.onnx.
  • trtexec does not produce Inf/NaN. (!)

The converted model is here.
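For reference, the replacement is essentially the textbook decomposition of instance normalization (a PyTorch sketch; the actual replacement is in the linked model, and the 4-D NCHW layout is an assumption here):

import torch

# InstanceNormalization expressed with elementary ops, following the ONNX
# definition: y = scale * (x - mean) / sqrt(var + eps) + bias, where
# mean/var are taken over the spatial dims of each (instance, channel).
def instance_norm_by_other_ops(x, scale, bias, eps=1e-5):
    mean = x.mean(dim=(2, 3), keepdim=True)
    var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
    y = (x - mean) / torch.sqrt(var + eps)
    return scale.view(1, -1, 1, 1) * y + bias.view(1, -1, 1, 1)

Exporting a module built on this with torch.onnx.export yields ReduceMean/Sub/Mul/Sqrt/Add nodes instead of a single InstanceNormalization node, which is why the node count grows.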

So I guess the InstanceNormalization_TRT plugin has an implementation issue.
According to the onnx-tensorrt repo, it is used when parsing InstanceNormalization.
Maybe it has something to do with two versions of InstanceNormalization_TRT being loaded.
You can see this in the verbose.log uploaded in the previous post.
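For anyone who wants to confirm the duplicate registrations without digging through the 2 MB log, a small sketch using the TensorRT Python API (assuming the same container):

import tensorrt as trt

# Enumerate the registered plugin creators and print the
# InstanceNormalization_TRT entries; two different versions show up.
trt.init_libnvinfer_plugins(trt.Logger(trt.Logger.WARNING), "")
for creator in trt.get_plugin_registry().plugin_creator_list:
    if "InstanceNormalization" in creator.name:
        print(creator.name, creator.plugin_version)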

Hope it helps you solve it.
Thank you.

Hi,

Sorry for the delayed response.
Are you still facing the issue?

Hello.

I am still getting around this with the workaround that replaces InstanceNormalization.
Since it increases the number of nodes in the onnx graph, I am looking, if possible, for a method that does not produce NaN or Inf while still using InstanceNormalization.

Thank you.

Hi,

We could reproduce similar behavior. Please allow us some time to work on a fix.

Thank you.

Hi,

As a workaround, we recommend that you use the native instance norm implementation.
Please pass --onnx-flags native_instancenorm in the Polygraphy command.

[I]             Relative Difference | Stats: mean=5.9293e-07, std-dev=2.1721e-05, var=4.7182e-10, median=0, min=0 at (0, 0, 0, 0), max=0.0026144 at (1, 6, 7, 24), avg-magnitude=5.9293e-07
[V]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 0.000261) |     261922 | ########################################
                    (0.000261, 0.000523) |         31 |
                    (0.000523, 0.000784) |        144 |
                    (0.000784, 0.00105 ) |         38 |
                    (0.00105 , 0.00131 ) |          3 |
                    (0.00131 , 0.00157 ) |          1 |
                    (0.00157 , 0.00183 ) |          3 |
                    (0.00183 , 0.00209 ) |          0 |
                    (0.00209 , 0.00235 ) |          0 |
                    (0.00235 , 0.00261 ) |          2 |
[I]         PASSED | Output: '/down_blocks.0/resnets.0/norm1/InstanceNormalization_output_0' | Difference is within tolerance (rel=0.001, abs=0.001)
[I]     PASSED | All outputs matched | Outputs: ['/down_blocks.0/resnets.0/norm1/InstanceNormalization_output_0']
[I] Accuracy Summary | trt-runner-N0-07/10/23-16:26:48 vs. onnxrt-runner-N0-07/10/23-16:26:48 | Passed: 1/1 iterations | Pass Rate: 100.0%
[I] PASSED | Runtime: 9.875s | Command: /usr/local/bin/polygraphy run repro.onnx --trt --onnxrt --pool-limit workspace:5G --verbose --atol 0.001 --rtol 0.001 --check-error-stat median --onnx-flags native_instancenorm
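If you build the engine through the TensorRT Python API rather than Polygraphy, the equivalent is the ONNX parser's NATIVE_INSTANCENORM flag (a sketch; requires TensorRT 8.6 or later):

import tensorrt as trt

# Parse the model with the parser's native InstanceNormalization
# implementation instead of the plugin; this is what
# --onnx-flags native_instancenorm toggles in Polygraphy.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
parser.set_flag(trt.OnnxParserFlag.NATIVE_INSTANCENORM)
with open("repro.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))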

The following may also be helpful to you:

Thank you.