PyTorch to ONNX to TRT: Unstable Output when Running trtexec

I am currently developing a PyTorch model which I am exporting to ONNX and running with TensorRT. I have been running into an issue where the output of the model appears to be unstable between runs (where I reload the model in TensorRT for each run).

This effect also seems to occur at random. I have used trtexec to load the model multiple times and run it on the same input; sometimes the outputs match (to within 1e-10) and other times I get more significant differences between them.

Environment

PyTorch 1.4.5
ONNX 1.6.0
TensorRT 7.1.3
AGX Xavier with JetPack 4.4.1

I have also observed this issue on an x86 desktop with a 1080 Ti.

Overview

I have written a script to test the stability of the model output between runs:

This script just calls trtexec twice and compares the output tensors pixel by pixel (with an epsilon of 1e-10):

/usr/src/tensorrt/bin/trtexec --fp16 --workspace=64 --iterations=1 --warmUp=0 --duration=0 --onnx=/path/to/onnx --exportOutput=/tmp/tmpp7x9g4e0.json
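
The comparison step is roughly the following (a sketch, not the exact script; it assumes the --exportOutput JSON is a list of output tensors each carrying a flattened "values" array, and the file paths are placeholders):

import json

EPS = 1e-10

def load_values(path):
    # trtexec --exportOutput writes a JSON list of output tensors; the flat
    # values are assumed to live under the "values" key of each entry
    with open(path) as f:
        return json.load(f)[0]["values"]

run1 = load_values("/tmp/run1.json")
run2 = load_values("/tmp/run2.json")

mismatches = [(i, a, b) for i, (a, b) in enumerate(zip(run1, run2)) if abs(a - b) > EPS]
for i, a, b in mismatches:
    print(f"Index[{i}]: Run 1({a}) vs Run 2({b})")
print("Validation Passed" if not mismatches else f"{len(mismatches)} values differ")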

For some of my experiments with smaller models, I see that there are small errors (on the order of ~0.001).

Index[0]: Run 1(2.07422) vs Run 2(2.07517)
Index[1]: Run 1(2.64453) vs Run 2(2.64456)
Index[2]: Run 1(-0.204102) vs Run 2(-0.208065)
Index[3]: Run 1(1.40332) vs Run 2(1.40317)
Index[4]: Run 1(-0.982422) vs Run 2(-0.982411)
Index[5]: Run 1(-0.0322266) vs Run 2(-0.0332114)
Index[6]: Run 1(-0.289062) vs Run 2(-0.287776)
Index[7]: Run 1(4.5) vs Run 2(4.50111)

However, sometimes this model produces exactly the same results on every run.

When I run models with more layers, I more consistently see this issue. In addition, I start to see higher errors (for example, I’ve seen an output of our production model differ by 0.3).

Basically I am trying to understand the following:

  1. Can an ONNX model produce different outputs for the same input across different runs? Is there some FP16 precision error that is random in nature? (See the float16 spacing check after this list.)

  2. Is there an issue with the ONNX model itself that is causing it to produce different values between iterations at random?


    I am only working with a few Conv, BatchNorm, and Clip layers and I still see this error.

  3. Is serialization the only way to guarantee identical results over different runs? If so, is there a way to make sure that the serialized engine matches the accuracy of the original ONNX / PyTorch model (i.e., that the same model produces the same output to within 1e-10)?
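
To put the error magnitudes in context (point 1 above): the spacing between adjacent float16 values near my observed outputs is already on the order of 1e-3, so run-to-run differences of that size are at least plausible for FP16 kernels that accumulate in a different order. This is only an illustration of float16 resolution, not a confirmed explanation:

import math

def f16_ulp(x):
    # float16 has 10 mantissa bits, so for a normal value x one ulp is
    # 2**(exponent - 10) with exponent = floor(log2(|x|))
    return 2.0 ** (math.floor(math.log2(abs(x))) - 10)

for x in [2.07422, 4.5, -0.0322266]:
    print(x, f16_ulp(x))
# near 2.0 one ulp is ~2e-3, near 4.5 ~4e-3, near 0.03 ~3e-5; the observed
# differences above span from a few to a few dozen ulps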

Hi,
Could you please share the ONNX model and the script (if not shared already) so that we can assist you better?
In the meantime, you can try a few things:

  1. Validate your model with the snippet below:

check_model.py

import onnx

filename = "yourONNXmodel.onnx"  # path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)
  2. Try running your model with the trtexec command:
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
If you are still facing the issue, please share the trtexec --verbose log for further debugging.
Thanks!

Here are the verbose outputs from two runs of trtexec where the outputs of the same model did not match:
run1.txt (93.0 KB)
run2.txt (92.5 KB)

This is urgent for us. Is there additional debugging information which would assist here?

Hi,

We tried running the script you shared with your ONNX model on the latest TRT version (8.0.1), without DLA, on an Ubuntu machine.
We are unable to reproduce the problem and got the following output from your script:
------------------ Validation Passed -------------------------

Based on the info you provided, the differences look very small, and we are not sure they would be a problem. You may need to wait for the latest TRT release for Jetson, or you can use an NGC container (NVIDIA NGC).

If you need further assistance, we recommend posting your concern on the Jetson-related forum.

Thank you.

Hi there,

We are currently not able to wait for TRT 8.0 to be integrated into our product. To clarify, were you able to observe the issue I described when running with TensorRT 7.1.3? I have observed this issue on my x86 Ubuntu 18.04 machine as well as on the Xavier.

Is there an x86 version of trtexec that I can use to validate and serialize models on my desktop?

Hi @VivekKrishnan,

We have tried TensorRT 7.1.3 as well, using nvcr.io/nvidia/tensorrt:20.09-py3, but we are unable to reproduce the issue.

&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --fp16 --workspace=64 --iterations=1 --warmUp=0 --duration=0 --verbose --dumpProfile --onnx=./model.onnx --exportOutput=/tmp/tmpelu7pwpl.json
------------------ Validation Passed -------------------------

Thank you.

I appreciate that you have looked into this issue. I am still observing it on the machines at my office. I have tried upgrading to TensorRT 8.0.1 with CUDA 10.2, but I still see the output fluctuating randomly between runs.

While most of the outputs vary only slightly (~0.001), I have observed that the output corresponding to confidence can vary much more (off by 0.5-1) for positive samples.

Is there any system-level information I can provide (e.g. dmesg logs, TensorRT verbose logs, system settings) to help with debugging? I have also tried running the mobilenetv2-7 model available here: models/mobilenetv2-7.onnx at main · onnx/models · GitHub. I do not observe this issue with the MobileNet model.

I observed that our model uses batch normalization, while the MobileNet model does not seem to have any batch normalization layers. Are there any known issues with batch normalization in the PyTorch → ONNX → TensorRT workflow?
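
For reference, this is roughly how I check which node types the exporter left in the graph (a sketch using the standard onnx Python API; the model path is a placeholder):

import onnx
from collections import Counter

model = onnx.load("/path/to/model.onnx")  # placeholder path
op_counts = Counter(node.op_type for node in model.graph.node)
print(op_counts)  # e.g. Conv, Clip, BatchNormalization counts
# helps confirm whether BatchNormalization nodes survived export or were folded into Conv
print("BatchNormalization nodes:", op_counts.get("BatchNormalization", 0))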

Thanks

@VivekKrishnan,

Can you please share the verbose logs from the latest TRT version and your script's output for better debugging?

Thank you.

For the TensorRT 8.0.1 test, I was just using trtexec and loading the outputs. I have captured some of the TensorRT verbose logs from running the model, and I see the following when I diff the engine-build logs (the first file is from a run that produced a higher confidence, and the second is from a run where the unstable output seems to have produced a much lower confidence):

< Layer(DepthwiseConvolution): (Unnamed Layer* 11) [Convolution] + (Unnamed Layer* 13) [Activation], Tactic: -1, 334[Float(96,112,112)] -> 337[Float(96,56,56)]
---
> Layer(Convolution): (Unnamed Layer* 11) [Convolution] + (Unnamed Layer* 13) [Activation], Tactic: 57, 334[Float(96,112,112)] -> 337[Float(96,56,56)]
23c22
< Layer(FusedConvActDirect): (Unnamed Layer* 59) [Convolution] + (Unnamed Layer* 61) [Activation], Tactic: 2097151, 382[Float(64,14,14)] -> 385[Float(384,14,14)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 59) [Convolution] + (Unnamed Layer* 61) [Activation], Tactic: 5177343, 382[Float(64,14,14)] -> 385[Float(384,14,14)]
26c25
< Layer(FusedConvActDirect): (Unnamed Layer* 68) [Convolution] + (Unnamed Layer* 70) [Activation], Tactic: 2097151, 391[Float(64,14,14)] -> 394[Float(384,14,14)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 68) [Convolution] + (Unnamed Layer* 70) [Activation], Tactic: 5177343, 391[Float(64,14,14)] -> 394[Float(384,14,14)]
29c28
< Layer(FusedConvActDirect): (Unnamed Layer* 77) [Convolution] + (Unnamed Layer* 79) [Activation], Tactic: 2097151, 400[Float(64,14,14)] -> 403[Float(384,14,14)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 77) [Convolution] + (Unnamed Layer* 79) [Activation], Tactic: 5177343, 400[Float(64,14,14)] -> 403[Float(384,14,14)]
32c31
< Layer(FusedConvActDirect): (Unnamed Layer* 86) [Convolution] + (Unnamed Layer* 88) [Activation], Tactic: 2097151, 409[Float(64,14,14)] -> 412[Float(384,14,14)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 86) [Convolution] + (Unnamed Layer* 88) [Activation], Tactic: 5177343, 409[Float(64,14,14)] -> 412[Float(384,14,14)]
35c34
< Layer(FusedConvActDirect): (Unnamed Layer* 94) [Convolution] + (Unnamed Layer* 96) [Activation], Tactic: 5898239, 417[Float(96,14,14)] -> 420[Float(576,14,14)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 94) [Convolution] + (Unnamed Layer* 96) [Activation], Tactic: 6291455, 417[Float(96,14,14)] -> 420[Float(576,14,14)]
37,38c36,37
< Layer(Convolution): (Unnamed Layer* 100) [Convolution] + (Unnamed Layer* 102) [ElementWise], Tactic: 0, 423[Float(576,14,14)], 417[Float(96,14,14)] -> 426[Float(96,14,14)]
< Layer(FusedConvActDirect): (Unnamed Layer* 103) [Convolution] + (Unnamed Layer* 105) [Activation], Tactic: 5898239, 426[Float(96,14,14)] -> 429[Float(576,14,14)]
---
> Layer(Convolution): (Unnamed Layer* 100) [Convolution] + (Unnamed Layer* 102) [ElementWise], Tactic: 57, 423[Float(576,14,14)], 417[Float(96,14,14)] -> 426[Float(96,14,14)]
> Layer(FusedConvActDirect): (Unnamed Layer* 103) [Convolution] + (Unnamed Layer* 105) [Activation], Tactic: 6291455, 426[Float(96,14,14)] -> 429[Float(576,14,14)]
40,41c39,40
< Layer(Convolution): (Unnamed Layer* 109) [Convolution] + (Unnamed Layer* 111) [ElementWise], Tactic: 0, 432[Float(576,14,14)], 426[Float(96,14,14)] -> 435[Float(96,14,14)]
< Layer(FusedConvActDirect): (Unnamed Layer* 112) [Convolution] + (Unnamed Layer* 114) [Activation], Tactic: 5898239, 435[Float(96,14,14)] -> 438[Float(576,14,14)]
---
> Layer(Convolution): (Unnamed Layer* 109) [Convolution] + (Unnamed Layer* 111) [ElementWise], Tactic: 57, 432[Float(576,14,14)], 426[Float(96,14,14)] -> 435[Float(96,14,14)]
> Layer(FusedConvActDirect): (Unnamed Layer* 112) [Convolution] + (Unnamed Layer* 114) [Activation], Tactic: 6291455, 435[Float(96,14,14)] -> 438[Float(576,14,14)]
44c43
< Layer(FusedConvActDirect): (Unnamed Layer* 120) [Convolution] + (Unnamed Layer* 122) [Activation], Tactic: 4259839, 443[Float(160,7,7)] -> 446[Float(960,7,7)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 120) [Convolution] + (Unnamed Layer* 122) [Activation], Tactic: 2097151, 443[Float(160,7,7)] -> 446[Float(960,7,7)]
46,47c45,46
< Layer(Convolution): (Unnamed Layer* 126) [Convolution] + (Unnamed Layer* 128) [ElementWise], Tactic: 0, 449[Float(960,7,7)], 443[Float(160,7,7)] -> 452[Float(160,7,7)]
< Layer(FusedConvActDirect): (Unnamed Layer* 129) [Convolution] + (Unnamed Layer* 131) [Activation], Tactic: 4259839, 452[Float(160,7,7)] -> 455[Float(960,7,7)]
---
> Layer(Convolution): (Unnamed Layer* 126) [Convolution] + (Unnamed Layer* 128) [ElementWise], Tactic: 57, 449[Float(960,7,7)], 443[Float(160,7,7)] -> 452[Float(160,7,7)]
> Layer(FusedConvActDirect): (Unnamed Layer* 129) [Convolution] + (Unnamed Layer* 131) [Activation], Tactic: 2097151, 452[Float(160,7,7)] -> 455[Float(960,7,7)]
49,50c48,49
< Layer(Convolution): (Unnamed Layer* 135) [Convolution] + (Unnamed Layer* 137) [ElementWise], Tactic: 0, 458[Float(960,7,7)], 452[Float(160,7,7)] -> 461[Float(160,7,7)]
< Layer(FusedConvActDirect): (Unnamed Layer* 138) [Convolution] + (Unnamed Layer* 140) [Activation], Tactic: 4259839, 461[Float(160,7,7)] -> 464[Float(960,7,7)]
---
> Layer(Convolution): (Unnamed Layer* 135) [Convolution] + (Unnamed Layer* 137) [ElementWise], Tactic: 57, 458[Float(960,7,7)], 452[Float(160,7,7)] -> 461[Float(160,7,7)]
> Layer(FusedConvActDirect): (Unnamed Layer* 138) [Convolution] + (Unnamed Layer* 140) [Activation], Tactic: 2097151, 461[Float(160,7,7)] -> 464[Float(960,7,7)]
53c52
< Layer(FusedConvActDirect): (Unnamed Layer* 146) [Convolution] + (Unnamed Layer* 148) [Activation], Tactic: 6750207, 469[Float(320,7,7)] -> 472[Float(1280,7,7)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 146) [Convolution] + (Unnamed Layer* 148) [Activation], Tactic: 4915199, 469[Float(320,7,7)] -> 472[Float(1280,7,7)]
56c55
< Layer(FusedConvActDirect): (Unnamed Layer* 152) [Convolution] + (Unnamed Layer* 153) [Activation], Tactic: 2097151, 475[Float(320,7,7)] -> 477[Float(160,7,7)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 152) [Convolution] + (Unnamed Layer* 153) [Activation], Tactic: 7012351, 475[Float(320,7,7)] -> 477[Float(160,7,7)]

The main difference is that some tactics differ, but I also noticed that Layer 11 was implemented as a depthwise convolution in one run vs. a regular convolution in the other run.

When I run the same test again, I also see one place where a different layer implementation is selected (Convolution vs. scudnn). Could these differences cause the precision error I am observing?

4490c4503
< Layer(Convolution): (Unnamed Layer* 22) [Convolution] + (Unnamed Layer* 24) [ElementWise], Tactic: 0, 345[Float(144,56,56)], 339[Float(24,56,56)] -> 348[Float(24,56,56)]
---
> Layer(scudnn): (Unnamed Layer* 22) [Convolution] + (Unnamed Layer* 24) [ElementWise], Tactic: -4420849921117327522, 345[Float(144,56,56)], 339[Float(24,56,56)] -> 348[Float(24,56,56)]
4493c4506
< Layer(FusedConvActDirect): (Unnamed Layer* 31) [Convolution], Tactic: 1179647, 354[Float(144,28,28)] -> 356[Float(32,28,28)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 31) [Convolution], Tactic: 4587519, 354[Float(144,28,28)] -> 356[Float(32,28,28)]
4496c4509
< Layer(Convolution): (Unnamed Layer* 39) [Convolution] + (Unnamed Layer* 41) [ElementWise], Tactic: 0, 362[Float(192,28,28)], 356[Float(32,28,28)] -> 365[Float(32,28,28)]
---
> Layer(Convolution): (Unnamed Layer* 39) [Convolution] + (Unnamed Layer* 41) [ElementWise], Tactic: 57, 362[Float(192,28,28)], 356[Float(32,28,28)] -> 365[Float(32,28,28)]
4499c4512
< Layer(Convolution): (Unnamed Layer* 48) [Convolution] + (Unnamed Layer* 50) [ElementWise], Tactic: 0, 371[Float(192,28,28)], 365[Float(32,28,28)] -> 374[Float(32,28,28)]
---
> Layer(Convolution): (Unnamed Layer* 48) [Convolution] + (Unnamed Layer* 50) [ElementWise], Tactic: 57, 371[Float(192,28,28)], 365[Float(32,28,28)] -> 374[Float(32,28,28)]
4502c4515
< Layer(FusedConvActDirect): (Unnamed Layer* 57) [Convolution], Tactic: 6750207, 380[Float(192,14,14)] -> 382[Float(64,14,14)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 57) [Convolution], Tactic: 2097151, 380[Float(192,14,14)] -> 382[Float(64,14,14)]
4505c4518
< Layer(Convolution): (Unnamed Layer* 65) [Convolution] + (Unnamed Layer* 67) [ElementWise], Tactic: 0, 388[Float(384,14,14)], 382[Float(64,14,14)] -> 391[Float(64,14,14)]
---
> Layer(Convolution): (Unnamed Layer* 65) [Convolution] + (Unnamed Layer* 67) [ElementWise], Tactic: 57, 388[Float(384,14,14)], 382[Float(64,14,14)] -> 391[Float(64,14,14)]
4508c4521
< Layer(Convolution): (Unnamed Layer* 74) [Convolution] + (Unnamed Layer* 76) [ElementWise], Tactic: 0, 397[Float(384,14,14)], 391[Float(64,14,14)] -> 400[Float(64,14,14)]
---
> Layer(Convolution): (Unnamed Layer* 74) [Convolution] + (Unnamed Layer* 76) [ElementWise], Tactic: 57, 397[Float(384,14,14)], 391[Float(64,14,14)] -> 400[Float(64,14,14)]
4511c4524
< Layer(Convolution): (Unnamed Layer* 83) [Convolution] + (Unnamed Layer* 85) [ElementWise], Tactic: 0, 406[Float(384,14,14)], 400[Float(64,14,14)] -> 409[Float(64,14,14)]
---
> Layer(Convolution): (Unnamed Layer* 83) [Convolution] + (Unnamed Layer* 85) [ElementWise], Tactic: 57, 406[Float(384,14,14)], 400[Float(64,14,14)] -> 409[Float(64,14,14)]
4515c4528
< Layer(FusedConvActDirect): (Unnamed Layer* 94) [Convolution] + (Unnamed Layer* 96) [Activation], Tactic: 5898239, 417[Float(96,14,14)] -> 420[Float(576,14,14)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 94) [Convolution] + (Unnamed Layer* 96) [Activation], Tactic: 7012351, 417[Float(96,14,14)] -> 420[Float(576,14,14)]
4517,4518c4530,4531
< Layer(Convolution): (Unnamed Layer* 100) [Convolution] + (Unnamed Layer* 102) [ElementWise], Tactic: 0, 423[Float(576,14,14)], 417[Float(96,14,14)] -> 426[Float(96,14,14)]
< Layer(FusedConvActDirect): (Unnamed Layer* 103) [Convolution] + (Unnamed Layer* 105) [Activation], Tactic: 5898239, 426[Float(96,14,14)] -> 429[Float(576,14,14)]
---
> Layer(Convolution): (Unnamed Layer* 100) [Convolution] + (Unnamed Layer* 102) [ElementWise], Tactic: 57, 423[Float(576,14,14)], 417[Float(96,14,14)] -> 426[Float(96,14,14)]
> Layer(FusedConvActDirect): (Unnamed Layer* 103) [Convolution] + (Unnamed Layer* 105) [Activation], Tactic: 7012351, 426[Float(96,14,14)] -> 429[Float(576,14,14)]
4520,4521c4533,4534
< Layer(Convolution): (Unnamed Layer* 109) [Convolution] + (Unnamed Layer* 111) [ElementWise], Tactic: 0, 432[Float(576,14,14)], 426[Float(96,14,14)] -> 435[Float(96,14,14)]
< Layer(FusedConvActDirect): (Unnamed Layer* 112) [Convolution] + (Unnamed Layer* 114) [Activation], Tactic: 5898239, 435[Float(96,14,14)] -> 438[Float(576,14,14)]
---
> Layer(Convolution): (Unnamed Layer* 109) [Convolution] + (Unnamed Layer* 111) [ElementWise], Tactic: 57, 432[Float(576,14,14)], 426[Float(96,14,14)] -> 435[Float(96,14,14)]
> Layer(FusedConvActDirect): (Unnamed Layer* 112) [Convolution] + (Unnamed Layer* 114) [Activation], Tactic: 7012351, 435[Float(96,14,14)] -> 438[Float(576,14,14)]
4523,4524c4536,4537
< Layer(FusedConvActDirect): (Unnamed Layer* 118) [Convolution], Tactic: 7012351, 441[Float(576,7,7)] -> 443[Float(160,7,7)]
< Layer(FusedConvActDirect): (Unnamed Layer* 120) [Convolution] + (Unnamed Layer* 122) [Activation], Tactic: 4259839, 443[Float(160,7,7)] -> 446[Float(960,7,7)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 118) [Convolution], Tactic: 2097151, 441[Float(576,7,7)] -> 443[Float(160,7,7)]
> Layer(FusedConvActDirect): (Unnamed Layer* 120) [Convolution] + (Unnamed Layer* 122) [Activation], Tactic: 1179647, 443[Float(160,7,7)] -> 446[Float(960,7,7)]
4526,4527c4539,4540
< Layer(Convolution): (Unnamed Layer* 126) [Convolution] + (Unnamed Layer* 128) [ElementWise], Tactic: 0, 449[Float(960,7,7)], 443[Float(160,7,7)] -> 452[Float(160,7,7)]
< Layer(FusedConvActDirect): (Unnamed Layer* 129) [Convolution] + (Unnamed Layer* 131) [Activation], Tactic: 4259839, 452[Float(160,7,7)] -> 455[Float(960,7,7)]
---
> Layer(Convolution): (Unnamed Layer* 126) [Convolution] + (Unnamed Layer* 128) [ElementWise], Tactic: 57, 449[Float(960,7,7)], 443[Float(160,7,7)] -> 452[Float(160,7,7)]
> Layer(FusedConvActDirect): (Unnamed Layer* 129) [Convolution] + (Unnamed Layer* 131) [Activation], Tactic: 1179647, 452[Float(160,7,7)] -> 455[Float(960,7,7)]
4529,4530c4542,4543
< Layer(Convolution): (Unnamed Layer* 135) [Convolution] + (Unnamed Layer* 137) [ElementWise], Tactic: 0, 458[Float(960,7,7)], 452[Float(160,7,7)] -> 461[Float(160,7,7)]
< Layer(FusedConvActDirect): (Unnamed Layer* 138) [Convolution] + (Unnamed Layer* 140) [Activation], Tactic: 4259839, 461[Float(160,7,7)] -> 464[Float(960,7,7)]
---
> Layer(Convolution): (Unnamed Layer* 135) [Convolution] + (Unnamed Layer* 137) [ElementWise], Tactic: 57, 458[Float(960,7,7)], 452[Float(160,7,7)] -> 461[Float(160,7,7)]
> Layer(FusedConvActDirect): (Unnamed Layer* 138) [Convolution] + (Unnamed Layer* 140) [Activation], Tactic: 1179647, 461[Float(160,7,7)] -> 464[Float(960,7,7)]
4532c4545
< Layer(FusedConvActDirect): (Unnamed Layer* 144) [Convolution], Tactic: 7012351, 467[Float(960,7,7)] -> 469[Float(320,7,7)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 144) [Convolution], Tactic: 2097151, 467[Float(960,7,7)] -> 469[Float(320,7,7)]
4535,4536c4548,4549
< Layer(FusedConvActDirect): (Unnamed Layer* 150) [Convolution] + (Unnamed Layer* 151) [Activation], Tactic: 7012351, 473[Float(1280,7,7)] -> 475[Float(320,7,7)]
< Layer(FusedConvActDirect): (Unnamed Layer* 152) [Convolution] + (Unnamed Layer* 153) [Activation], Tactic: 2097151, 475[Float(320,7,7)] -> 477[Float(160,7,7)]
---
> Layer(FusedConvActDirect): (Unnamed Layer* 150) [Convolution] + (Unnamed Layer* 151) [Activation], Tactic: 2097151, 473[Float(1280,7,7)] -> 475[Float(320,7,7)]
> Layer(FusedConvActDirect): (Unnamed Layer* 152) [Convolution] + (Unnamed Layer* 153) [Activation], Tactic: 7012351, 475[Float(320,7,7)] -> 477[Float(160,7,7)]
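
For reproducibility: the diffs above were produced roughly like this, by keeping only the per-layer "Layer(...)" lines from each --verbose log and diffing them (a sketch; the file names are placeholders, and difflib gives unified output rather than the classic diff format shown above):

import difflib

def layer_lines(path):
    # keep only the engine-layer summary lines from a trtexec --verbose log
    with open(path) as f:
        return [line.strip() for line in f if "Layer(" in line]

diff = difflib.unified_diff(layer_lines("run1_verbose.txt"),
                            layer_lines("run2_verbose.txt"),
                            lineterm="")
print("\n".join(diff))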

Hi,

Engine building is non-deterministic, so some variation is expected there. You can save the engine using --saveEngine=/path/to/engine and then load it with --loadEngine=/path/to/engine to get consistent outputs.
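
For example, roughly (paths are placeholders; the other flags are copied from the trtexec command earlier in the thread). Build once and serialize the engine:

/usr/src/tensorrt/bin/trtexec --fp16 --workspace=64 --onnx=/path/to/onnx --saveEngine=/path/to/engine

Then later runs reuse the serialized engine instead of rebuilding it:

/usr/src/tensorrt/bin/trtexec --loadEngine=/path/to/engine --iterations=1 --warmUp=0 --duration=0 --exportOutput=/tmp/output.json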

Thank you.