TRT 7.1.3 - invalid results, but only on Jetpack

Description

We’re using the same TensorRT wrapper code across multiple OSs. We’re seeing invalid results from a specific SSD model (Caffe), but only with TRT 7.1.3 on Jetpack. The same model and code produce the expected results with TRT 7.1.3 on Windows. Moreover, they produce the expected results on Jetpack when linked against and run with TRT 6.0.1.5.

I combed through the change log, but I’m not seeing anything between 6.0.1 and 7.1.3 that could explain this.
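
For context, the wrapper’s build path follows the standard Caffe import route. Below is a minimal sketch of that step, assuming TensorRT 7.1; the file names and blob names are illustrative placeholders, not our actual code:

// Minimal sketch of the engine-build path, assuming TensorRT 7.1 and a
// Caffe SSD deploy/model pair. File and blob names are illustrative.
#include <NvInfer.h>
#include <NvInferPlugin.h>
#include <NvCaffeParser.h>

using namespace nvinfer1;
using namespace nvcaffeparser1;

ICudaEngine* buildSsdEngine(ILogger& logger)
{
    // The SSD prototxt relies on TRT plugin layers (PriorBox, DetectionOutput, ...).
    initLibNvInferPlugins(&logger, "");

    IBuilder* builder = createInferBuilder(logger);
    INetworkDefinition* network = builder->createNetworkV2(0U); // implicit batch, as the Caffe parser requires
    ICaffeParser* parser = createCaffeParser();

    const IBlobNameToTensor* blobs =
        parser->parse("deploy.prototxt", "model.caffemodel", *network, DataType::kFLOAT);
    network->markOutput(*blobs->find("detection_out")); // SSD output blob

    IBuilderConfig* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1ULL << 28);
    builder->setMaxBatchSize(1);
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

    parser->destroy();
    network->destroy();
    config->destroy();
    builder->destroy();
    return engine;
}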

Example of the first 20 results when run with TRT 6.0.1.5:
21/02/09-16:27:53.839 E <222698> [root] 0 19 0.964429 0.58098 0.139638 0.968038 0.984788
21/02/09-16:27:53.839 E <222698> [root] 0 19 0.960657 0.176006 0.153037 0.700486 1
21/02/09-16:27:53.839 E <222698> [root] 0 19 0.954768 0 0.142767 0.320381 0.998101
21/02/09-16:27:53.840 E <222698> [root] 0 14 0.843132 0.6081 0.209714 0.750129 0.444225
21/02/09-16:27:53.840 E <222698> [root] 0 14 0.813285 0.32177 0.212763 0.47296 0.455428
21/02/09-16:27:53.840 E <222698> [root] 0 14 0.7806 0.12243 0.238069 0.277538 0.500606
21/02/09-16:27:53.840 E <222698> [root] 0 14 0.650497 0.421823 0.440682 0.56413 0.647324
21/02/09-16:27:53.840 E <222698> [root] 0 6 0.0418459 0.519575 0.945452 0.596312 0.998672
21/02/09-16:27:53.840 E <222698> [root] 0 6 0.0389161 0.583094 0.890014 0.621445 0.991069
21/02/09-16:27:53.840 E <222698> [root] 0 6 0.0375866 0.469749 0.933568 0.52858 0.998796
21/02/09-16:27:53.840 E <222698> [root] 0 19 0.0373655 0.475022 0.939281 0.497292 0.968301
21/02/09-16:27:53.840 E <222698> [root] 0 6 0.0355764 0.539965 0.895792 0.595048 0.973548
21/02/09-16:27:53.840 E <222698> [root] 0 19 0.033691 0.449244 0.945878 0.472696 0.971618
21/02/09-16:27:53.841 E <222698> [root] 0 6 0.0330456 0.406388 0.942024 0.47575 0.997496
21/02/09-16:27:53.841 E <222698> [root] 0 14 0.0327229 0.16491 0.861377 0.192026 0.890544
21/02/09-16:27:53.843 E <222698> [root] 0 12 0.0326452 0.190628 0.80501 0.822503 0.997762
21/02/09-16:27:53.843 E <222698> [root] 0 6 0.031406 0.528162 0.760967 0.572475 0.857066
21/02/09-16:27:53.844 E <222698> [root] 0 14 0.0302982 0.417627 0.928125 0.448149 0.948701
21/02/09-16:27:53.845 E <222698> [root] 0 14 0.0302664 0.389174 0.953735 0.433152 0.975139
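
To read these rows: after the log prefix, each line is one detection in what we assume is the standard Caffe SSD detection_out layout, i.e. seven floats per detection. A sketch of the print loop behind these lines (names are illustrative):

#include <cstdio>

// Each detection_out record is 7 floats:
// [imageId, label, confidence, xmin, ymin, xmax, ymax],
// with coordinates normalized to [0, 1].
void printDetections(const float* det, int count)
{
    for (int i = 0; i < count; ++i) {
        const float* d = det + i * 7;
        printf("%d %d %g %g %g %g %g\n",
               (int)d[0], (int)d[1], d[2], d[3], d[4], d[5], d[6]);
    }
}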

Same input, same hardware, but with 7.1.3. Note the coordinates that commonly appear (0.1 0.1 0.2 0.2 and 0 0 0 0):
21/02/09-16:16:13.222 E <220081> [root] 0 19 0.964561 0.1 0.1 0.2 0.2
21/02/09-16:16:13.223 E <220081> [root] 0 19 0.961353 0 0.536727 0.463273 1
21/02/09-16:16:13.224 E <220081> [root] 0 19 0.954514 0.536727 0.203394 1 0.796606
21/02/09-16:16:13.224 E <220081> [root] 0 19 0.897773 0.536727 0.536727 1 1
21/02/09-16:16:13.225 E <220081> [root] 0 19 0.860424 0.157498 0.131756 0.74709 0.989398
21/02/09-16:16:13.225 E <220081> [root] 0 14 0.841381 0.338757 0.438757 0.561243 0.661243
21/02/09-16:16:13.226 E <220081> [root] 0 14 0.813287 0.936514 0.319156 1 0.55064
21/02/09-16:16:13.226 E <220081> [root] 0 14 0.780265 0.722458 0.338098 0.877597 0.600533
21/02/09-16:16:13.227 E <220081> [root] 0 14 0.650944 0.038757 0.738757 0.261243 0.961243
21/02/09-16:16:13.227 E <220081> [root] 0 14 0.584413 0.038757 0.438757 0.261243 0.661243
21/02/09-16:16:13.228 E <220081> [root] 0 14 0.518747 0.633884 0.341663 0.71223 0.59662
21/02/09-16:16:13.229 E <220081> [root] 0 14 0.411828 0.196967 0.743934 0.303033 0.956066
21/02/09-16:16:13.229 E <220081> [root] 0 14 0.136188 0.685778 0.766292 0.735275 0.865287
21/02/09-16:16:13.230 E <220081> [root] 0 6 0.04224 0.1 0.1 0.2 0.2
21/02/09-16:16:13.230 E <220081> [root] 0 19 0.0370162 0 0 0 0
21/02/09-16:16:13.230 E <220081> [root] 0 19 0.0348733 0 0.258328 0.171998 0.931812
21/02/09-16:16:13.231 E <220081> [root] 0 19 0.0341472 0 0 0 0
21/02/09-16:16:13.231 E <220081> [root] 0 14 0.0328926 0 0 0 0
21/02/09-16:16:13.232 E <220081> [root] 0 19 0.0328726 0 0 0 0
21/02/09-16:16:13.232 E <220081> [root] 0 12 0.0327073 0.1 0.1 0.2 0.2

Update: we’ve confirmed this can be reproduced with the SSD model mentioned in the release notes here: https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-5.html#rel_5-0-2 – after editing it as described. It produces the exact same bogus output with the same characteristics, and it too works with 6.0.1.5.

Environment

TensorRT Version: 7.1.3
GPU Type: Xavier
Nvidia Driver Version: L4T R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020
CUDA Version: 10.2
CUDNN Version: 8.0
Baremetal or Container (if container which image + tag): Baremetal

Hi @alexm5m91,

This looks like a Jetson issue. We recommend that you raise it on the respective platform forum via the link below.

Thanks!

I’ve crossposted there (Issue with TensorRT 7.1.3 on Jetson AGX - #2 by alexm5m91), but it doesn’t seem to be getting any love.

Meanwhile, I’m quite stuck with this: given the nature of Jetpack, I can’t test whether a later version (say, 7.2) fixes it. I’ve been up and down our code (and, in fact, isolated the problem to a standalone program; its inference step is sketched below), and everything appears correct – except it’s 100% not working on Jetson. What options do we have?
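
For reference, the standalone program’s inference step boils down to the following. This is a sketch assuming the engine built as above; the binding names and the output size (keepTopK * 7) are placeholders for our model’s actual values:

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

// Sketch of the standalone repro's inference step. Binding names and the
// output size are placeholders; real sizes come from getBindingDimensions().
std::vector<float> runOnce(nvinfer1::ICudaEngine& engine,
                           const std::vector<float>& input)
{
    nvinfer1::IExecutionContext* ctx = engine.createExecutionContext();
    const int inIdx  = engine.getBindingIndex("data");          // placeholder name
    const int outIdx = engine.getBindingIndex("detection_out"); // placeholder name

    const size_t inBytes  = input.size() * sizeof(float);
    const size_t outBytes = 200 * 7 * sizeof(float); // keepTopK * 7 floats, placeholder

    void* buffers[2] = {};
    cudaMalloc(&buffers[inIdx], inBytes);
    cudaMalloc(&buffers[outIdx], outBytes);
    cudaMemcpy(buffers[inIdx], input.data(), inBytes, cudaMemcpyHostToDevice);

    ctx->execute(1, buffers); // batch size 1, synchronous

    std::vector<float> out(outBytes / sizeof(float));
    cudaMemcpy(out.data(), buffers[outIdx], outBytes, cudaMemcpyDeviceToHost);

    cudaFree(buffers[inIdx]);
    cudaFree(buffers[outIdx]);
    ctx->destroy();
    return out;
}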