TensorRT’s results differ between a 1080 Ti and a Jetson TX2

I ran my SSD model on a 1080 Ti and on a Jetson TX2, but they produce different results. The differences have little impact on the final detection output, but I want to know why.

I printed statistics for some of the intermediate layers: min, min_index, max, max_index, sum, mean, tss (total sum of squares), and var, and I found something strange:

1080 Ti:
min:-6.643838(0.000000) min_index:174167.000000(0.000000) max:16.168436(0.000000) max_index:140217.000000(0.000000) sum:-708.053076(-0.000049) mean:-0.003861(-0.000000) tss:1259940.186802(-0.002180) var:6.870952(-0.000000)

Jetson TX2:
min:-6.640625(0.000000) min_index:174167.000000(0.000000) max:16.162971(0.000000) max_index:140217.000000(0.000000) sum:-709.651303(0.000066) mean:-0.003870(0.000000) tss:1259667.312911(0.002610) var:6.869464(0.000000)

The numbers in parentheses are the differences between two runs of the same picture on the same platform; both models run in FP32. The 1080 Ti and Jetson TX2 results differ slightly, and any two inference runs on the same platform also differ slightly, but the per-run diff (the numbers in parentheses) appears only in sum and tss. I also found that these slight differences begin at conv4_3.

Are random numbers generated during inference? Or is this accumulated error from low-level CUDA numerical computation?
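On the second possibility: floating-point addition is not associative, so if a kernel accumulates partial sums in a different order (different thread/block scheduling on different GPUs, or even between two runs), the low-order bits of sum-like statistics can drift without any random numbers being involved. A minimal illustration with plain Python floats, though the same effect applies to FP32 on the GPU:

```python
# Floating-point addition is not associative: regrouping the same three
# addends changes the last bits of the result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)  # False: same inputs, different accumulation order
```

This matches the observed pattern: order-sensitive reductions (sum, tss) drift between runs, while order-insensitive values (min, max and their indices) stay identical.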

Is it documented anywhere that TensorRT reduces the precision of some heavy operations during inference, e.g. cutting 32-bit down to 16-bit to lower the computation cost, and then pads the 16-bit results back to 32-bit with random numbers?

And is TensorRT’s optimization hardware-dependent or not?


TensorRT’s optimization is GPU-dependent. A generated TensorRT engine is valid only for a specific GPU, or more precisely, for a specific CUDA Compute Capability. For example, if you generate a PLAN for an NVIDIA P4 (compute capability 6.1), you can’t use that PLAN on an NVIDIA Tesla V100 (compute capability 7.0).
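A common consequence is keeping one serialized engine per compute capability. The sketch below is only an illustration (the helper name and file-naming scheme are my own; in a real deployment you would query the capability at runtime, e.g. via `cudaGetDeviceProperties` or pycuda, and build/deserialize the matching PLAN with the TensorRT runtime):

```python
def engine_path_for(cc_major, cc_minor, base="ssd"):
    # Hypothetical helper: one serialized PLAN file per compute capability,
    # since an engine built for one capability cannot be reused on another.
    return f"{base}_sm{cc_major}{cc_minor}.plan"

# A 1080 Ti is compute capability 6.1 and a Jetson TX2 is 6.2, so the two
# platforms in this thread would each need their own engine file:
print(engine_path_for(6, 1))  # ssd_sm61.plan
print(engine_path_for(6, 2))  # ssd_sm62.plan
```

This also means the optimizer may pick different kernels (and thus different accumulation orders) on each GPU, which is consistent with the small FP32 differences you observed between the two platforms.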