I implemented my SSD model on a 1080 Ti and a Jetson TX2, but they produce different results. Although these differences have little impact on the final detection output, I want to know why.

I printed statistics for some of the intermediate layers: min, min_index, max, max_index, sum, mean, tss (total sum of squares), and var, and I found something strange:

1080 Ti:

[mbox_conf]:

min:-6.643838(0.000000) min_index:174167.000000(0.000000) max:16.168436(0.000000) max_index:140217.000000(0.000000) sum:-708.053076(-0.000049) mean:-0.003861(-0.000000) tss:1259940.186802(-0.002180) var:6.870952(-0.000000)

Jetson TX2:

[mbox_conf]:

min:-6.640625(0.000000) min_index:174167.000000(0.000000) max:16.162971(0.000000) max_index:140217.000000(0.000000) sum:-709.651303(0.000066) mean:-0.003870(0.000000) tss:1259667.312911(0.002610) var:6.869464(0.000000)

The numbers in parentheses represent the difference between this run and the last run with the same picture on the same platform; both models run entirely in FP32. The 1080 Ti and Jetson TX2 results differ slightly, and any two inference runs on the same platform also differ slightly, but the diff (the numbers in parentheses) appears only in sum and tss. I found that these slight changes begin at conv4_3.
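For reference, the statistics above are computed from the flattened layer blob, roughly like this (the helper name and the synthetic input in the test are my own; the real values come from the network's output buffers):

```python
import numpy as np

def layer_stats(blob: np.ndarray) -> dict:
    """Compute the per-layer statistics printed above from a layer blob."""
    flat = blob.ravel().astype(np.float32)
    return {
        "min": float(flat.min()),
        "min_index": int(flat.argmin()),
        "max": float(flat.max()),
        "max_index": int(flat.argmax()),
        "sum": float(flat.sum(dtype=np.float32)),
        "mean": float(flat.mean(dtype=np.float32)),
        # tss = sum of squares of the raw values
        "tss": float(np.square(flat).sum(dtype=np.float32)),
        "var": float(flat.var(dtype=np.float32)),
    }
```

Note that min, max, and their indices are order-independent, while sum, mean, tss, and var all involve a reduction whose result can depend on accumulation order in float32.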

Are random numbers generated during inference? Or is this cumulative error from low-level CUDA numerical computation?
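My own guess is accumulation order rather than randomness: float32 addition is not associative, so a parallel reduction that splits the work differently (between GPUs, or between runs if a kernel uses atomics) changes the last bits of sum and tss while leaving order-independent statistics like min/max and their indices untouched. A minimal sketch (the array size is arbitrary, just chosen to be comparable to mbox_conf):

```python
import numpy as np

# Float32 addition is not associative: summing the same numbers in a
# different order (as parallel GPU reductions may do) can change the
# last bits of the result, with no randomness in the data at all.
rng = np.random.default_rng(0)                       # fixed seed: identical data every run
x = rng.standard_normal(100_000).astype(np.float32)  # arbitrary, roughly mbox_conf-sized

s_seq = np.float32(0.0)
for v in x:                          # sequential left-to-right accumulation
    s_seq = np.float32(s_seq + v)

s_blocked = x.reshape(-1, 4).sum(axis=1).sum()  # a different (blocked) reduction order

# The two sums typically agree to only ~6-7 significant digits, while
# order-independent statistics match exactly:
print(float(s_seq), float(s_blocked))
print(int(np.argmin(x)), int(np.argmax(x)))
```

This matches what I observe: min_index and max_index are identical across platforms and runs, and only the reduced quantities (sum, tss) drift.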

Is there any documentation saying that TensorRT reduces the precision of some heavy operations during inference, e.g. cutting 32-bit down to 16-bit to reduce computation cost, and then restores the 16-bit results to 32-bit by filling in random numbers?
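On this last point, as far as I understand, converting FP32 down to FP16 and back is deterministic rounding, and widening FP16 back to FP32 zero-fills the extra mantissa bits, so nothing random would be injected even if such a precision cut happened. A quick check with NumPy:

```python
import numpy as np

# FP32 -> FP16 -> FP32 is deterministic rounding, not random filling:
# converting the same value twice always yields exactly the same result.
x = np.float32(0.1)      # not exactly representable in binary
half = np.float16(x)     # round to nearest FP16
back = np.float32(half)  # widen again: low mantissa bits are zero, not random

print(float(x), float(back))  # the round trip loses precision deterministically
```

Also, as far as I know, TensorRT runs everything in FP32 unless reduced precision (FP16/INT8) is explicitly enabled when building the engine, so an implicit cut to 16-bit would not explain the run-to-run diffs anyway.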

And is the optimization that TensorRT performs hardware-dependent or not?