Is TensorRT “floating-point 16 precision mode” non-deterministic on Jetson TX2?

I’m using TensorRT FP16 precision mode to optimize my deep learning model, and I run this optimized model on a Jetson TX2. While testing the model, I have observed that the TensorRT inference engine is not deterministic. In other words, my optimized model gives FPS values varying between 40 and 120 for the same input images.
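For reference, here is a simplified sketch of how I measure FPS (the pycuda buffer allocation is omitted and the names are illustrative, but the timing logic is the same):

```python
import time
import pycuda.autoinit          # initializes a CUDA context
import pycuda.driver as cuda

def measure_fps(context, h_input, d_input, h_output, d_output, runs=100):
    # Time end-to-end inference: H2D copy, execute, D2H copy.
    stream = cuda.Stream()
    fps = []
    for _ in range(runs):
        start = time.perf_counter()
        cuda.memcpy_htod_async(d_input, h_input, stream)
        context.execute_async(bindings=[int(d_input), int(d_output)],
                              stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        stream.synchronize()    # wait until the GPU work is finished
        fps.append(1.0 / (time.perf_counter() - start))
    print("min FPS: %.1f, max FPS: %.1f" % (min(fps), max(fps)))
```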

I started to suspect that the source of the non-determinism is the floating-point operations when I saw this comment about CUDA:

https://devtalk.nvidia.com/default/topic/782499/cuda-programming-and-performance/cuda-result-changes-time-to-time/post/4338626/#4338626

Does the type of precision (FP16, FP32, INT8) affect the determinism of TensorRT on the Jetson TX2? Or is something else responsible?

Do you have any thoughts?

Best regards.

Hi,

1.
Please note that a TensorRT engine is not portable across platforms.
Did you build the engine directly on the TX2? If not, please do so to avoid any unexpected issues.

2.
Have you serialized your engine to a file first?
Please note that TensorRT may choose different implementations when creating a runtime engine, depending on the system status.
To get reproducible results, it is recommended to load a serialized engine file instead of creating the engine from the model each time.
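A minimal sketch with the TensorRT Python API (the file name model.plan is just an example):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def save_engine(engine, path="model.plan"):
    # Freeze the kernels/tactics selected at build time into a plan file.
    with open(path, "wb") as f:
        f.write(engine.serialize())

def load_engine(path="model.plan"):
    # Later runs reuse exactly the same plan instead of re-running
    # the (system-status-dependent) builder.
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())
```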

Thanks.

Yes, I built the engine directly on the TX2.

Yes. First I trained my model with TensorFlow, then I created an engine from the TF model and serialized it to a file (a “.plan” file).

So I have already done what you suggested.

By the way, while doing research I came across several discussions about the non-determinism of TensorFlow and cuDNN. What are your thoughts on those discussions?

As far as I know (please correct me if I’m wrong), TensorRT uses CUDA and cuDNN as its backend. Can non-determinism in CUDA and cuDNN affect my TRT engine?

I could not find anything about timing determinism in the developer guide. The cuDNN developer guide has this note about reproducibility: “By design, most of cuDNN’s routines from a given version generate the same bit-wise results across runs when executed on GPUs with the same architecture and the same number of SMs. However, bit-wise reproducibility (determinism) is not guaranteed across versions…” https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#reproducibility

Does TensorRT guarantee timing determinism for all of its operations?

Thank you @AastaLLL

Hi,

Would you mind trying whether this behavior also occurs with FP32 precision?

Suppose the non-determinism comes from the precision; then it should be reduced in FP32 mode.
The difference should be small and should not affect the accuracy.
But this is model-dependent: it is possible that the difference is amplified by certain layers, such as activations.
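A rough sketch of toggling the precision with the TensorRT 5 Python API (the UFF path, input/output names, and shape below are placeholders for your model):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(uff_path, use_fp16):
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network() as network, \
         trt.UffParser() as parser:
        parser.register_input("input", (3, 300, 300))   # placeholder
        parser.register_output("output")                # placeholder
        parser.parse(uff_path, network)
        builder.max_workspace_size = 1 << 28
        builder.fp16_mode = use_fp16    # False -> FP32 kernels only
        return builder.build_cuda_engine(network)
```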

Thanks.

Hi,
I tested the model with FP32 precision. You’re right: in FP32 mode the engine’s latency is slightly higher than in FP16 mode, but changing the mode has almost no effect on the accuracy. However, both precision modes are non-deterministic; inference timings (FPS) for the same image still vary.

Trying INT8 precision mode might produce deterministic results, but that mode is not supported on the TX2.

I can share my model architecture (https://pasteboard.co/I752Wby.png) with you. What is the main cause of the non-determinism? The model architecture? The optimized engine executed by TRT? Or both?

I think the source of the non-determinism is TRT itself. What’s your opinion?

Thank you,

Hi,

This is related to the cuDNN algorithms: some cuDNN algorithms are non-deterministic.

Would you mind sharing the operations of your TensorRT engine with us?
We want to check whether any non-deterministic operation is used inside your model.
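For example, with the Python API you can dump the layer list from the network object right after parsing, before the engine is built (a sketch, assuming you still have the network definition):

```python
import tensorrt as trt

def print_layers(network):
    # Print every layer's index, name, and type so we can spot
    # operations that may map to non-deterministic cuDNN kernels.
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        print("%3d  %-40s %s" % (i, layer.name, layer.type))
```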

By the way, it’s also worth giving TensorRT 5.1 a try.
Thanks.