Is TensorRT “floating-point 16 precision mode” non-deterministic on Jetson TX2?

I’m using TensorRT FP16 precision mode to optimize my deep learning model, and I run this optimized model on a Jetson TX2. While testing the model, I have observed that the TensorRT inference engine is not deterministic. In other words, my optimized model gives FPS values varying between 40 and 120 for the same input images.
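For reference, here is a simplified sketch of how I measure FPS (the pycuda buffer allocation is omitted and the names are illustrative, but the timing logic is the same):

```python
import time
import pycuda.autoinit          # initializes a CUDA context
import pycuda.driver as cuda

def measure_fps(context, h_input, d_input, h_output, d_output, runs=100):
    # Time end-to-end inference: H2D copy, execute, D2H copy.
    stream = cuda.Stream()
    fps = []
    for _ in range(runs):
        start = time.perf_counter()
        cuda.memcpy_htod_async(d_input, h_input, stream)
        context.execute_async(bindings=[int(d_input), int(d_output)],
                              stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        stream.synchronize()    # wait until the GPU work is finished
        fps.append(1.0 / (time.perf_counter() - start))
    print("min FPS: %.1f, max FPS: %.1f" % (min(fps), max(fps)))
```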

I started to suspect that the source of the non-determinism is the floating-point operations when I saw this comment about CUDA:

https://devtalk.nvidia.com/default/topic/782499/cuda-programming-and-performance/cuda-result-changes-time-to-time/post/4338626/#4338626

Does the type of precision (FP16, FP32, INT8) affect the determinism of TensorRT on the Jetson TX2? Or is something else responsible?

Do you have any thoughts?

Best regards.

Hi,

1.
Please note that a TensorRT engine is not portable across platforms.
Did you build the engine directly on the TX2? If not, please do so to avoid any unexpected issues.

2.
Have you serialized your engine to a file first?
Please note that TensorRT may choose different implementations when creating a runtime engine, depending on the system status.
To get reproducible results, it is recommended to load a serialized engine file instead of creating the engine from the model each time.
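A minimal sketch with the TensorRT Python API (the file name model.plan is just an example):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def save_engine(engine, path="model.plan"):
    # Freeze the kernels/tactics selected at build time into a plan file.
    with open(path, "wb") as f:
        f.write(engine.serialize())

def load_engine(path="model.plan"):
    # Later runs reuse exactly the same plan instead of re-running
    # the (system-status-dependent) builder.
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())
```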

Thanks.

Yes, I built the engine directly on the TX2.

Yes. First I trained my model with TensorFlow, then I created an engine from the TF model and serialized it to a file (a “.plan” file).

So I have already done what you suggested.

By the way, while doing research I came across several discussions about the non-determinism of TensorFlow and cuDNN. What are your thoughts on those discussions?

As far as I know (please correct me if I’m wrong), TensorRT uses CUDA and cuDNN as its backend. Can non-determinism in CUDA and cuDNN affect my TRT engine?

I could not find anything about timing determinism in the developer guide. The cuDNN developer guide has this note about reproducibility: “By design, most of cuDNN’s routines from a given version generate the same bit-wise results across runs when executed on GPUs with the same architecture and the same number of SMs. However, bit-wise reproducibility (determinism) is not guaranteed across versions…” https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#reproducibility

Does TensorRT guarantee timing determinism for all of its operations?

Thank you @AastaLLL

Hi,

Would you mind trying whether this behavior also occurs with FP32 precision?

Suppose the non-determinism comes from the precision; then it should be reduced in FP32 mode.
The difference should be small and should not affect the accuracy.
But this is model-dependent: it is possible that the difference is amplified by certain layers, such as activations.
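A rough sketch of toggling the precision with the TensorRT 5 Python API (the UFF path, input/output names, and shape below are placeholders for your model):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(uff_path, use_fp16):
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network() as network, \
         trt.UffParser() as parser:
        parser.register_input("input", (3, 300, 300))   # placeholder
        parser.register_output("output")                # placeholder
        parser.parse(uff_path, network)
        builder.max_workspace_size = 1 << 28
        builder.fp16_mode = use_fp16    # False -> FP32 kernels only
        return builder.build_cuda_engine(network)
```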

Thanks.

Hi,
I tested the model with FP32 precision. You’re right: in FP32 mode the engine’s latency is slightly higher than in FP16 mode, but changing the mode has almost no effect on the accuracy. However, both precision modes are non-deterministic; inference timings (FPS) for the same image still vary.

Trying INT8 precision mode might produce deterministic results, but that mode is not supported on the TX2.

I can share my model architecture (https://pasteboard.co/I752Wby.png) with you. What is the main cause of the non-determinism? The model architecture? The optimized engine executed by TRT? Or both?

I think the source of the non-determinism is TRT itself. What’s your opinion?

Thank you,

Hi,

This is related to the cuDNN algorithms: some cuDNN algorithms are non-deterministic.

Would you mind sharing the operations of your TensorRT engine with us?
We want to check whether any non-deterministic operation is used inside your model.
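For example, with the Python API you can dump the layer list from the network object right after parsing, before the engine is built (a sketch, assuming you still have the network definition):

```python
import tensorrt as trt

def print_layers(network):
    # Print every layer's index, name, and type so we can spot
    # operations that may map to non-deterministic cuDNN kernels.
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        print("%3d  %-40s %s" % (i, layer.name, layer.type))
```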

By the way, it’s also worth giving TensorRT 5.1 a try.
Thanks.