Same input different output with Tensorrt4.0.1.6

hl1008 · May 19, 2020, 11:19am

Description

I have a inference engine written with Tensorrt 4.0.1.6. When given one file as input it will give one output file. It support multi-thread and when given multi input files it will create one thread per file. In order to check the correctness, I make 10 copies of one input file and try to inference them. The problem is sometimes it give 10 same output with same md5, and sometimes it give 10 outputs some of which has different md5.

When I dump the different md5 output to text, I find they seems just have different precision

I have check my code’s thread safety for several times and find nothing suspicious. I use independent ExecutionContext and binding buffers for each thread according to the reference. So I expect some help of the direction. Does it seem like my code bugs or it caused by some internal logic that I don’t know. Thanks!

Environment

TensorRT Version: Tensorrt 4.0.1.6
GPU Type: GeForce GTX TITAN X
Nvidia Driver Version: 430.50
CUDA Version: 10.1
CUDNN Version: 7.0.5
Operating System + Version: CentOS Linux release 7.4.1708
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

Exact steps/commands to build your repro
Exact steps/commands to run your repro
Full traceback of errors encountered

SunilJB · May 19, 2020, 12:19pm

Precision difference seems to be very low, approx 0.000001 in this case.

TRT 4.0 is very old version, will recommend you to use latest supported TRT version on your device in order to get better performance.

Thanks

hl1008 · May 20, 2020, 2:56am

Does it mean the same input may get the slightly different precision output? Since I used to compare md5 of the intermediate result to locate bugs. So does it mean I should not never do that again? Thanks.

SunilJB · May 20, 2020, 5:18am

TRT results should be deterministic if it’s exactly the same plan file.
If you are building different engines in different threads, that’s expected.

Can you try latest TRT release and let us know if the issue persist even after using the same plan file?

Thanks

hl1008 · May 20, 2020, 6:16am

Ok, get it. Thanks. I will try new version later and reply here.

hl1008 · May 21, 2020, 7:46am

hi,

I have upgrade my code to TensorRT-7.0.0.11. The issue seems slightly different. I have run a test of 20 worker threads each time for 100 round. Among the 2000 result files, I get 1937 files with the same md5(fa6d87b1f5b220a30f7647654a60f6c0) and 63 for another md5(9d661f06c6a3c2f6c117c487227cbb9e).

I notice the different md5 seems happend together with the following runtime error:

[TRT] FAILED_EXECUTION: std::exception
FAILED_EXECUTION: std::exception
FAILED_EXECUTION: std::exception
FAILED_EXECUTION: std::exception
[05/21/2020-15:04:09] [E] [TRT] FAILED_EXECUTION: std::exception
[05/21/2020-15:04:09] [F] [TRT] Assertion failed: *refCount > 0
../rtSafe/WeightsPtr.cpp:20
Aborting...

I guess that should be the reason I get wrong result. So now the situation is:

1. the slightly precision difference among parallel disappear. All the valid result is the same.
2. there is some parallel issue lead to runtime error and invalid result that I still need to resolve.

After I search the runtime error, I am lead to this post which seems similar to my case:

So that’s latest progress on my issue.

Thanks.

runtime env:

CentOS 7.5.1804
GPU: TITAN V
Driver Version: 410.48
CUDA version: 10.0
TensorRT-7.0.0.11
Cudnn: 7.6.5

SunilJB · May 22, 2020, 10:46am

Can you share the sample script and model files to reproduce the issue so we can help better?
If possible please share the verbose error log as well.

Thanks

hl1008 · May 23, 2020, 3:49am

It is company product code written in c++ and I’m sorry I can’t share it since it is for commercial. I have switch the TRT log to VERBOSE. Hope that can given you some clue.
Thanks.log.log (2.0 MB)

SunilJB · May 23, 2020, 7:40pm

Thanks, we will look into it and update you accordingly.
Meanwhile, can you try the fix suggested in below topic?

Thanks

hl1008 · May 25, 2020, 6:59am

I have tried 6.0.1.5 and it seems a good workaround for me. There is no runtime error or crash. Though the total 2000 results fall to 2 distinct md5, they are the same in each round. It seems like some initial value issue. And it is much better than trt4 which has different md5 in a single round. Anyway I have checked the both results are valid with only precision difference. We will use trt6.0.1.5 before you fix the issue in new version. Thanks for your help.

Topic		Replies	Views
Different TensorRT inference results from the same input when batchSize > 1 TensorRT	2	2069	October 12, 2021
Is TensorRT “floating-point 16 precision mode” non-deterministic on Jetson TX2? Jetson TX2	6	1472	October 18, 2021
Different TensorRT inference results for the same input TensorRT	2	1552	October 23, 2018
Thread safe while use tensorRT TensorRT	1	2685	March 25, 2019
Same tensorRT code get different result TensorRT	10	2267	July 23, 2019
Output changes for the same input when the neural net has been run for several times? TensorRT	19	1769	October 30, 2018
Non-deterministic TensorRT engine building TensorRT tensorrt	3	637	March 10, 2021
Error run 2 context parallel in TensorRT7 TensorRT	13	2698	July 5, 2021
Output is not stable TensorRT	7	620	October 12, 2021
TensorRT inference result of one image don't keep the same in high qps TensorRT tensorrt	1	633	June 29, 2022

Same input different output with Tensorrt4.0.1.6

Description

Environment

Relevant Files

Steps To Reproduce

Related topics