Same input, different output with TensorRT 4.0.1.6

Description

I have an inference engine written with TensorRT 4.0.1.6. Given one input file, it produces one output file. It supports multi-threading: when given multiple input files, it creates one thread per file. To check correctness, I made 10 copies of one input file and ran inference on all of them. The problem is that sometimes it gives 10 outputs with the same md5, and sometimes it gives 10 outputs, some of which have different md5s.

When I dump the outputs with differing md5s to text, they seem to differ only in precision.

I have checked my code's thread safety several times and found nothing suspicious. I use an independent ExecutionContext and binding buffers for each thread, as described in the reference. So I would appreciate some guidance: does this look like a bug in my code, or is it caused by some internal logic I don't know about? Thanks!
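For reference, a minimal sketch of the per-thread setup described above (TensorRT 4.x implicit-batch API). The function name, buffer sizes, and the assumption of exactly one input and one output binding are illustrative only, and error handling is omitted:

```cpp
// Each worker thread owns its own execution context, CUDA stream and device
// buffers; only the already-built engine is shared read-only between threads.
#include <NvInfer.h>
#include <cuda_runtime_api.h>

void runInference(nvinfer1::ICudaEngine* engine,             // shared engine
                  const void* hostInput, size_t inputBytes,  // hypothetical sizes
                  void* hostOutput, size_t outputBytes)
{
    nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    void* bindings[2] = {nullptr, nullptr};  // assumes binding 0 = input, 1 = output
    cudaMalloc(&bindings[0], inputBytes);
    cudaMalloc(&bindings[1], outputBytes);

    cudaMemcpyAsync(bindings[0], hostInput, inputBytes,
                    cudaMemcpyHostToDevice, stream);
    ctx->enqueue(/*batchSize=*/1, bindings, stream, nullptr);
    cudaMemcpyAsync(hostOutput, bindings[1], outputBytes,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFree(bindings[0]);
    cudaFree(bindings[1]);
    cudaStreamDestroy(stream);
    ctx->destroy();
}
```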

Environment

TensorRT Version: TensorRT 4.0.1.6
GPU Type: GeForce GTX TITAN X
Nvidia Driver Version: 430.50
CUDA Version: 10.1
CUDNN Version: 7.0.5
Operating System + Version: CentOS Linux release 7.4.1708
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

The precision difference seems to be very low, approximately 0.000001 in this case.

TRT 4.0 is a very old version; we recommend using the latest TRT version supported on your device in order to get better performance.

Thanks

Does this mean the same input may produce output with slightly different precision? I have been comparing the md5 of intermediate results to locate bugs; does this mean I should never do that again? Thanks.
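If exact md5 matches cannot be relied on, one alternative is an element-wise comparison with a small tolerance. Below is a minimal sketch assuming result dumps with one floating-point value per line; the helper names and the epsilon value are placeholders, not part of any TensorRT API:

```cpp
// Treat two result dumps as equal if every element agrees within a tolerance.
#include <cmath>
#include <fstream>
#include <vector>

static std::vector<double> loadDump(const char* path) {
    std::ifstream in(path);
    std::vector<double> v;
    for (double x; in >> x;) v.push_back(x);
    return v;
}

bool nearlyEqual(const char* fileA, const char* fileB, double eps = 1e-5) {
    const std::vector<double> a = loadDump(fileA);
    const std::vector<double> b = loadDump(fileB);
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > eps) return false;
    return true;
}
```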

TRT results should be deterministic if it's exactly the same plan file.
If you are building different engines in different threads, such differences are expected.
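For context, a minimal sketch of the "same plan file" pattern referred to above (TensorRT 7.x API): build and serialize the engine once, then deserialize that single plan and share the resulting engine across all worker threads, with each thread still creating its own IExecutionContext. The logger reference and plan path are assumptions:

```cpp
// Deserialize one plan file once; the returned engine is then shared
// read-only by every worker thread.
#include <NvInfer.h>
#include <fstream>
#include <iterator>
#include <vector>

nvinfer1::ICudaEngine* loadSharedEngine(nvinfer1::ILogger& logger,
                                        const char* planPath /* e.g. "model.plan" */)
{
    std::ifstream f(planPath, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(f)),
                            std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}
```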

Can you try the latest TRT release and let us know if the issue persists even after using the same plan file?

Thanks

OK, got it. Thanks. I will try the new version later and reply here.

Hi,

I have upgraded my code to TensorRT-7.0.0.11. The issue now looks slightly different. I ran a test with 20 worker threads per round for 100 rounds. Among the 2000 result files, 1937 have the same md5 (fa6d87b1f5b220a30f7647654a60f6c0) and 63 have another md5 (9d661f06c6a3c2f6c117c487227cbb9e).

I notice that the differing md5s seem to happen together with the following runtime error:

[TRT] FAILED_EXECUTION: std::exception
FAILED_EXECUTION: std::exception
FAILED_EXECUTION: std::exception
FAILED_EXECUTION: std::exception
[05/21/2020-15:04:09] [E] [TRT] FAILED_EXECUTION: std::exception
[05/21/2020-15:04:09] [F] [TRT] Assertion failed: *refCount > 0
../rtSafe/WeightsPtr.cpp:20
Aborting...

I guess that is the reason I get wrong results. So now the situation is:

1. The slight precision difference between parallel runs has disappeared; all valid results are identical.
2. There is some parallelism issue that leads to the runtime error and invalid results, which I still need to resolve (see the sketch after this list).
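The "Assertion failed: *refCount > 0" in WeightsPtr.cpp looks like a race on shared engine state, so one diagnostic experiment, purely a hypothesis and not a confirmed fix, is to serialize execution-context creation and destruction with a mutex while keeping the per-thread inference calls unsynchronized. gEngine and the function names below are placeholders:

```cpp
// Guard only createExecutionContext()/destroy() with a mutex; the actual
// enqueue/execute calls remain fully per-thread. Diagnostic sketch only.
#include <NvInfer.h>
#include <mutex>

extern nvinfer1::ICudaEngine* gEngine;  // shared engine, built or deserialized once
static std::mutex gCtxMutex;            // guards context create/destroy only

nvinfer1::IExecutionContext* createContextLocked() {
    std::lock_guard<std::mutex> lock(gCtxMutex);
    return gEngine->createExecutionContext();
}

void destroyContextLocked(nvinfer1::IExecutionContext* ctx) {
    std::lock_guard<std::mutex> lock(gCtxMutex);
    ctx->destroy();
}
```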

After searching for the runtime error, I was led to this post, which seems similar to my case:

So that's the latest progress on my issue.

Thanks.

Runtime environment:

CentOS 7.5.1804
GPU: TITAN V
Driver Version: 410.48
CUDA version: 10.0
TensorRT-7.0.0.11
cuDNN: 7.6.5

Can you share the sample script and model files to reproduce the issue so we can help better?
If possible please share the verbose error log as well.

Thanks

It is company product code written in C++, and I'm sorry I can't share it since it is commercial. I have switched the TRT log level to VERBOSE; I hope that gives you some clue.
Thanks.

log.log (2.0 MB)

Thanks, we will look into it and update you accordingly.
Meanwhile, can you try the fix suggested in the topic below?

Thanks

I have tried 6.0.1.5, and it seems to be a good workaround for me: there is no runtime error or crash. Although the 2000 results still fall into 2 distinct md5s, they are identical within each round, which looks like some initial-value issue, and it is much better than TRT 4, which produced different md5s within a single round. In any case, I have checked that both results are valid, differing only in precision. We will use TRT 6.0.1.5 until the issue is fixed in a newer version. Thanks for your help.