Hi,
I have upgraded my code to TensorRT-7.0.0.11, and the issue now looks slightly different. I ran a test of 100 rounds with 20 worker threads each. Among the 2000 result files, 1937 have the same md5 (fa6d87b1f5b220a30f7647654a60f6c0) and 63 have a different md5 (9d661f06c6a3c2f6c117c487227cbb9e).
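For anyone who wants to repeat this kind of check, here is a minimal sketch of how the md5 values of the result files can be tallied. It assumes all result files sit in a single directory; the function names `md5_of_file` and `tally_md5` are hypothetical helpers for illustration, not part of TensorRT or of my actual test code.

```python
import hashlib
from collections import Counter
from pathlib import Path


def md5_of_file(path):
    """Return the hex md5 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tally_md5(result_dir):
    """Count how many regular files in result_dir share each md5 digest."""
    counts = Counter()
    for path in sorted(Path(result_dir).iterdir()):
        if path.is_file():
            counts[md5_of_file(path)] += 1
    return counts
```

Calling `tally_md5` on the result directory and printing `most_common()` lists each distinct digest with its file count, so any digest other than the majority one points at the runs worth inspecting.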
I noticed that the different md5 seems to occur together with the following runtime error:
[TRT] FAILED_EXECUTION: std::exception
FAILED_EXECUTION: std::exception
FAILED_EXECUTION: std::exception
FAILED_EXECUTION: std::exception
[05/21/2020-15:04:09] [E] [TRT] FAILED_EXECUTION: std::exception
[05/21/2020-15:04:09] [F] [TRT] Assertion failed: *refCount > 0
../rtSafe/WeightsPtr.cpp:20
Aborting...
I guess that is the reason I get wrong results. So the current situation is:
1. The slight precision differences between parallel runs have disappeared. All valid results are identical.
2. There is still some parallelism issue that leads to the runtime error and invalid results, which I need to resolve.
After searching for the runtime error, I was led to this post, which seems similar to my case:
So that's the latest progress on my issue.
Thanks.
runtime env:
CentOS 7.5.1804
GPU: TITAN V
Driver Version: 410.48
CUDA version: 10.0
TensorRT-7.0.0.11
cuDNN: 7.6.5
