cudaStreamSynchronize problem on Tesla T4

Description

Even after cudaStreamSynchronize returns, the data copied back from the device on that stream is sometimes still incomplete. We need to call usleep to add a delay, or set export CUDA_LAUNCH_BLOCKING=1, to get correct results.

Environment

TensorRT Version: nvcr.io/nvidia/tensorrt:20.0-py3
GPU Type: Tesla T4 on AWS
Nvidia Driver Version: 418.87.00
CUDA Version: 11.0
CUDNN Version:
Operating System + Version: nvcr.io/nvidia/tensorrt:20.0-py3 (host: AWS prebuilt GPU AMI)

Steps To Reproduce

We start 4 CUDA streams to do inference:
cudaMemcpyAsync (host to device) - stream 1
enqueue - stream 1
cudaMemcpyAsync (device to host) - stream 1
cudaMemcpyAsync (host to device) - stream 2
enqueue - stream 2
cudaMemcpyAsync (device to host) - stream 2
cudaMemcpyAsync (host to device) - stream 3
enqueue - stream 3
cudaMemcpyAsync (device to host) - stream 3
cudaMemcpyAsync (host to device) - stream 4
enqueue - stream 4
cudaMemcpyAsync (device to host) - stream 4

Then we call cudaStreamSynchronize on stream 1.
Sometimes the output from stream 1 is incorrect. If we add a usleep delay or set export CUDA_LAUNCH_BLOCKING=1, the result is okay.
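The steps above can be sketched as follows. This is a minimal sketch, not our full code: buffer pointers, sizes, and the enqueue_inference callback (standing in for the real TensorRT IExecutionContext enqueue call) are placeholders.

```cpp
#include <cuda_runtime.h>

// Sketch of the reproduction: 4 independent streams, each doing
// H2D copy -> enqueue -> D2H copy, then synchronizing stream 0 only.
// enqueue_inference is a hypothetical stand-in for the TensorRT
// execution-context enqueue on the given stream.
void repro(void* h_in[4], void* d_in[4], void* d_out[4], void* h_out[4],
           size_t in_bytes, size_t out_bytes,
           void (*enqueue_inference)(int idx, cudaStream_t stream)) {
    cudaStream_t streams[4];
    for (int i = 0; i < 4; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < 4; ++i) {
        cudaMemcpyAsync(d_in[i], h_in[i], in_bytes,
                        cudaMemcpyHostToDevice, streams[i]);
        enqueue_inference(i, streams[i]);          // TensorRT enqueue
        cudaMemcpyAsync(h_out[i], d_out[i], out_bytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    // Wait for stream 0 only; h_out[0] should now be valid,
    // but on the T4 it is sometimes still incomplete.
    cudaStreamSynchronize(streams[0]);
}
```

Note that for cudaMemcpyAsync to be truly asynchronous, the host buffers must be pinned (allocated with cudaMallocHost or cudaHostAlloc); with pageable host memory the copies go through a staged path.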

There seems to be a race condition in cudaStreamSynchronize.
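One way to cross-check this (a diagnostic sketch, not a fix) is to synchronize on a cudaEvent recorded after the device-to-host copy instead of on the stream, and to check every CUDA return code, since a silently ignored error from an earlier async call can look exactly like a synchronization race:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Report any non-success return code with the failing call's name.
#define CUDA_CHECK(call)                                      \
    do {                                                      \
        cudaError_t err_ = (call);                            \
        if (err_ != cudaSuccess)                              \
            fprintf(stderr, "%s failed: %s\n", #call,         \
                    cudaGetErrorString(err_));                 \
    } while (0)

// Record an event on the stream right after the D2H memcpy,
// then block on the event rather than on cudaStreamSynchronize.
void wait_for_copy(cudaStream_t stream) {
    cudaEvent_t done;
    CUDA_CHECK(cudaEventCreate(&done));
    CUDA_CHECK(cudaEventRecord(done, stream));   // after the D2H copy
    CUDA_CHECK(cudaEventSynchronize(done));      // blocks until it completes
    CUDA_CHECK(cudaEventDestroy(done));
}
```

If the event-based wait also returns before the output is complete, that points away from cudaStreamSynchronize itself and toward the copies or the enqueue being issued on the wrong stream.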

We also tried on a GeForce TITAN X with driver version 430.50, the same TensorRT version, and the same code; the problem does not occur on the GeForce TITAN X.

Hi @ek9852,
For Tesla T4:

CUDA Toolkit     | Linux x86_64 Driver Version | Windows x86_64 Driver Version
CUDA 11.0.189 RC | >= 450.36.06                | >= 451.22
Request you to try reproducing the issue after upgrading the Nvidia driver, as the latest release is r450.
Please find the support matrix to check for the same.
https://docs.nvidia.com/deeplearning/sdk/cudnn-support-matrix/index.html
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-713/support-matrix/index.html

Please share your code in case the issue persists.

Thanks!