cudaStreamSynchronize problem on Tesla T4

Description

Even after cudaStreamSynchronize returns, the data retrieved from the stream is still incomplete. We need to call usleep to add a delay, or set export CUDA_LAUNCH_BLOCKING=1.

Environment

TensorRT Version: nvcr.io/nvidia/tensorrt:20.0-py3
GPU Type: Tesla T4 on AWS
Nvidia Driver Version: 418.87.00
CUDA Version: 11.0
CUDNN Version:
Operating System + Version: nvcr.io/nvidia/tensorrt:20.0-py3 (host: AWS prebuilt GPU AMI)

Steps To Reproduce

We start 4 CUDA streams to do inference:
cudaMemcpyAsync (host to device) - stream 1
enqueue - stream 1
cudaMemcpyAsync (device to host) - stream 1
cudaMemcpyAsync (host to device) - stream 2
enqueue - stream 2
cudaMemcpyAsync (device to host) - stream 2
cudaMemcpyAsync (host to device) - stream 3
enqueue - stream 3
cudaMemcpyAsync (device to host) - stream 3
cudaMemcpyAsync (host to device) - stream 4
enqueue - stream 4
cudaMemcpyAsync (device to host) - stream 4

Then we call cudaStreamSynchronize on stream 1.
Sometimes the output from stream 1 is incorrect. If we use usleep or export CUDA_LAUNCH_BLOCKING=1, then the result is okay.
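The sequence above can be sketched roughly as follows. This is a minimal, hypothetical reproduction: dummyInfer stands in for the TensorRT enqueue call, and the buffer size and launch configuration are placeholders, not the real values from our application.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for the real TensorRT IExecutionContext::enqueue call.
__global__ void dummyInfer(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int kStreams = 4;
    const int kN = 1 << 20;                 // placeholder buffer size
    cudaStream_t streams[kStreams];
    float *hBuf[kStreams], *dBuf[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        // Pinned host memory: cudaMemcpyAsync is only truly
        // asynchronous with page-locked host buffers.
        cudaMallocHost(&hBuf[s], kN * sizeof(float));
        cudaMalloc(&dBuf[s], kN * sizeof(float));
    }

    // H2D copy, inference, D2H copy -- each enqueued on its own stream.
    for (int s = 0; s < kStreams; ++s) {
        cudaMemcpyAsync(dBuf[s], hBuf[s], kN * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        dummyInfer<<<(kN + 255) / 256, 256, 0, streams[s]>>>(dBuf[s], kN);
        cudaMemcpyAsync(hBuf[s], dBuf[s], kN * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // After this returns, hBuf[0] should be safe to read on the host.
    cudaStreamSynchronize(streams[0]);

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaFreeHost(hBuf[s]);
        cudaFree(dBuf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```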

There seems to be a race condition in cudaStreamSynchronize.

We also tried on a GeForce TITAN X with driver version 430.50, the same TensorRT version, and the same code; the problem does not occur on the GeForce TITAN X.

Hi @ek9852,
For Tesla T4,
CUDA Toolkit: CUDA 11.0.189 RC
Linux x86_64 Driver Version: >= 450.36.06
Windows x86_64 Driver Version: >= 451.22
Please try reproducing the issue after upgrading your NVIDIA driver, as the latest is r450.
Please check the support matrix for the required versions:

Support Matrix :: NVIDIA Deep Learning TensorRT Documentation

Please share your code in case the issue persists.

Thanks!

Tesla T4
We tried with NVIDIA-SMI 450.51.05, Driver Version 450.51.05, CUDA Version 11.0.

The problem still persists: we need to use CUDA_LAUNCH_BLOCKING=1, otherwise cudaStreamSynchronize does not guarantee the stream's work is done.

With the same code base, this problem does not occur on the TITAN GPU card.
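One thing worth ruling out before blaming cudaStreamSynchronize itself: asynchronous launch or copy errors are only reported by a *later* CUDA call, so an unchecked failure upstream can look like a synchronization bug. A hedged debugging sketch, using only standard CUDA runtime calls (the macro name CUDA_CHECK is our own, not a library API):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call so a failure surfaces immediately
// with a file/line location, instead of silently corrupting results.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Example usage:
//   CUDA_CHECK(cudaMemcpyAsync(dst, src, bytes,
//                              cudaMemcpyDeviceToHost, stream));
//   kernel<<<grid, block, 0, stream>>>(...);
//   CUDA_CHECK(cudaGetLastError());   // catches launch failures
//   CUDA_CHECK(cudaStreamSynchronize(stream));
```

Checking cudaGetLastError() right after each kernel launch (including the TensorRT enqueue's stream) narrows down where things go wrong, since CUDA_LAUNCH_BLOCKING=1 changes exactly this error-reporting timing.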