cudaStreamSynchronize problem on Tesla T4


Even after cudaStreamSynchronize returns, the data copied back on the stream is still incomplete. We need to call usleep to add a delay, or set export CUDA_LAUNCH_BLOCKING=1, to get correct results.


TensorRT Version:
GPU Type: Tesla T4 on AWS
Nvidia Driver Version: 418.87.00
CUDA Version: 11.0
CUDNN Version:
Operating System + Version:
-py3 (host: aws prebuilt gpu ami)

Steps To Reproduce

We start 4 CUDA streams to do inference, issuing the following calls in order:
cudaMemcpyAsync (host to device) - stream 1
enqueue - stream 1
cudaMemcpyAsync (device to host) - stream 1
cudaMemcpyAsync (host to device) - stream 2
enqueue - stream 2
cudaMemcpyAsync (device to host) - stream 2
cudaMemcpyAsync (host to device) - stream 3
enqueue - stream 3
cudaMemcpyAsync (device to host) - stream 3
cudaMemcpyAsync (host to device) - stream 4
enqueue - stream 4
cudaMemcpyAsync (device to host) - stream 4

Then we call cudaStreamSynchronize on stream 1.
Sometimes the output from stream 1 is incorrect. If we add a usleep or set export CUDA_LAUNCH_BLOCKING=1, the result is okay.
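The sequence above can be sketched as follows. This is a minimal, hedged sketch, not our actual code: the buffer names, sizes, and the commented-out enqueueV2 call are placeholders, and it assumes one TensorRT execution context per stream.

```cuda
// Sketch of the 4-stream inference pattern described above.
// Assumptions: host buffers are pinned (cudaHostAlloc), device
// bindings are pre-allocated, and each stream has its own TensorRT
// execution context (a single IExecutionContext must not be
// enqueued on multiple streams concurrently).
#include <cuda_runtime.h>

constexpr int kStreams = 4;

void runPipelines(float* hIn[kStreams], float* hOut[kStreams],
                  void* dBindings[kStreams][2],   // [0]=input, [1]=output
                  size_t inBytes, size_t outBytes,
                  cudaStream_t streams[kStreams]
                  /*, nvinfer1::IExecutionContext* ctx[kStreams] */) {
    for (int i = 0; i < kStreams; ++i) {
        // Host-to-device copy on stream i. This is only truly
        // asynchronous when the host buffer is pinned memory.
        cudaMemcpyAsync(dBindings[i][0], hIn[i], inBytes,
                        cudaMemcpyHostToDevice, streams[i]);
        // TensorRT inference on the same stream (placeholder):
        // ctx[i]->enqueueV2(dBindings[i], streams[i], nullptr);
        cudaMemcpyAsync(hOut[i], dBindings[i][1], outBytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    // After this returns, all work queued on stream 1 (index 0),
    // including the device-to-host copy into hOut[0], should be done.
    cudaStreamSynchronize(streams[0]);
}
```

Two things worth checking against this sketch: the host buffers must be pinned for cudaMemcpyAsync to be asynchronous, and each stream needs its own execution context, since sharing one context across streams can race.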

There seems to be a race condition in cudaStreamSynchronize.

We also tried a GeForce TITAN X with driver version 430.50, the same TensorRT version, and the same code; the problem does not occur on the GeForce TITAN X.

Hi @ek9852,
For Tesla T4,
CUDA Toolkit      | Linux x86_64 Driver Version | Windows x86_64 Driver Version
CUDA 11.0.189 RC  | >= 450.36.06                | >= 451.22
Request you to try reproducing the issue after upgrading the NVIDIA driver, as the latest release is r450.
Please refer to the support matrix to check the required versions.

Please share your code in case the issue persists.


Tesla T4
We tried with NVIDIA-SMI 450.51.05, Driver Version 450.51.05, CUDA Version 11.0.

The problem still persists.
We need to set CUDA_LAUNCH_BLOCKING=1; otherwise cudaStreamSynchronize does not guarantee that the work queued on the stream is done.

With the same code base, this problem does not occur on the TITAN GPU card.