cudaStreamSynchronize problem on Tesla T4

Description

Even after cudaStreamSynchronize returns, the data copied back from the device on that stream is sometimes still incomplete. We need to call usleep to add a delay, or set export CUDA_LAUNCH_BLOCKING=1, to get correct results.

Environment

TensorRT Version: nvcr.io/nvidia/tensorrt:20.0-py3
GPU Type: Tesla T4 on AWS
Nvidia Driver Version: 418.87.00
CUDA Version: 11.0
CUDNN Version:
Operating System + Version: nvcr.io/nvidia/tensorrt:20.0-py3 (host: AWS prebuilt GPU AMI)

Steps To Reproduce

We start 4 CUDA streams to do inference:
cudaMemcpyAsync (host to device) - stream 1
enqueue - stream 1
cudaMemcpyAsync (device to host) - stream 1
cudaMemcpyAsync (host to device) - stream 2
enqueue - stream 2
cudaMemcpyAsync (device to host) - stream 2
cudaMemcpyAsync (host to device) - stream 3
enqueue - stream 3
cudaMemcpyAsync (device to host) - stream 3
cudaMemcpyAsync (host to device) - stream 4
enqueue - stream 4
cudaMemcpyAsync (device to host) - stream 4

Then we call cudaStreamSynchronize on stream 1.
Sometimes the output from stream 1 is incorrect. If we add a usleep delay or set export CUDA_LAUNCH_BLOCKING=1, the result is okay.
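The steps above can be sketched as follows. This is a minimal sketch, not our full code: buffer pointers, sizes, and the enqueue_inference callback (standing in for the real TensorRT IExecutionContext enqueue call) are placeholders.

```cpp
#include <cuda_runtime.h>

// Sketch of the reproduction: 4 independent streams, each doing
// H2D copy -> enqueue -> D2H copy, then synchronizing stream 0 only.
// enqueue_inference is a hypothetical stand-in for the TensorRT
// execution-context enqueue on the given stream.
void repro(void* h_in[4], void* d_in[4], void* d_out[4], void* h_out[4],
           size_t in_bytes, size_t out_bytes,
           void (*enqueue_inference)(int idx, cudaStream_t stream)) {
    cudaStream_t streams[4];
    for (int i = 0; i < 4; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < 4; ++i) {
        cudaMemcpyAsync(d_in[i], h_in[i], in_bytes,
                        cudaMemcpyHostToDevice, streams[i]);
        enqueue_inference(i, streams[i]);          // TensorRT enqueue
        cudaMemcpyAsync(h_out[i], d_out[i], out_bytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    // Wait for stream 0 only; h_out[0] should now be valid,
    // but on the T4 it is sometimes still incomplete.
    cudaStreamSynchronize(streams[0]);
}
```

Note that for cudaMemcpyAsync to be truly asynchronous, the host buffers must be pinned (allocated with cudaMallocHost or cudaHostAlloc); with pageable host memory the copies go through a staged path.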

There seems to be a race condition in cudaStreamSynchronize.
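One way to cross-check this (a diagnostic sketch, not a fix) is to synchronize on a cudaEvent recorded after the device-to-host copy instead of on the stream, and to check every CUDA return code, since a silently ignored error from an earlier async call can look exactly like a synchronization race:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Report any non-success return code with the failing call's name.
#define CUDA_CHECK(call)                                      \
    do {                                                      \
        cudaError_t err_ = (call);                            \
        if (err_ != cudaSuccess)                              \
            fprintf(stderr, "%s failed: %s\n", #call,         \
                    cudaGetErrorString(err_));                 \
    } while (0)

// Record an event on the stream right after the D2H memcpy,
// then block on the event rather than on cudaStreamSynchronize.
void wait_for_copy(cudaStream_t stream) {
    cudaEvent_t done;
    CUDA_CHECK(cudaEventCreate(&done));
    CUDA_CHECK(cudaEventRecord(done, stream));   // after the D2H copy
    CUDA_CHECK(cudaEventSynchronize(done));      // blocks until it completes
    CUDA_CHECK(cudaEventDestroy(done));
}
```

If the event-based wait also returns before the output is complete, that points away from cudaStreamSynchronize itself and toward the copies or the enqueue being issued on the wrong stream.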

We also tried on a GeForce TITAN X with driver version 430.50, the same TensorRT version, and the same code; the problem does not occur on the GeForce TITAN X.

Hi @ek9852,
For Tesla T4:

CUDA Toolkit     | Linux x86_64 Driver Version | Windows x86_64 Driver Version
CUDA 11.0.189 RC | >= 450.36.06                | >= 451.22
Request you to try reproducing the issue after upgrading the Nvidia driver, as the latest release is r450.
Please find the support matrix to check for the same.
https://docs.nvidia.com/deeplearning/sdk/cudnn-support-matrix/index.html
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-713/support-matrix/index.html

Please share your code in case the issue persists.

Thanks!