Description
Even after cudaStreamSynchronize returns, the data copied back from the device is sometimes still incomplete when we read it. We need to call usleep to add a delay, or run with export CUDA_LAUNCH_BLOCKING=1, before the results are correct.
Environment
TensorRT Version: nvcr.io/nvidia/tensorrt:20.0-py3
GPU Type: Tesla T4 on AWS
Nvidia Driver Version: 418.87.00
CUDA Version: 11.0
CUDNN Version:
Operating System + Version: nvcr.io/nvidia/tensorrt:20.0-py3 (host: AWS prebuilt GPU AMI)
Steps To Reproduce
We launch 4 CUDA streams to run inference concurrently:
cudaMemcpyAsync (host to device) - stream 1
enqueue - stream 1
cudaMemcpyAsync (device to host) - stream 1
cudaMemcpyAsync (host to device) - stream 2
enqueue - stream 2
cudaMemcpyAsync (device to host) - stream 2
cudaMemcpyAsync (host to device) - stream 3
enqueue - stream 3
cudaMemcpyAsync (device to host) - stream 3
cudaMemcpyAsync (host to device) - stream 4
enqueue - stream 4
cudaMemcpyAsync (device to host) - stream 4
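The per-stream pattern above can be sketched roughly as follows. Names such as ctx, hostIn, devIn, inBytes, and batchSize are placeholders, not our actual identifiers; error checking is omitted for brevity:

```cuda
// Sketch of the 4-stream pipeline described above. Assumes pinned host
// buffers (hostIn/hostOut), device buffers (devIn/devOut), and a hypothetical
// per-stream TensorRT IExecutionContext* in ctx[i].
for (int i = 0; i < 4; ++i) {
    cudaMemcpyAsync(devIn[i], hostIn[i], inBytes,
                    cudaMemcpyHostToDevice, stream[i]);
    void* bindings[] = { devIn[i], devOut[i] };
    ctx[i]->enqueue(batchSize, bindings, stream[i], nullptr);  // TensorRT inference
    cudaMemcpyAsync(hostOut[i], devOut[i], outBytes,
                    cudaMemcpyDeviceToHost, stream[i]);
}

// Wait only for stream 1 before reading its output.
cudaStreamSynchronize(stream[0]);
// hostOut[0] is sometimes stale at this point unless we add a usleep()
// or run with CUDA_LAUNCH_BLOCKING=1.
```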
Then we call cudaStreamSynchronize on stream 1.
Sometimes the output from stream 1 is incorrect. If we add a usleep or run with export CUDA_LAUNCH_BLOCKING=1, the result is correct.
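For reference, the workaround we currently use is just the following (./infer_app stands in for our inference binary):

```shell
# Forces every CUDA launch to block until it completes. This hides the
# problem, but also defeats the point of using multiple async streams.
export CUDA_LAUNCH_BLOCKING=1
./infer_app   # hypothetical name for our inference program
```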
There seems to be a race condition in cudaStreamSynchronize.
We also tried a GeForce TITAN X with driver version 430.50, the same TensorRT version, and the same code; the problem does not occur there.