`cudaStreamAddCallback` seems to stall the CUDA stream for a long time on DRIVE OS 6.0.9.0

Hi,

Recently I found that the `cudaStreamAddCallback` API blocks the CUDA stream for more than 100 ms on DRIVE OS 6.0.9.0. This does not happen on 6.0.5.1 or on desktop. Here is the full reproducing code: cuda_test.txt (4.3 KB)

The key steps of this sample code are `cudaMemcpyAsync` and `cudaStreamAddCallback`. We add `cudaEventRecord` calls to measure the time spent between these steps:

cudaEventRecord(events[0], stream);
cudaMemcpyAsync(d_src, h_src, byte_size, cudaMemcpyHostToDevice, stream);
cudaEventRecord(events[1], stream);
cudaStreamAddCallback(stream, StreamCallback, nullptr, 0);
cudaEventRecord(events[2], stream);
cudaMemcpyAsync(d_dst, d_src, byte_size, cudaMemcpyDeviceToDevice, stream);
cudaEventRecord(events[3], stream);
cudaMemcpyAsync(h_dst, d_dst, byte_size, cudaMemcpyDeviceToHost, stream);
cudaEventRecord(events[4], stream);
cudaEventSynchronize(events[4]);
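
For readers who cannot open the attachment, here is a minimal sketch of how the per-segment GPU durations could be measured around this sequence. It is not the attached cuda_test.txt: the `CHECK_CUDA` macro, the 1 MiB buffer size, the pinned host buffers, and the output format are all assumptions made for illustration. Only the order of the stream operations is taken from the snippet above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): abort on any CUDA error.
#define CHECK_CUDA(call)                                          \
  do {                                                            \
    cudaError_t err_ = (call);                                    \
    if (err_ != cudaSuccess) {                                    \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                   cudaGetErrorString(err_));                     \
      std::exit(EXIT_FAILURE);                                    \
    }                                                             \
  } while (0)

// Empty host callback with the signature cudaStreamAddCallback expects.
static void CUDART_CB StreamCallback(cudaStream_t, cudaError_t, void*) {}

int main() {
  const size_t byte_size = 1 << 20;  // assumed size; the post does not state it
  void *h_src = nullptr, *h_dst = nullptr, *d_src = nullptr, *d_dst = nullptr;

  cudaStream_t stream;
  CHECK_CUDA(cudaStreamCreate(&stream));
  CHECK_CUDA(cudaMallocHost(&h_src, byte_size));  // pinned host memory (assumed)
  CHECK_CUDA(cudaMallocHost(&h_dst, byte_size));
  CHECK_CUDA(cudaMalloc(&d_src, byte_size));
  CHECK_CUDA(cudaMalloc(&d_dst, byte_size));
  std::memset(h_src, 0, byte_size);

  cudaEvent_t events[5];
  for (auto& e : events) CHECK_CUDA(cudaEventCreate(&e));

  // Same order of operations as in the post.
  CHECK_CUDA(cudaEventRecord(events[0], stream));
  CHECK_CUDA(cudaMemcpyAsync(d_src, h_src, byte_size, cudaMemcpyHostToDevice, stream));
  CHECK_CUDA(cudaEventRecord(events[1], stream));
  CHECK_CUDA(cudaStreamAddCallback(stream, StreamCallback, nullptr, 0));
  CHECK_CUDA(cudaEventRecord(events[2], stream));
  CHECK_CUDA(cudaMemcpyAsync(d_dst, d_src, byte_size, cudaMemcpyDeviceToDevice, stream));
  CHECK_CUDA(cudaEventRecord(events[3], stream));
  CHECK_CUDA(cudaMemcpyAsync(h_dst, d_dst, byte_size, cudaMemcpyDeviceToHost, stream));
  CHECK_CUDA(cudaEventRecord(events[4], stream));
  CHECK_CUDA(cudaEventSynchronize(events[4]));

  // Elapsed GPU time between consecutive events:
  // [0] H2D copy, [1] callback, [2] D2D copy, [3] D2H copy.
  float seg[4];
  float total = 0.0f;
  for (int i = 0; i < 4; ++i) {
    CHECK_CUDA(cudaEventElapsedTime(&seg[i], events[i], events[i + 1]));
    total += seg[i];
  }
  std::printf("GPU durations: %f ms = %f + %f + %f + %f\n",
              total, seg[0], seg[1], seg[2], seg[3]);

  // Release all resources.
  for (auto& e : events) CHECK_CUDA(cudaEventDestroy(e));
  CHECK_CUDA(cudaFree(d_src));
  CHECK_CUDA(cudaFree(d_dst));
  CHECK_CUDA(cudaFreeHost(h_src));
  CHECK_CUDA(cudaFreeHost(h_dst));
  CHECK_CUDA(cudaStreamDestroy(stream));
  return 0;
}
```

With this layout, the second printed term is the interval that contains the callback, so a stall introduced by `cudaStreamAddCallback` would be expected to show up there.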

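We found that on 6.0.9.0 a single iteration takes more than 100 ms, while on 6.0.5.1 it takes only a few milliseconds. It seems that `cudaStreamAddCallback` stalls the whole stream even when the callback is an empty function. In the output below, the second term, which appears to correspond to the interval between `events[1]` and `events[2]`, accounts for almost all of the time: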

GPU durations: 100.371582 ms = 0.121728 + 100.079201 + 0.053152 + 0.117504
CPU duration: 100.383584 ms
GPU durations: 100.407608 ms = 0.121120 + 100.103424 + 0.061312 + 0.121760
CPU duration: 100.418752 ms
GPU durations: 100.423546 ms = 0.121600 + 100.133469 + 0.053728 + 0.114752
CPU duration: 100.434720 ms
GPU durations: 100.367050 ms = 0.121312 + 100.075424 + 0.052128 + 0.118176
CPU duration: 100.377760 ms
GPU durations: 100.363937 ms = 0.120928 + 100.070755 + 0.057632 + 0.114624
CPU duration: 100.374752 ms
GPU durations: 100.357857 ms = 0.121152 + 100.065025 + 0.055552 + 0.116128

Here are the Nsight reports for exactly the same code on the 6.0.5.1 and 6.0.9.0 systems with CUDA 11.4:
cuda_callback_report.tar.gz (9.0 MB)

Please help take a look at this issue. Thanks in advance!

Best regards

Zhang

This forum is exclusively for developers who are part of the NVIDIA DRIVE® AGX SDK Developer Program. To post in the forum, please use an account associated with your corporate or university email address.
This helps us ensure that the forum remains a platform for verified members of the developer program.

Also, note that DRIVE OS 6.0.9 is not a devzone release. Please contact your NVIDIA representative for further guidance on support for DRIVE OS 6.0.9 issues.
Thanks.
