Hi,
Recently I found that the cudaStreamAddCallback API blocks the CUDA stream for more than 100 ms on DRIVE OS 6.0.9.0, which doesn't happen on 6.0.5.1 or on desktop. Here is the full reproducing code: cuda_test.txt (4.3 KB)
The key steps of this sample code are cudaMemcpyAsync and cudaStreamAddCallback. We add cudaEventRecord calls to measure the time cost between these steps:
(cudaEventRecord(events[0], stream));
(cudaMemcpyAsync(d_src, h_src, byte_size, cudaMemcpyHostToDevice, stream));
(cudaEventRecord(events[1], stream));
(cudaStreamAddCallback(stream, StreamCallback, nullptr, 0));
(cudaEventRecord(events[2], stream));
(cudaMemcpyAsync(d_dst, d_src, byte_size, cudaMemcpyDeviceToDevice, stream));
(cudaEventRecord(events[3], stream));
(cudaMemcpyAsync(h_dst, d_dst, byte_size, cudaMemcpyDeviceToHost, stream));
(cudaEventRecord(events[4], stream));
(cudaEventSynchronize(events[4]));
We found that on 6.0.9.0 a single iteration takes more than 100 ms, while on 6.0.5.1 it takes only a few ms. It seems that cudaStreamAddCallback stalls the whole stream even if the callback is an empty function:
GPU durations: 100.371582 ms = 0.121728 + 100.079201 + 0.053152 + 0.117504
CPU duration: 100.383584 ms
GPU durations: 100.407608 ms = 0.121120 + 100.103424 + 0.061312 + 0.121760
CPU duration: 100.418752 ms
GPU durations: 100.423546 ms = 0.121600 + 100.133469 + 0.053728 + 0.114752
CPU duration: 100.434720 ms
GPU durations: 100.367050 ms = 0.121312 + 100.075424 + 0.052128 + 0.118176
CPU duration: 100.377760 ms
GPU durations: 100.363937 ms = 0.120928 + 100.070755 + 0.057632 + 0.114624
CPU duration: 100.374752 ms
GPU durations: 100.357857 ms = 0.121152 + 100.065025 + 0.055552 + 0.116128
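To make the log easier to read, here is a small host-only sketch that finds which of the four intervals dominates. It assumes the four terms in each "GPU durations" line are the elapsed times between consecutive events (events[0] to events[4]), as the snippet above suggests. The names kIntervalNames, DominantInterval and ShareOfTotal are hypothetical helpers for illustration and are not part of cuda_test.txt.

```cpp
#include <array>
#include <cstddef>
#include <numeric>

// Hypothetical labels for the four intervals between the five recorded events
// (assumption: term i is the elapsed time between events[i] and events[i + 1]).
constexpr std::array<const char*, 4> kIntervalNames = {
    "H2D memcpy", "stream callback", "D2D memcpy", "D2H memcpy"};

// Returns the index of the largest interval, in milliseconds.
std::size_t DominantInterval(const std::array<double, 4>& ms) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < ms.size(); ++i) {
        if (ms[i] > ms[best]) best = i;
    }
    return best;
}

// Returns the fraction of the summed interval time spent in interval i.
double ShareOfTotal(const std::array<double, 4>& ms, std::size_t i) {
    const double total = std::accumulate(ms.begin(), ms.end(), 0.0);
    return total > 0.0 ? ms[i] / total : 0.0;
}
```

For the first log line, DominantInterval({0.121728, 100.079201, 0.053152, 0.117504}) returns 1, which is the interval between events[1] and events[2] that contains cudaStreamAddCallback, and that interval accounts for more than 99% of the iteration.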
Here are the Nsight reports for exactly the same code on the 6.0.5.1 and 6.0.9.0 systems with CUDA 11.4:
cuda_callback_report.tar.gz (9.0 MB)
Please help take a look at this issue, thanks in advance!
Best regards
Zhang