Unknown concurrency issue in CUDA/C++ program

Hi all,

I'm having trouble debugging an issue with running several CUDA streams concurrently. Basically, my C++/CUDA code works like this:

3 MemcpyH2DAsync → 6 MemcpyD2DAsync → multiple kernels execute → 3 MemcpyD2HAsync → Verify data on CPU

I want this process to run concurrently: once every second D2D transfer completes (D2D meaning a copy between two locations on the same GPU), a stream of kernels should start executing while the rest of the data is still being transferred. A profile of this is shown below.

The problem is that the last stream consistently fails verification unless I slow the process down (with either additional events or a sleep call). I had someone else check my work, and they confirmed that all the memory buffers are sized appropriately and that the synchronization with events looks correct. I suspect the problem is related to the drops in kernel execution seen in the profile, since they seem to happen whenever a D2D transfer is in flight. Also, the verification numbers are off by enough to suggest a memory issue. Any thoughts on what could be causing this behavior? Many thanks in advance.

System Info:

GPU: NVIDIA RTX A6000

CPU: AMD Ryzen Threadripper PRO 3955WX 16-Cores

OS: RHEL 8.4

This sounds like improper synchronization between the transfer of input data and the kernel that consumes that data, or possibly between the kernel and the transfer of its output data back to the host. The input side can be verified independently: embed a sequence number in the data and have the kernel check for the expected sequence number before consuming it. Given that it only happens on the last stream, this also sounds like a classic off-by-one error.
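To make the sequence-number idea concrete, here is a minimal sketch. It is not your code: the buffer layout (tag in element 0), the names `consume`, `count_stale_buffers`, `copy_stream`, `compute_stream` and `copy_done`, and the single H2D/D2D pair are all assumptions made for illustration. The copy stream records an event after the D2D copy, the compute stream waits on that event, and the kernel bumps a counter if the tag it sees is not the one it expects.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing function with -1 on any CUDA error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return -1;                                                \
        }                                                             \
    } while (0)

// Assumed layout: buf[0] holds a sequence number written by the producer,
// the payload follows. One thread validates the tag before the data is used.
__global__ void consume(const int* buf, int expected_seq, int* mismatch_count)
{
    if (blockIdx.x == 0 && threadIdx.x == 0 && buf[0] != expected_seq) {
        atomicAdd(mismatch_count, 1);  // stale or not-yet-written buffer
    }
    // ... the real work on the payload would go here ...
}

// Returns how many tag mismatches the kernel saw, or -1 on a CUDA error.
int count_stale_buffers(int written_seq, int expected_seq)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    int *h_src = nullptr, *d_staging = nullptr, *d_work = nullptr;
    int *d_mismatch = nullptr;
    cudaStream_t copy_stream, compute_stream;
    cudaEvent_t copy_done;

    CHECK(cudaMallocHost(&h_src, bytes));   // pinned, so H2D is truly async
    CHECK(cudaMalloc(&d_staging, bytes));
    CHECK(cudaMalloc(&d_work, bytes));
    CHECK(cudaMalloc(&d_mismatch, sizeof(int)));
    CHECK(cudaStreamCreate(&copy_stream));
    CHECK(cudaStreamCreate(&compute_stream));
    CHECK(cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming));

    for (int i = 0; i < n; ++i) h_src[i] = 0;
    h_src[0] = written_seq;                 // producer stamps the buffer
    CHECK(cudaMemset(d_mismatch, 0, sizeof(int)));
    CHECK(cudaDeviceSynchronize());         // setup finished before streaming

    // Copy stream: H2D into staging, then D2D into the working buffer.
    CHECK(cudaMemcpyAsync(d_staging, h_src, bytes,
                          cudaMemcpyHostToDevice, copy_stream));
    CHECK(cudaMemcpyAsync(d_work, d_staging, bytes,
                          cudaMemcpyDeviceToDevice, copy_stream));
    CHECK(cudaEventRecord(copy_done, copy_stream));

    // Compute stream: must not start until the D2D copy has completed.
    CHECK(cudaStreamWaitEvent(compute_stream, copy_done, 0));
    consume<<<1, 32, 0, compute_stream>>>(d_work, expected_seq, d_mismatch);
    CHECK(cudaGetLastError());

    int h_mismatch = -1;
    CHECK(cudaMemcpyAsync(&h_mismatch, d_mismatch, sizeof(int),
                          cudaMemcpyDeviceToHost, compute_stream));
    CHECK(cudaStreamSynchronize(compute_stream));

    CHECK(cudaEventDestroy(copy_done));
    CHECK(cudaStreamDestroy(copy_stream));
    CHECK(cudaStreamDestroy(compute_stream));
    CHECK(cudaFree(d_staging));
    CHECK(cudaFree(d_work));
    CHECK(cudaFree(d_mismatch));
    CHECK(cudaFreeHost(h_src));
    return h_mismatch;
}
```

Called as `count_stale_buffers(7, 7)` this should return 0. Passing different values, e.g. `count_stale_buffers(6, 7)`, mimics a kernel that runs against a buffer from an earlier iteration and should return 1. In a real pipeline you would bump the sequence number on every iteration and every stream; a nonzero count on only the last stream would point at a missing wait or an indexing error for that stream rather than at the hardware.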