I am having trouble debugging an issue with running multiple CUDA streams concurrently. My C++/CUDA code works like this:
3 MemcpyH2DAsync → 6 MemcpyD2DAsync → multiple kernels execute → 3 MemcpyD2HAsync → Verify data on CPU
I want this process to run concurrently: as soon as each pair of D2D transfers completes (that is, after every second D2D copy, where D2D means copying data to a different location on the same GPU), a stream of kernels should start executing while the rest of the data is still being transferred. A profile of this is shown below.
The problem is that the last stream consistently fails verification unless I slow the process down (with either events or a sleep call). I had someone else check my work, and they confirmed that all the memory buffers are sized appropriately and synchronized with events. I suspect the problem may be related to the drops in kernel execution visible in the profile below, since they seem to occur whenever a D2D transfer is in progress. The verification numbers are also off by enough to suggest a memory issue rather than rounding error. Any thoughts on what could be causing this behavior? Many thanks in advance.
GPU: NVIDIA RTX A6000
CPU: AMD Ryzen Threadripper PRO 3955WX 16-Cores
OS: RHEL 8.4
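For reference, the ordering I am trying to express can be sketched roughly as below. This is a minimal, simplified sketch and not my actual code: the names `sum_pair`, `run_pipeline_and_count_mismatches`, `copy_stream`, `pair_ready`, the buffer names, and the sizes are all hypothetical, and it assumes one copy stream, one compute stream per data set, and pinned host memory. It only uses standard CUDA runtime calls (`cudaMemcpyAsync`, `cudaEventRecord`, `cudaStreamWaitEvent`, `cudaStreamSynchronize`).

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                         cudaGetErrorString(err_));                       \
            std::exit(1);                                                 \
        }                                                                 \
    } while (0)

constexpr int kStreams = 3;     // hypothetical: one data set per compute stream
constexpr int kN = 1 << 20;     // hypothetical element count per data set

// Stand-in for the real kernels: consumes the two D2D destinations.
__global__ void sum_pair(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// Returns the number of elements that fail the CPU check.
int run_pipeline_and_count_mismatches() {
    const size_t bytes = kN * sizeof(float);
    cudaStream_t copy_stream, compute[kStreams];
    cudaEvent_t pair_ready[kStreams];
    float *h_in[kStreams], *h_out[kStreams];
    float *d_stage[kStreams], *d_a[kStreams], *d_b[kStreams], *d_out[kStreams];

    CUDA_CHECK(cudaStreamCreate(&copy_stream));
    for (int s = 0; s < kStreams; ++s) {
        CUDA_CHECK(cudaStreamCreate(&compute[s]));
        CUDA_CHECK(cudaEventCreateWithFlags(&pair_ready[s], cudaEventDisableTiming));
        // Pinned host buffers, so the async copies can overlap with kernels.
        CUDA_CHECK(cudaMallocHost((void**)&h_in[s], bytes));
        CUDA_CHECK(cudaMallocHost((void**)&h_out[s], bytes));
        CUDA_CHECK(cudaMalloc((void**)&d_stage[s], bytes));
        CUDA_CHECK(cudaMalloc((void**)&d_a[s], bytes));
        CUDA_CHECK(cudaMalloc((void**)&d_b[s], bytes));
        CUDA_CHECK(cudaMalloc((void**)&d_out[s], bytes));
        for (int i = 0; i < kN; ++i) h_in[s][i] = float(s + 1);
    }

    // 3 H2D copies, all queued on the copy stream.
    for (int s = 0; s < kStreams; ++s)
        CUDA_CHECK(cudaMemcpyAsync(d_stage[s], h_in[s], bytes,
                                   cudaMemcpyHostToDevice, copy_stream));

    const int block = 256, grid = (kN + block - 1) / block;
    for (int s = 0; s < kStreams; ++s) {
        // 2 D2D copies per data set (6 in total), then mark the pair as ready.
        CUDA_CHECK(cudaMemcpyAsync(d_a[s], d_stage[s], bytes,
                                   cudaMemcpyDeviceToDevice, copy_stream));
        CUDA_CHECK(cudaMemcpyAsync(d_b[s], d_stage[s], bytes,
                                   cudaMemcpyDeviceToDevice, copy_stream));
        CUDA_CHECK(cudaEventRecord(pair_ready[s], copy_stream));

        // The compute stream waits only for its own pair, not for later copies.
        CUDA_CHECK(cudaStreamWaitEvent(compute[s], pair_ready[s], 0));
        sum_pair<<<grid, block, 0, compute[s]>>>(d_a[s], d_b[s], d_out[s], kN);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(h_out[s], d_out[s], bytes,
                                   cudaMemcpyDeviceToHost, compute[s]));
    }

    // Verify on the CPU only after each compute stream (including D2H) is done.
    int mismatches = 0;
    for (int s = 0; s < kStreams; ++s) {
        CUDA_CHECK(cudaStreamSynchronize(compute[s]));
        for (int i = 0; i < kN; ++i)
            if (h_out[s][i] != 2.0f * float(s + 1)) ++mismatches;
    }

    for (int s = 0; s < kStreams; ++s) {
        CUDA_CHECK(cudaFreeHost(h_in[s]));  CUDA_CHECK(cudaFreeHost(h_out[s]));
        CUDA_CHECK(cudaFree(d_stage[s]));   CUDA_CHECK(cudaFree(d_a[s]));
        CUDA_CHECK(cudaFree(d_b[s]));       CUDA_CHECK(cudaFree(d_out[s]));
        CUDA_CHECK(cudaEventDestroy(pair_ready[s]));
        CUDA_CHECK(cudaStreamDestroy(compute[s]));
    }
    CUDA_CHECK(cudaStreamDestroy(copy_stream));
    return mismatches;
}
```

Calling `run_pipeline_and_count_mismatches()` from `main` and printing the result is enough to try it. In this sketch every buffer is private to one data set and the CPU check runs only after `cudaStreamSynchronize` on the matching compute stream; my real code differs in the kernels and buffer layout.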