I am trying to overlap a halo/ghost-zone sharing sequence using (CUDA-aware) MPI with a computation kernel. I run the compute kernel on its own stream (call it stream1) and follow with calls to
- a device to device cudaMemcpyAsync (on a new stream, stream2)
- a kernel which packs all data to be shared over MPI into a buffer (stream2)
- MPISendrecv
- an unpack kernel (on stream2)
This cycle is more or less repeated twice per round of compute+halo share. The kernel takes .9 ms in my tests whereas the full halo sharing takes .6ms (with all needed synchronization), so I’m trying to fully hide the memory transfer, etc. The compute+halo sharing is repeated O(5-10) times, so I perform halo sharing for the next round’s compute kernel asynchronously with the current compute kernel. Hence, I synchronize the entire device before launching the next compute kernel.
If I implement no synchronization within the halo share function, the runtime is slightly longer than .9ms - so all of 1-4 are succesfully hidden. But of course, since the call to MPISendrecv needs the (device-side) data in the buffer set by the packing kernel, I need to block the host until after the pack kernel finishes so that the MPISendrecv doesn’t start immediately after the kernel launches (but before it completes). I attempted to implement this with a cudaEvent_t pack_done via
// halo packing kernel called here on stream2
cudaEventRecord(pack_done, stream2);
cudaEventSynchronize(pack_done);
// MPISendrecv called here
My understanding is that this would block the host thread from calling MPISendrecv until the kernel is done, since the calls on stream2 to the pack kernel and cudaEventRecord are synchronous (right?). At the same time, the compute kernel on stream1 would be uninterrupted. This doesn’t seem to be the case, as when I add the above event/sync calls as needed the runtime jumps to 1.25ms.
Is this the expected behavior? Isn’t it possible to overlap cudaMemcpyAsync+kernel+MPISendrecv with another kernel? (I realize the visual profiler would be helpful here but my guess is that the issue is I’m misunderstanding something fundamental about streams/events/asynchronous stuff/etc. I would like to avoid getting the visual profiler to work with a remote machine and whatnot.)