Asynchronous MPISencrecv+D2D copy+kernel with another kernel

I am trying to overlap a halo/ghost-zone sharing sequence using (CUDA-aware) MPI with a computation kernel. I run the compute kernel on its own stream (call it stream1) and follow with calls to

  1. a device to device cudaMemcpyAsync (on a new stream, stream2)
  2. a kernel which packs all data to be shared over MPI into a buffer (stream2)
  3. MPISendrecv
  4. an unpack kernel (on stream2)

This cycle is more or less repeated twice per round of compute+halo share. The kernel takes .9 ms in my tests whereas the full halo sharing takes .6ms (with all needed synchronization), so I’m trying to fully hide the memory transfer, etc. The compute+halo sharing is repeated O(5-10) times, so I perform halo sharing for the next round’s compute kernel asynchronously with the current compute kernel. Hence, I synchronize the entire device before launching the next compute kernel.

If I implement no synchronization within the halo share function, the runtime is slightly longer than .9ms - so all of 1-4 are succesfully hidden. But of course, since the call to MPISendrecv needs the (device-side) data in the buffer set by the packing kernel, I need to block the host until after the pack kernel finishes so that the MPISendrecv doesn’t start immediately after the kernel launches (but before it completes). I attempted to implement this with a cudaEvent_t pack_done via

// halo packing kernel called here on stream2
cudaEventRecord(pack_done, stream2);
// MPISendrecv called here

My understanding is that this would block the host thread from calling MPISendrecv until the kernel is done, since the calls on stream2 to the pack kernel and cudaEventRecord are synchronous (right?). At the same time, the compute kernel on stream1 would be uninterrupted. This doesn’t seem to be the case, as when I add the above event/sync calls as needed the runtime jumps to 1.25ms.

Is this the expected behavior? Isn’t it possible to overlap cudaMemcpyAsync+kernel+MPISendrecv with another kernel? (I realize the visual profiler would be helpful here but my guess is that the issue is I’m misunderstanding something fundamental about streams/events/asynchronous stuff/etc. I would like to avoid getting the visual profiler to work with a remote machine and whatnot.)

I attempted two things:

  1. Setting the compute stream (stream1) with the flag cudaStreamNonBlocking to be non-blocking - didn’t make a difference.
  2. Setting the copy stream (stream2) with priority -1 - this got the runtime down to 1.01ms. Seems this worked, but I’m not sure I understand why (and hence am not convinced this is the proper solution).

This may be of interest:

I believe it should be possible to coordinate the concurrency with cudaStreamSynchronize() call(s).