Multi-GPU code has large overheads

Dear all,

I am developing a multi-GPU code for some physics applications. Basically, I am working on the conjugate gradient method. Two GPUs split the vector, apply the matrix to their own parts and exchange boundaries in order to proceed. In my application each MPI process controls one GPU. Currently I have access to Tesla V100-SXM2-16GB GPUs residing on the same cluster node.

Each MPI process has four CUDA streams: a bulk stream, two halo streams and the default stream. This is needed in order to overlap computation with the halo memory transfers.

While the memory transfers work perfectly well, the main problem is the scalar product and the need to MPI_Allreduce (share and sum) the result with the other processes. NVVP shows huge overheads before and after the Allreduce (see picture). The one after the Allreduce is the launch of the next kernel. Before the Allreduce, there is a stretch that is simply blank in NVVP.

The NVVP screenshot is at https://yadi.sk/i/OCnveeYOkkmVXw. The Allreduce problems are near the right edge of the screenshot.

I wonder:

  1. Can I somehow overlap the kernel launch configuration with MPI_Allreduce? Why does the kernel launch configuration start only after the Allreduce is done?
  2. What could that blank space before MPI_Allreduce be?

Thanks a lot in advance for any help!
Please ask for any additional information if needed.

P.S. Basically, I understand that MPI_Allreduce blocks the host from further execution, so the host cannot configure the following kernel launch. However, I really need this Allreduce to finish before the next launch (for algorithm correctness). This raises another question: can I use MPI_Iallreduce (non-blocking)? If I do so, how can I arrange for the host to configure the kernel first and only then block (MPI_Wait or MPI_Waitall) to wait for the Iallreduce to finish?
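
To make the P.S. concrete, here is a minimal sketch of the ordering I have in mind. It is not my actual code: the kernels `axpy` and `scale`, the vector size and the placeholder value of `local_dot` are all hypothetical, and I assume the local scalar product is already available on the host. As far as I understand, a kernel that takes the reduced scalar as a by-value argument cannot be launched before the wait, so the only launches that can be issued while the Iallreduce is in flight are those that do not depend on the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <mpi.h>

// Hypothetical kernel that DEPENDS on the globally reduced scalar alpha.
__global__ void axpy(double *y, const double *x, double alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += alpha * x[i];
}

// Hypothetical kernel that does NOT depend on the reduced scalar.
__global__ void scale(double *z, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] *= s;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, ndev = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) {
        fprintf(stderr, "no CUDA device found\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    cudaSetDevice(rank % ndev);  // one GPU per MPI process

    const int n = 1 << 20;       // assumed local vector length
    const int block = 256;
    const int grid = (n + block - 1) / block;
    double *x, *y, *z;
    cudaMalloc(&x, n * sizeof(double));
    cudaMalloc(&y, n * sizeof(double));
    cudaMalloc(&z, n * sizeof(double));
    cudaMemset(x, 0, n * sizeof(double));
    cudaMemset(y, 0, n * sizeof(double));
    cudaMemset(z, 0, n * sizeof(double));

    cudaStream_t bulk;
    cudaStreamCreate(&bulk);

    double local_dot = 1.0;   // placeholder for the local scalar product
    double global_dot = 0.0;
    MPI_Request req;

    // 1. Start the reduction without blocking the host.
    MPI_Iallreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    // 2. While the reduction is in flight, enqueue work that does not need it.
    scale<<<grid, block, 0, bulk>>>(z, 0.5, n);

    // 3. Block only now; global_dot is valid after this call returns.
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    // 4. Launch the dependent kernel; alpha is copied by value at launch time.
    axpy<<<grid, block, 0, bulk>>>(y, x, global_dot, n);

    cudaStreamSynchronize(bulk);
    if (rank == 0) printf("global_dot = %f\n", global_dot);

    cudaStreamDestroy(bulk);
    cudaFree(x);
    cudaFree(y);
    cudaFree(z);
    MPI_Finalize();
    return 0;
}
```

Whether the reduction actually makes progress in the background during step 2 probably depends on the MPI implementation, so I am not sure this ordering alone removes the gap after the Allreduce.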