Overlapping computation with MPI communication

Hi there, I have 2 MPI processes and 2 GPUs (one MPI task per GPU), and I am trying to overlap a kernel (launched in a “compute_stream”) with an MPI communication (MPI_Isend) that uses CUDA-aware MPI (OpenMPI specifically). However, nvvp shows no overlap between the two (see the attached figure). As you can see, there are other kernels in a “stream_exchange” and they do overlap, but the MPI_Waitall seems to have to wait for both streams to finish executing. My question is: is it possible to get the overlap, and if so, how?
Many thanks,
Jony

MPI_overalapping.png