In my code, I’ve been using a CUDA pinned memory buffer (allocated with cudaMallocHost) as the send buffer for an MPI_Isend operation, and another pinned buffer as the receive buffer for an MPI_Irecv operation. Before the MPI_Isend starts, the data to be sent to the MPI peer is copied from GPU memory into the pinned buffer with cudaMemcpyAsync, followed by cudaStreamSynchronize just before the MPI_Isend call; the buffer is then not touched again until MPI_Wait is called later to confirm that the MPI_Isend has completed. The same holds for MPI_Irecv: at any given time, each buffer is used either by MPI or by CUDA (when copying received data into GPU memory), never by both. Nevertheless, it turned out that the DMA transfers performed by MPI and by CUDA still sometimes interfere with each other. I tried several MPI libraries: Intel MPI, and MVAPICH (compiled without CUDA support); the network fabric on this cluster is InfiniBand. I was able to work around the problem by introducing additional buffers (allocated with plain malloc) as the send/receive buffers for the MPI operations, and doing a memcpy between these buffers and the pinned memory buffers; but I still do not understand why the problem occurred in the first place. The information I was able to find on the net about this type of problem is rather confusing, as things have changed along the way with GPUDirect v1/v2/v3 and with CUDA and MPI library updates. So: any pointers to help understand the current state of affairs in this regard?
In the meantime, I re-read the “An Introduction to CUDA-Aware MPI” article by Jiri Kraus (available here), then watched his corresponding GTC 2013 presentation (available here), and also read various bits discussing GPUDirect. It does seem that if a given MPI library is not CUDA-aware, the only safe way to transfer data between GPUs on different nodes of a cluster is to first do a CUDA memcpy into cudaMallocHost-allocated pinned memory, then copy from the pinned memory into a regular memory buffer (allocated with malloc/new), and only then call MPI_Send (which may in turn copy from the regular buffer into another pinned buffer, this time controlled by the network driver); and, of course, all this in reverse on the receiving side. The bug in my code, caused by failing to use the intermediate regular memory buffer, was very hard to debug; and, on the other hand, even with some years of CUDA programming experience, I was completely unaware that CUDA pinned memory must not be passed as an argument to MPI_Send/MPI_Recv. So maybe some warnings should be added to the cudaMallocHost function documentation, and to other relevant parts of the CUDA docs.
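For reference, here is a minimal sketch of the staged transfer path described above, with my malloc workaround. This is not the code from my actual application, just an illustration of the pattern; it assumes a working MPI + CUDA environment with at least two ranks (rank 0 sends to rank 1), and error checking is omitted for brevity:

```c
/* Staged GPU-to-GPU transfer over a non-CUDA-aware MPI:
 * GPU memory -> pinned host buffer -> regular host buffer -> MPI.
 * Hypothetical minimal example, not production code. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *d_buf, *pinned, *staging;
    cudaMalloc((void **)&d_buf, N * sizeof(float));
    cudaMallocHost((void **)&pinned, N * sizeof(float)); /* pinned memory */
    staging = malloc(N * sizeof(float));  /* regular memory - the workaround */

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    MPI_Request req;
    if (rank == 0) {
        /* GPU -> pinned host buffer; synchronize so the copy has finished
         * before MPI touches any host memory. */
        cudaMemcpyAsync(pinned, d_buf, N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        /* Stage through regular memory instead of passing the pinned
         * buffer to MPI directly (which is what broke in my code). */
        memcpy(staging, pinned, N * sizeof(float));
        MPI_Isend(staging, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        /* Reverse path: MPI -> regular buffer -> pinned buffer -> GPU. */
        MPI_Irecv(staging, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        memcpy(pinned, staging, N * sizeof(float));
        cudaMemcpyAsync(d_buf, pinned, N * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);
    }

    free(staging);
    cudaFreeHost(pinned);
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    MPI_Finalize();
    return 0;
}
```

With a CUDA-aware MPI, the `staging` buffer (and, with GPUDirect, even the `pinned` one) would be unnecessary; the point here is only to show the extra copy that made things work for me.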
This is not true, you can call MPI_Send/Receive on pinned memory. I do this routinely on large clusters.
Take a look at http://cudamusing.blogspot.com/2011/08/cuda-mpi-and-infiniband.html for some history and explanations.
(As mentioned in my first message, in my case it’s actually about MPI_Isend/MPI_Irecv instead of MPI_Send/MPI_Recv, but I guess this doesn’t make any difference.)
I read the post you mentioned above, but it doesn’t clarify things further for me. It basically also says that memory allocated by cudaMallocHost is not to be passed to MPI_Send unless RDMA is disabled or, alternatively, the network stack is made CUDA-aware. It mentions the CUDA_NIC_INTEROP environment variable as a possible fix on the CUDA side; however, this variable was later deprecated, as mentioned for example in the CUDA 4.1 release notes (see here). From those same release notes, it seems that having CUDA >= 4.1 and kernel >= 2.6.18 should be enough to be able to pass cudaMallocHost-allocated pinned memory to third-party drivers, and thus to MPI_Send/MPI_Recv routines.
However, as mentioned in my first message, I found that in this particular case (it’s actually the Stampede machine at TACC; the CUDA version is 5.0 and the kernel version is 2.6.32, but the MVAPICH MPI installation is not CUDA-aware, and they’re not using Mellanox OFED), the arrangement I described in my first post just won’t work. The example provided in the blog post you mentioned won’t work either: if the host memory allocation is changed from malloc to cudaMallocHost, it segfaults at the end of execution. So it seems to me that newer CUDA and kernel versions are still not enough to be able to pass CUDA pinned memory to MPI send/receive routines, and that some level of support is also required from the network fabric driver and/or the MPI library?