CUDA-aware MPI

Hello All,

I am trying to compile an MPI+CUDA code for our new Kepler GPUs, and there is a lot of data that we need to transfer between MPI ranks. From the articles on the NVIDIA website I understand that GPU-to-GPU transfers can be done with MPI_Send and MPI_Recv calls when the MPI library is CUDA-aware. But what about collectives like MPI_Allreduce and MPI_Allgather? Do they work as expected when given GPU buffers on the individual nodes, or do we still need to copy the data to host memory first and then call MPI_Allgather or MPI_Allreduce? A small sketch of what I would like to write is below.
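For concreteness, here is a minimal sketch of the kind of call I mean, assuming an MPI build with CUDA-aware support (the buffer size, datatype, and MPI_SUM reduction are just placeholders for illustration):

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    double *d_local, *d_sum;
    cudaMalloc((void **)&d_local, n * sizeof(double));
    cudaMalloc((void **)&d_sum,   n * sizeof(double));
    cudaMemset(d_local, 0, n * sizeof(double)); /* in the real code a kernel fills this */

    /* Device pointers passed straight to the collective. A CUDA-aware
       MPI would have to handle the staging internally; a host-only MPI
       would dereference these device pointers and crash. */
    MPI_Allreduce(d_local, d_sum, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    double h_check;
    cudaMemcpy(&h_check, d_sum, sizeof(double), cudaMemcpyDeviceToHost);
    if (rank == 0)
        printf("sum[0] = %f\n", h_check);

    cudaFree(d_local);
    cudaFree(d_sum);
    MPI_Finalize();
    return 0;
}
```

If collectives are not supported on device buffers, I assume the fallback is to cudaMemcpy into a host buffer, call the collective there, and copy the result back to the device?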

Thanks,

Walter