MVAPICH2 1.8a1 is released with CUDA support

The MVAPICH2 1.8a1 release is aimed at MVAPICH2 users who want high performance on InfiniBand (Mellanox) clusters with NVIDIA GPU adapters and CUDA support. The OMB 3.5 release is aimed at MPI users who want to benchmark and evaluate the performance of MPI stacks on clusters with NVIDIA GPU adapters and CUDA support.

The CUDA-related features added since the MVAPICH2 1.7GA release are listed below.

  • Support for MPI communication from NVIDIA GPU device memory
    • High performance RDMA-based inter-node point-to-point
      communication (GPU-GPU, GPU-Host and Host-GPU)
    • High performance intra-node point-to-point communication
      for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
    • Communication with contiguous datatype

New features and enhancements in OSU Micro-Benchmarks (OMB) 3.5 (since the OMB 3.4 release) are listed below.

  • Extension of osu_latency, osu_bw, and osu_bibw benchmarks to
    evaluate the performance of MPI_Send/MPI_Recv operations with
    NVIDIA GPU devices and CUDA support
    - This functionality is exposed when configured
      with the --enable-cuda option
  • Flexibility for using buffers in NVIDIA GPU device (D)
    and host memory (H)
  • Flexibility for selecting data movement between D->D,
    D->H and H->D
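
To make the D/H buffer selection and device-memory communication concrete, here is a minimal sketch of a ping-pong in the style of osu_latency. It is not the OMB source: the names buf_loc, parse_location, latency_us, alloc_buf and free_buf, and the HAVE_MPI_CUDA guard, are made up for this illustration. The MPI part assumes an MPI library built with CUDA support, so that device pointers can be passed directly to MPI_Send/MPI_Recv, and it is compiled only when HAVE_MPI_CUDA is defined.

```c
/* Sketch of an osu_latency-style ping-pong with host (H) or device (D)
 * buffers. All helper names are hypothetical, not taken from OMB. */
#include <stddef.h>

typedef enum { LOC_INVALID = 0, LOC_HOST, LOC_DEVICE } buf_loc;

/* Map a command-line argument ("H" or "D") to a buffer location. */
buf_loc parse_location(const char *arg)
{
    if (arg == NULL || arg[0] == '\0' || arg[1] != '\0') return LOC_INVALID;
    if (arg[0] == 'H') return LOC_HOST;
    if (arg[0] == 'D') return LOC_DEVICE;
    return LOC_INVALID;
}

/* One-way latency in microseconds, given the time for `iters` round trips. */
double latency_us(double elapsed_s, int iters)
{
    return elapsed_s * 1e6 / (2.0 * iters);
}

#ifdef HAVE_MPI_CUDA   /* needs a CUDA-aware MPI and the CUDA runtime */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* Allocate a zeroed buffer in host or GPU device memory. */
static void *alloc_buf(buf_loc loc, size_t n)
{
    void *p = NULL;
    if (loc == LOC_DEVICE) {
        if (cudaMalloc(&p, n) != cudaSuccess) return NULL;
        cudaMemset(p, 0, n);
    } else {
        p = calloc(1, n);
    }
    return p;
}

static void free_buf(buf_loc loc, void *p)
{
    if (loc == LOC_DEVICE) cudaFree(p); else free(p);
}

int main(int argc, char **argv)
{
    const int iters = 1000, size = 4096;
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* argv[1] is the buffer location on rank 0, argv[2] on rank 1,
     * e.g. "D D", "D H" or "H D". */
    buf_loc loc0 = (argc == 3) ? parse_location(argv[1]) : LOC_INVALID;
    buf_loc loc1 = (argc == 3) ? parse_location(argv[2]) : LOC_INVALID;
    if (nprocs != 2 || loc0 == LOC_INVALID || loc1 == LOC_INVALID) {
        if (rank == 0) fprintf(stderr, "usage: run 2 ranks with <H|D> <H|D>\n");
        MPI_Finalize();
        return 1;
    }

    buf_loc loc = (rank == 0) ? loc0 : loc1;
    char *buf = alloc_buf(loc, (size_t)size);
    if (buf == NULL) MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        /* buf may be a device pointer; no explicit cudaMemcpy staging. */
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %.2f us\n", size, latency_us(t1 - t0, iters));

    free_buf(loc, buf);
    MPI_Finalize();
    return 0;
}
#endif /* HAVE_MPI_CUDA */
```

The sketch uses the default GPU on each node and a single message size; the real benchmarks sweep over sizes and do warm-up iterations. Building it would typically mean compiling with the MPI compiler wrapper, defining HAVE_MPI_CUDA and linking the CUDA runtime. Depending on the release, CUDA support may also have to be enabled at run time (for example through the MV2_USE_CUDA parameter), so check the user guide for the exact settings.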

Sample performance numbers for MPI communication from NVIDIA GPU memory using MVAPICH2 1.8a1 and OMB 3.5 can be obtained from the following URL:

For downloading MVAPICH2 1.8a1, OMB 3.5, associated user guide, quick start guide, and accessing the SVN, please visit the following URL:

All questions, feedback, bug reports, hints for performance tuning, patches and enhancements are welcome. Please post them to the mvapich-discuss mailing list.



This is an excellent development towards reducing code complexity.

Does this support the peer-2-peer between processes that was introduced with CUDA 4.1? If not, is there an ETA as to when this will come?


This is great news!

As for cross-process P2P, I don’t think MVAPICH2 supports it yet. (As a side note, we’re finishing up the cross-process P2P support that we’ll check into OpenMPI in the near future.)

Great work.
You can also use the CUDA-enabled MVAPICH2 from CUDA Fortran.
This blog post has all the instructions and a working example:

Right now, MVAPICH2 1.8a1 doesn’t have this support (peer-to-peer between processes). MVAPICH2 will have it in the near future.