Multi-GPU Unified Memory and Communication

Hi Mat,

I am planning to build a more powerful GPU server. I am debating between a multi-GPU system (with cheaper GPUs) and a single powerful GPU (more expensive).

Before I venture out on that, I am wondering about the following:

  1. whether Unified Memory (UM), when using multiple devices with compute capability (CC) > 6.x, treats the multiple GPUs like a single GPU.
  2. whether GPU-GPU data communication is automatically handled by the compiler, so that I don’t have to make any changes to my single-GPU OpenACC code.

Would appreciate your input on the above. As always, if there is any literature on this, please feel free to point me to it.

Cheers,
Jyoti

Hi Jyoti,

Multiple GPUs are treated separately, so you need to use MPI with each rank assigned to a particular GPU. There are other methods to support multi-GPU, but I find MPI the easiest, and it also lets you scale across multiple systems in the future.

CUDA-aware MPI, which does GPU-direct communication, is enabled by default with the MPI versions we ship with the compilers. However, you need to pass device pointers to the MPI calls by using an OpenACC “host_data” region. Passing UM pointers will work, but MPI won’t recognize these as device pointers, so it won’t use GPU direct. Hence, if using MPI, I recommend you manually manage your data via data regions.
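To give you an idea, here’s a minimal sketch (not from any particular post) of what that pattern looks like in Fortran: a device-resident buffer managed by an explicit data region, with its device address handed to MPI through “host_data” so a CUDA-aware MPI can do the transfer GPU-to-GPU. The buffer name, size, and the ring-style Sendrecv are just illustrative, and it assumes device assignment is done first as in the device_assign.F90 example further down.

      program ring_exchange
      use openacc
      use mpi
      implicit none
      integer, parameter :: n = 1024
      real(8) :: buf(n)
      integer :: i, ierr, my_rank, nranks, next, prev
      integer :: stat(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank (MPI_COMM_WORLD, my_rank, ierr)
      call MPI_Comm_size (MPI_COMM_WORLD, nranks, ierr)
!
! ****** Device assignment (acc_set_device_num) as in device_assign.F90
! ****** would go here.
!
      next = mod(my_rank+1, nranks)
      prev = mod(my_rank-1+nranks, nranks)
!
! ****** Manually managed data region: buf lives on the device here.
!
!$acc data copy(buf)
!
! ****** Fill the buffer on the device.
!
!$acc parallel loop
      do i = 1, n
        buf(i) = real(my_rank, 8)
      enddo
!
! ****** host_data exposes the device address of buf to the MPI call,
! ****** so a CUDA-aware MPI can move the data GPU-to-GPU (GPU direct).
!
!$acc host_data use_device(buf)
      call MPI_Sendrecv_replace (buf, n, MPI_REAL8, next, 0, &
                                 prev, 0, MPI_COMM_WORLD, stat, ierr)
!$acc end host_data
!$acc end data

      print *, "RANK: ", my_rank, " received buf(1) = ", buf(1)
      call MPI_Finalize(ierr)
      end program ring_exchange

Without the host_data region, MPI would be handed the host copy of buf and you’d pay for extra host-device transfers instead of the direct GPU-to-GPU path.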

See the following post for the code I use for device assignment, as well as links to some training:

-Mat

Wonderful! Thanks, Mat!

Oh, there’s also a class you can take which is derived from Jiri’s talks: https://developer.nvidia.com/openacc-advanced-course

Also, since the link I pointed you to uses C/C++, here’s the equivalent Fortran version to do device assignment.

% cat device_assign.F90
!#######################################################################
      program device_assign
!
!-----------------------------------------------------------------------
!
      use openacc
      use mpi
      implicit none
      integer :: dev, devNum, local_rank, local_comm
      integer(acc_device_kind) :: devtype
      integer :: ierr, my_rank
!
!-----------------------------------------------------------------------
!
! ****** Initialize MPI.
!
      call MPI_Init(ierr)
!
! ****** Get the index (rank) of the local processor in
! ****** communicator MPI_COMM_WORLD in variable my_rank.
!
      call MPI_Comm_rank (MPI_COMM_WORLD,my_rank,ierr)
!
! ****** Set the Accelerator device number based on local rank.
!
      call MPI_Comm_split_type (MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                                MPI_INFO_NULL, local_comm, ierr)
      call MPI_Comm_rank (local_comm, local_rank, ierr)
      devtype = acc_get_device_type()
      devNum = acc_get_num_devices(devtype)
      dev = mod(local_rank, devNum)
      call acc_set_device_num(dev, devtype)
      print *, "RANK: ", my_rank, " Using device: ", dev

      call MPI_Finalize(ierr)
      end program device_assign
% mpif90 -acc device_assign.F90; mpirun -np 4 a.out
 RANK:             0  Using device:             0
 RANK:             2  Using device:             0
 RANK:             1  Using device:             1
 RANK:             3  Using device:             1
