Running CUDA-Fortran on multiple GPU nodes

vtsakag · March 11, 2021, 4:56am

Hello everyone,

I am trying to run a small CUDA Fortran test program that just prints the number of available devices. Because I plan to run on multiple GPU nodes that do not share memory, I am using MPI for communication between the nodes. My code is the following.

program test
  use mpi
  use cudafor
  implicit none
  integer :: ierr, rank, cpus, gpus

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, cpus, ierr)

  ! Query the number of CUDA devices and print it
  ierr = cudaGetDeviceCount(gpus)
  write(*,*) gpus

  call mpi_finalize(ierr)

end program test

The problem is that when I run the above code on two nodes, each with one GPU, I only get gpus=1. I double-checked with the HPC scheduler, and the code is running on the two specified nodes as intended. So my guess is that there is a mistake in the code and that is why I can’t get gpus=2. Do you have any ideas on how to make the above test code work on two GPU nodes?

Thank you in advance
VT

MatColgrove · March 11, 2021, 4:41pm

This is expected, since cudaGetDeviceCount returns the number of devices on the local system. It’s a local call, so it has no visibility across multiple systems.
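
If a total across nodes is still wanted, one possible approach is to let a single rank per node contribute its local count and sum the contributions with MPI_Allreduce. The following is only a sketch under assumptions that are not stated in this thread: the MPI library supports MPI-3, the communicator returned for MPI_COMM_TYPE_SHARED corresponds to one node, and all ranks on a node see the same devices. The program and variable names are illustrative.

```fortran
program count_gpus
  use mpi
  use cudafor
  implicit none
  integer :: ierr, rank, local_comm, local_rank
  integer :: local_gpus, contrib, total_gpus

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Group the ranks that share a node (assumption: shared memory == one node)
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
       MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)

  ! Local call: counts only the devices visible on this node
  ierr = cudaGetDeviceCount(local_gpus)

  ! Only the first rank on each node contributes, so GPUs are not double counted
  contrib = 0
  if (local_rank == 0) contrib = local_gpus

  call MPI_Allreduce(contrib, total_gpus, 1, MPI_INTEGER, MPI_SUM, &
       MPI_COMM_WORLD, ierr)

  if (rank == 0) write(*,*) 'GPUs on this node:', local_gpus, &
       ' total across nodes:', total_gpus

  call MPI_Comm_free(local_comm, ierr)
  call MPI_Finalize(ierr)
end program count_gpus
```

The total is informational only. Device selection still has to use the local count, because cudaSetDevice takes a device index that is local to the node.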

vtsakag · March 11, 2021, 6:19pm

Is there another way, then, to check the number of devices across the multiple GPU nodes where I am running the code? The plan is to use the number of devices in a call to cudaSetDevice, but I guess that is also a local call. Is there an alternative so I can set the devices on different nodes? On second thought, since I am using MPI, will cudaSetDevice assign the visible device on each node, so that I end up using both available GPUs even though I am getting gpus=1?

Thank you and I appreciate your help here
VT

MatColgrove · March 11, 2021, 11:54pm

Hi VT,

While I don’t have a CUDA Fortran version of this, below is how I do the device assignment with OpenACC. It should be easy to modify for CUDA Fortran.

Basically, I use an MPI-3 feature to get the local rank id for the ranks on a particular system, get the number of devices on the system, then use a mod operation to select which local rank gets which device.

....
      integer :: dev, devNum, local_rank, local_comm
      integer :: devtype   ! delete: not needed for CUDA Fortran
...
      call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
           MPI_INFO_NULL, local_comm, ierr)
      call MPI_Comm_rank(local_comm, local_rank, ierr)

      devtype = acc_get_device_type()  ! delete: not needed for CUDA Fortran
      devNum = acc_get_num_devices(devtype)  ! change this to cudaGetDeviceCount
      dev = mod(local_rank, devNum)
      call acc_set_device_num(dev, devtype)  ! change this to cudaSetDevice
....
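
For reference, applying the substitutions noted in the comments above might give something like the following CUDA Fortran program. This is an untested sketch rather than a verified port: the program name, the error check on the device count, and the final print are additions for illustration, and ierr is reused to hold the CUDA return status.

```fortran
program assign_gpu
  use mpi
  use cudafor
  implicit none
  integer :: ierr, rank, local_comm, local_rank
  integer :: dev, devNum

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! MPI-3: build a communicator of the ranks that share this node
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
       MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)

  ! Number of devices on this node only (local call)
  ierr = cudaGetDeviceCount(devNum)
  if (devNum < 1) then
     write(*,*) 'rank', rank, ': no CUDA device found'
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  ! Round-robin assignment of local ranks to local devices
  dev = mod(local_rank, devNum)
  ierr = cudaSetDevice(dev)

  write(*,*) 'rank', rank, ' local rank', local_rank, &
       ' uses device', dev, ' of', devNum

  call MPI_Comm_free(local_comm, ierr)
  call MPI_Finalize(ierr)
end program assign_gpu
```

With one rank and one GPU per node, every rank gets local_rank = 0 and devNum = 1, so each selects device 0 on its own node. Both GPUs are then in use even though each rank reports a device count of 1.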

-Mat

vtsakag · March 12, 2021, 12:38am

Thank you Mat. That was exactly what I needed. It works perfectly fine.

VT