Running CUDA-Fortran on multiple GPU nodes

Hello everyone,

I am trying to run a small test program in CUDA-Fortran that just prints the number of devices available. Because I plan to run on multiple GPU nodes that do not share memory, I use MPI for communication between the nodes. My code is the following.

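program test
  use mpi
  use cudafor
  implicit none
  integer :: ierr, rank, cpus, gpus

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, cpus, ierr)

  ! Query the number of CUDA devices visible to this process
  ierr = cudaGetDeviceCount(gpus)
  write(*,*) gpus

  call mpi_finalize(ierr)
end program test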

The problem is that when I run the above code on two nodes, each with one GPU, I only get gpus=1. I double-checked with the HPC scheduler, and the code is running as intended on the two nodes specified. So my guess is that there is a mistake in the code, and that’s the reason I can’t get gpus=2. Do you have any ideas on how to make the above test code work on two GPU nodes?

Thank you in advance

This is expected, since cudaGetDeviceCount reports the number of devices on the local system. It’s a local call, so it has no visibility across multiple systems.

Is there another way, then, to check the number of devices across the multiple GPU nodes where I am running the code? The plan is to use the number of devices in the call to cudaSetDevice(), but I guess that’s also a local call. Is there an alternative to that call, so I can set the devices on different nodes? On second thought, since I am using MPI, will cudaSetDevice() on each rank select the device visible on that rank’s node, so that I end up using both of the available GPUs even though I am getting gpus=1?

Thank you and I appreciate your help here

Hi VT,

While I don’t have a CUDA Fortran version of this, below is how I do the device assignment with OpenACC. Should be easy to modify for CUDA.

Basically, I use an MPI-3 feature to get the local rank id for the ranks on a particular system, get the number of devices on the system, then use a mod operation to select which local rank gets which device.

      integer :: dev, devNum, local_rank, local_comm
      integer :: devtype   ! delete: not needed for CUDA Fortran

      call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
           MPI_INFO_NULL, local_comm, ierr)
      call MPI_Comm_rank(local_comm, local_rank, ierr)

      devtype = acc_get_device_type()        ! delete: not needed for CUDA Fortran
      devNum = acc_get_num_devices(devtype)  ! change this to cudaGetDeviceCount
      dev = mod(local_rank, devNum)
      call acc_set_device_num(dev, devtype)  ! change this to cudaSetDevice
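
For reference, the changes marked in the comments above might look roughly like this in CUDA Fortran. This is only an untested sketch, not a verified implementation: the program name and the variables istat, contrib, and total_gpus are made up for illustration, and it assumes every rank on a node sees the same set of devices (for example, that the launcher does not set CUDA_VISIBLE_DEVICES per rank). It also shows one possible way to get a job-wide device total, by letting a single rank per node contribute its local count to an MPI_Allreduce.

```fortran
program select_gpu
  use mpi
  use cudafor
  implicit none
  integer :: ierr, istat, rank
  integer :: local_comm, local_rank
  integer :: devNum, dev
  integer :: contrib, total_gpus   ! illustrative names, not from the thread

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Ranks that share a node end up in the same local_comm (MPI-3 feature).
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)

  ! Local call: number of devices visible to this rank on its own node.
  devNum = 0
  istat = cudaGetDeviceCount(devNum)
  if (istat /= cudaSuccess .or. devNum < 1) then
     write(*,*) 'rank ', rank, ': no usable CUDA device found'
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  ! Round-robin the node-local ranks over the devices of that node.
  dev = mod(local_rank, devNum)
  istat = cudaSetDevice(dev)
  if (istat /= cudaSuccess) then
     write(*,*) 'rank ', rank, ': cudaSetDevice failed for device ', dev
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  ! Job-wide total: only local rank 0 of each node contributes,
  ! so a node running several ranks is not counted more than once.
  contrib = 0
  if (local_rank == 0) contrib = devNum
  call MPI_Allreduce(contrib, total_gpus, 1, MPI_INTEGER, MPI_SUM, &
                     MPI_COMM_WORLD, ierr)

  write(*,*) 'rank ', rank, ' uses local device ', dev
  if (rank == 0) write(*,*) 'GPUs across all nodes: ', total_gpus

  call MPI_Comm_free(local_comm, ierr)
  call MPI_Finalize(ierr)
end program select_gpu
```

With one rank per node and one GPU per node, each rank should select device 0 on its own node, so both GPUs are used even though cudaGetDeviceCount returns 1 on every rank.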


Thank you Mat. That was exactly what I needed. It works perfectly fine.