Setting the Desired Device

Hi,

I am using MPI + OpenACC, with !$acc set device_num( ) to select the device.

When I run the program, I receive the following message:

[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.

I get erroneous results from 2 of the 4 domains. I’m using a system with four V100s.

Thanks
Ayhan

Hi Ayhan,

Can you tell me which compiler version you’re using, as well as the MPI version (i.e. are you using the OpenMPI we provide with the compilers or your own local build)?

Also, how are you setting the device ordering? Are you using the rank id to map to the device? Can you provide a small reproducing example showing what the code is doing?

I have not seen this error before, but I presume incorrect values are being used for the device number. Here’s the generic code I use for the rank-to-device mapping. It has the advantage that it uses the local rank id, so it works correctly when running on multiple nodes. It then assigns devices round-robin.

#ifdef _OPENACC
      use openacc
#endif
...
#ifdef _OPENACC
      integer :: dev, devNum, local_rank, local_comm, ierr
      integer(acc_device_kind) :: devtype
#endif
...
#ifdef _OPENACC
!
! ****** Set the Accelerator device number based on local rank
!
      call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
           MPI_INFO_NULL, local_comm, ierr)
      call MPI_Comm_rank(local_comm, local_rank, ierr)
      devtype = acc_get_device_type()
      devNum = acc_get_num_devices(devtype)
      dev = mod(local_rank, devNum)
      call acc_set_device_num(dev, devtype)
#endif
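
The round-robin arithmetic can be checked on its own, without MPI or a GPU. The following is only a minimal sketch and not part of the code above: `device_for_rank` is a hypothetical helper, and the layout of 8 local ranks on a node with 4 GPUs is an assumed example. It also guards against a device count of zero, which would otherwise make `mod` divide by zero.

```fortran
! Sketch only: device_for_rank is a hypothetical helper,
! not part of OpenACC or MPI.
module device_mapping
  implicit none
contains
  ! Round-robin map a node-local rank onto one of num_devices devices.
  ! Returns -1 when no device is available (or the rank is negative),
  ! so the caller can skip acc_set_device_num instead of calling
  ! mod() with a zero divisor.
  pure integer function device_for_rank(local_rank, num_devices) result(dev)
    integer, intent(in) :: local_rank, num_devices
    if (num_devices <= 0 .or. local_rank < 0) then
      dev = -1
    else
      dev = mod(local_rank, num_devices)
    end if
  end function device_for_rank
end module device_mapping

program show_mapping
  use device_mapping
  implicit none
  integer :: rank
  ! Assumed layout: 8 ranks on one node with 4 GPUs.
  do rank = 0, 7
    print '(a,i0,a,i0)', 'local rank ', rank, ' -> device ', &
          device_for_rank(rank, 4)
  end do
end program show_mapping
```

With these assumed values, local ranks 0 to 3 map to devices 0 to 3, and ranks 4 to 7 wrap around to devices 0 to 3 again. In the real code, a result of -1 would mean no device of the requested type was found, so `acc_set_device_num` should not be called.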

-Mat

Hi Mat,

The compiler is NVIDIA HPC SDK 21.7, and I am using your boilerplate code to set the device number (from the thread “Failure when using OpenACC after MPI_Init”, post #2 by MatColgrove).

Ayhan

Ok. Again, I’ve never seen this error before, but after a web search, I believe the “devices” in the error message are network devices (an HCA is a host channel adapter, i.e. the InfiniBand interface), not the GPUs. So it’s likely a system or configuration issue.

This thread may give you some ideas: “Is OpenMPI supporting RDMA?” (Issue #5789, open-mpi/ompi on GitHub).