I tried to use MPI+OpenACC and !$acc set device_num( ) to set the device.
When I run the program, I receive the following message:
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
I get incorrect results from 2 of the 4 domains. I’m using a system with four V100s.
Can you tell me which compiler version you’re using, as well as the MPI version (i.e. are you using the OpenMPI we provide with the compilers or your own local build)?
Also, how are you setting the device ordering? Are you using the rank id to map to the device? Can you provide a small reproducing example showing what the code is doing?
I have not seen this error before, but I presume incorrect values are being used for the device number. Here’s the generic code I use for the rank-to-device mapping. It has the advantage that it uses the local rank id, so it works correctly when running on multiple nodes. It then round-robins the device assignment.
integer :: dev, devNum, local_rank, local_comm, ierr
integer(acc_device_kind) :: devtype
! ****** Set the Accelerator device number based on local rank
call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
     MPI_INFO_NULL, local_comm,ierr)
call MPI_Comm_rank(local_comm, local_rank,ierr)
devtype = acc_get_device_type()
devNum = acc_get_num_devices(devtype)
dev = mod(local_rank,devNum)
call acc_set_device_num(dev, devtype)
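For anyone who wants to check the mapping on their own system, here is a minimal sketch of how that fragment might sit in a complete program. The program name, the world_rank variable, the devNum guard, and the print statements are additions for illustration and are not part of the original fragment. It assumes an MPI library that provides the mpi module and a compiler with OpenACC enabled (something like mpif90 -acc, though the exact command depends on your install).

```fortran
! Sketch only: assign one GPU per MPI rank using the node-local rank.
program set_device
  use mpi
  use openacc
  implicit none
  integer :: dev, devNum, local_rank, local_comm, world_rank, ierr
  integer(acc_device_kind) :: devtype

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

  ! Group the ranks that share a node, so local_rank restarts at 0 on each node
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
       MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)

  ! Query the default device type and how many such devices this node has
  devtype = acc_get_device_type()
  devNum  = acc_get_num_devices(devtype)

  if (devNum > 0) then
     ! Round-robin the local ranks over the available devices
     dev = mod(local_rank, devNum)
     call acc_set_device_num(dev, devtype)
     print *, 'rank', world_rank, 'local rank', local_rank, &
          'using device', acc_get_device_num(devtype), 'of', devNum
  else
     ! Illustrative guard (an addition): avoids mod() by zero if no GPU is visible
     print *, 'rank', world_rank, 'found no accelerator devices'
  end if

  call MPI_Comm_free(local_comm, ierr)
  call MPI_Finalize(ierr)
end program set_device
```

Run with four ranks on a node with four GPUs, each rank should report a different device number. If two ranks report the same device, the mapping is the place to look; if the mapping looks right, the problem is more likely elsewhere.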
The compiler is NVIDIA HPC SDK 21.7, and I am using your boilerplate code to set the device number (from “Failure when using OpenACC after MPI_Init”, post #2 by MatColgrove).
Ok. Again, I’ve never seen this error before, but after a web search, I believe the “devices” in the error message refers to the network devices (HCAs), not the GPUs. So it’s likely a system or configuration issue.
Maybe this thread will give you some ideas: “Is OpenMPI supporting RDMA?” (Issue #5789, open-mpi/ompi on GitHub).