Multi-GPU MPI launch failing when UVM enabled

Hello all,

I’m trying to test CUDA managed memory (i.e., unified virtual memory – UVM) with OpenACC in a multi-GPU environment. The code is Fortran90 with MPI. I’m using MPI to launch 1 process per GPU and assigning each MPI rank to a unique device [0-3]. One GPU/node works fine. However, when I go to 2 GPUs/node, I occasionally get the following error:

call to cuCtxCreate returned error 101: Invalid device

from one of the MPI ranks and the job terminates. I can repeat the launch and the job will often run after a few tries. That is, the error is intermittent.

When I increase to 4 GPUs/node, the failure rate increases significantly and I can rarely get this to run successfully. When the jobs do run, the solutions are correct.

The GPU device IDs requested when I call

acc_set_device_type(gpu_id, acc_device_nvidia)

are within the range of GPU IDs returned by

acc_get_num_devices( acc_device_nvidia )

pgaccelinfo reports 4 GPUs in ‘exclusive-process’ compute mode.

When I disable managed memory and explicitly control the OpenACC device data regions, I do not have this problem and can run with 4 GPUs/node without issue.
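(By “explicitly control” I mean the usual data directive style, something generic like the snippet below; the arrays and bounds are just illustrative, not from my actual code.)

      !$acc data copyin(a(1:n)) copyout(b(1:n))
      !$acc parallel loop
      do i = 1, n
         b(i) = 2.0d0*a(i)
      end do
      !$acc end data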

I’m using PGI 18.10, but I see the same behavior with 18.7. I’m using the OpenMPI distribution that comes with the PGI release. This is on a Power8 system with 4 P100s/node running RHEL.

I’ve seen this error reported when launching multiple MPI processes per device, where the solution is to enable MPS. I’m not launching multiple MPI processes per device in this scenario, so I’m not sure that applies. Still, I did try starting the MPS daemon with nvidia-cuda-mps-control -d as a normal user, but then the MPI job failed when cuInit was called. (I have no root access and no access to the system logs.) All MPI processes gave the same error:

call to cuInit returned error 999: Unknown

Am I missing something with the job launch configuration? Any help would be greatly appreciated.

Thanks in advance.

Hi cps,

I haven’t encountered this error before, so I’m not sure what’s wrong.

Are you setting anything else such as “CUDA_VISIBLE_DEVICES” when using UVM?

When I’ve gotten an “Invalid Device” error, it was either because I used a wrong gpu_id, or because I created a unified binary (i.e. -ta=tesla:managed,multicore) and then called “acc_set_device_type(acc_device_host)”.

Though, you say that you set the device via:

acc_set_device_type(gpu_id, acc_device_nvidia)

Is this a typo and you meant “acc_set_device_num”? “acc_set_device_type” only has one argument, the device type.
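For reference, the two calls look like this (Fortran interfaces from the openacc module):

    ! selects only the device *type*
    call acc_set_device_type( acc_device_nvidia )
    ! selects a specific device of a given type
    call acc_set_device_num( gpu_id, acc_device_nvidia )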

What’s the full code you use to determine the mapping of ranks to devices?

What flags are you using to compile?


I do find it odd that it only fails with UVM, which shouldn’t be an issue on P100s.

-Mat

Mat,

Yes, that’s a typo. Sorry about that. Here’s the cut-n-paste of the actual device selection code:

      ngpus = acc_get_num_devices( acc_device_nvidia )
      if( ngpus .gt. 0 ) then
         ngpus_rt = ngpus
         call getenv('NGPUS_RT', envstr)
         if (envstr .ne. ' ') then
            read(envstr,*,iostat=ios) ngpus_rt
         endif
         gpu_id = mod( myid, min(ngpus,ngpus_rt) )
         call acc_set_device_num( gpu_id, acc_device_nvidia )
      else
         !/* no NVIDIA GPUs available */
         call acc_set_device_type( acc_device_host )
      endif

I’m using ‘-acc -ta=tesla:cc60,cuda9.2’ to compile with pgfortran (via mpifort).

I haven’t set CUDA_VISIBLE_DEVICES, and when I check it from a little bash script launched with mpirun it’s empty (i.e., not defined).

I tried adding a delay before each rank called acc_set_device_num, and then added an MPI_Barrier (inside a loop over the number of MPI processes) so that each rank set its device one at a time, but that didn’t help. (I thought there might be some issue related to a contended resource.)
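Roughly, the serialized version looked like this (reconstructed from memory; nprocs, i, and ierr are illustrative, while myid and gpu_id are as in the snippet above):

      ! serialize device selection: one rank at a time sets its device
      do i = 0, nprocs-1
         if( myid .eq. i ) then
            call acc_set_device_num( gpu_id, acc_device_nvidia )
         endif
         call MPI_Barrier( MPI_COMM_WORLD, ierr )
      enddo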

I’m stumped.

Thanks,
Chris

Hi Chris,

Are you using “use openacc” to include the OpenACC module and hence get the proper interfaces to the OpenACC API calls?

If not, what could be happening is that, since “acc_get_num_devices” doesn’t have a proper interface, it’s returning a bad value and the code falls through to setting “acc_device_host” as the device. Enabling UVM and using “acc_device_host” would cause this error. Without UVM there’s no error, but you’d still be running on the host instead of the GPU.

FYI, here’s the code I use to map ranks to devices. It uses an MPI-3 call to determine each rank’s local rank on the node and then round-robins the device assignment.

#ifdef _OPENACC
    use openacc
#endif
    use mpi
    IMPLICIT NONE

    INTEGER :: err,rank,size
#ifdef _OPENACC
    integer :: dev, devNum, local_rank, local_comm
    integer :: devtype
#endif
    rank=0
    size=1

    CALL MPI_INIT(err)
    CALL MPI_COMM_RANK(MPI_COMM_WORLD,rank,err)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD,size,err)

#ifdef _OPENACC
    ! Set the local device
    call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
         MPI_INFO_NULL, local_comm,err)
    call MPI_Comm_rank(local_comm, local_rank,err)
    devtype = acc_get_device_type()
    devNum = acc_get_num_devices(devtype)
    dev = mod(local_rank,devNum)
    call acc_set_device_num(dev, devtype)
#endif

-Mat

Mat,

Nuts, I am including the OpenACC module. Still stuck when using UVM. I’ll look for a different multi-GPU system and see if this problem is specific to my platform.

And thanks for sharing the MPI code. I’ve been recycling a very dusty method based on hostnames and hadn’t thought about using the shared-memory communicator for this purpose. Much cleaner and more portable. Thanks again!

Chris

Hi Chris,

In your code, can you try making the program abort if there are no devices available instead of setting the device to the host?

There is a known issue when using either “-ta=tesla:managed” or “-ta=tesla:pinned” and then setting the device to the host. The problem is that both require a CUDA context, which isn’t created when running on the host.
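Something like this, as an untested sketch reusing your variable names (with an “ierr” added for the MPI error argument):

      ngpus = acc_get_num_devices( acc_device_nvidia )
      if( ngpus .le. 0 ) then
         ! fail fast instead of falling back to the host
         write(*,*) 'No NVIDIA GPUs found; aborting since managed memory needs a GPU context'
         call MPI_Abort( MPI_COMM_WORLD, 1, ierr )
      endif
      gpu_id = mod( myid, ngpus )
      call acc_set_device_num( gpu_id, acc_device_nvidia )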

In 19.1, we’ll be adding a new API call, “acc_set_host_only()”, which will work around this issue. The caveat is that the program can’t then set the device type to a GPU later in the program.

-Mat