Multi-GPU MPI launch failing when UVM enabled

Hello all,

I’m trying to test CUDA managed memory (i.e., unified virtual memory – UVM) with OpenACC in a multi-GPU environment. The code is Fortran90 with MPI. I’m using MPI to launch 1 process per GPU and assigning each MPI rank to a unique device [0-3]. One GPU/node works fine. However, when I go to 2 GPUs/node, I occasionally get the following error:

call to cuCtxCreate returned error 101: Invalid device

from one of the MPI ranks and the job terminates. I can repeat the launch and the job will often run after a few tries. That is, the error is intermittent.

When I increase to 4 GPUs/node, the failure rate increases significantly and I can rarely get this to run successfully. When the jobs do run, the solutions are correct.

The GPU device IDs requested when I call

acc_set_device_type(gpu_id, acc_device_nvidia)

are within the range of GPU IDs returned by

acc_get_num_devices( acc_device_nvidia )

pgaccelinfo reports 4 GPUs in ‘exclusive-process’ compute mode.

When I disable managed memory and explicitly control the OpenACC device data regions, I do not have this problem and can run with 4 GPUs/node without issue.
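(By “explicitly control” I mean the usual data directive style, something generic like the snippet below; the arrays and bounds are just illustrative, not from my actual code.)

      !$acc data copyin(a(1:n)) copyout(b(1:n))
      !$acc parallel loop
      do i = 1, n
         b(i) = 2.0d0*a(i)
      end do
      !$acc end data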

I’m using PGI 18.10, but I see the same behavior with 18.7. I’m using the OpenMPI distribution that comes with the PGI release. This is on a Power8 system with 4 P100s/node running RHEL.

I’ve seen this error reported when launching multiple MPI processes per device, where the solution is to enable MPS. I’m not launching multiple MPI processes per device in this scenario, so I’m not sure that applies. Still, I did try starting the MPS daemon with nvidia-cuda-mps-control -d as a normal user, but then the MPI job failed when cuInit was called. (I have no root access and no access to the system logs.) All MPI processes gave the same error:

call to cuInit returned error 999: Unknown

Am I missing something with the job launch configuration? Any help would be greatly appreciated.

Thanks in advance.

Hi cps,

I haven’t encountered this error before, so I’m not sure what’s wrong.

Are you setting anything else such as “CUDA_VISIBLE_DEVICES” when using UVM?

When I’ve gotten an “Invalid Device” error, it was either because I used a wrong gpu_id, or because I created a unified binary (i.e. -ta=tesla:managed,multicore) and then called “acc_set_device_type(acc_device_host)”.

Though, you say that you set the device via:

acc_set_device_type(gpu_id, acc_device_nvidia)

Is this a typo and you meant “acc_set_device_num”? “acc_set_device_type” only has one argument, the device type.
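For reference, the two calls look like this (Fortran interfaces from the openacc module):

    ! selects only the device *type*
    call acc_set_device_type( acc_device_nvidia )
    ! selects a specific device of a given type
    call acc_set_device_num( gpu_id, acc_device_nvidia )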

What’s the full code you use to determine the mapping of ranks to devices?

What flags are you using to compile?


I do find it odd that it only fails with UVM, which shouldn’t be an issue on P100s.

-Mat

Mat,

Yes, that’s a typo. Sorry about that. Here’s the cut-n-paste of the actual device selection code:

      ngpus = acc_get_num_devices( acc_device_nvidia )
      if( ngpus .gt. 0 ) then
         ngpus_rt = ngpus
         call getenv('NGPUS_RT', envstr)
         if (envstr .ne. ' ') then
            read(envstr,*,iostat=ios) ngpus_rt
         endif
         gpu_id = mod( myid, min(ngpus,ngpus_rt) )
         call acc_set_device_num( gpu_id, acc_device_nvidia )
      else
         !/* no NVIDIA GPUs available */
         call acc_set_device_type( acc_device_host )
      endif

I’m using ‘-acc -ta=tesla:cc60,cuda9.2’ to compile with pgfortran (via mpifort).

I haven’t set CUDA_VISIBLE_DEVICES, and when I check it from a little bash script launched with mpirun it’s empty (i.e., not defined).

I tried adding a delay before each rank called acc_set_device_num, and then added an MPI_Barrier (inside a loop over the number of MPI processes) so that each rank set its device one at a time, but that didn’t help. (I thought there might be some issue related to a contended resource.)
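Roughly, the serialized version looked like this (reconstructed from memory; nprocs, i, and ierr are illustrative, while myid and gpu_id are as in the snippet above):

      ! serialize device selection: one rank at a time sets its device
      do i = 0, nprocs-1
         if( myid .eq. i ) then
            call acc_set_device_num( gpu_id, acc_device_nvidia )
         endif
         call MPI_Barrier( MPI_COMM_WORLD, ierr )
      enddo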

I’m stumped.

Thanks,
Chris

Hi Chris,

Are you using “use openacc” to include the OpenACC module and hence get the proper interfaces to the OpenACC API calls?

If not, what could be happening is that, since “acc_get_num_devices” doesn’t have a proper interface, it’s returning a bad value and the code falls through to setting “acc_device_host” as the device. Enabling UVM and using “acc_device_host” would cause this error. Without UVM there’s no error, but you’d still be running on the host instead of the GPU.

FYI, here’s the code I use to map ranks to devices. It uses an MPI-3 call to determine each rank’s local rank on the node and then round-robins the device assignment.

#ifdef _OPENACC
    use openacc
#endif
    use mpi
    IMPLICIT NONE

    INTEGER :: err,rank,size
#ifdef _OPENACC
    integer :: dev, devNum, local_rank, local_comm
    integer :: devtype
#endif
    rank=0
    size=1

    CALL MPI_INIT(err)
    CALL MPI_COMM_RANK(MPI_COMM_WORLD,rank,err)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD,size,err)

#ifdef _OPENACC
    ! Set the local device
    call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
         MPI_INFO_NULL, local_comm,err)
    call MPI_Comm_rank(local_comm, local_rank,err)
    devtype = acc_get_device_type()
    devNum = acc_get_num_devices(devtype)
    dev = mod(local_rank,devNum)
    call acc_set_device_num(dev, devtype)
#endif

-Mat

Mat,

Nuts, I am including the OpenACC module. Still stuck when using UVM. I’ll look for a different multi-GPU system and see if this problem is specific to my platform.

And thanks for sharing the MPI code. I’ve been recycling a very dusty method based on hostnames and hadn’t thought about using the shared-memory communicator for this purpose. Much cleaner and more portable. Thanks again!

Chris

Hi Chris,

In your code, can you try making the program abort if there are no devices available instead of setting the device to the host?

There is a known issue when using either “-ta=tesla:managed” or “-ta=tesla:pinned” and then setting the device to the host. The problem is that both require a CUDA context, which isn’t created when running on the host.
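Something like this, as an untested sketch reusing your variable names (with an “ierr” added for the MPI error argument):

      ngpus = acc_get_num_devices( acc_device_nvidia )
      if( ngpus .le. 0 ) then
         ! fail fast instead of falling back to the host
         write(*,*) 'No NVIDIA GPUs found; aborting since managed memory needs a GPU context'
         call MPI_Abort( MPI_COMM_WORLD, 1, ierr )
      endif
      gpu_id = mod( myid, ngpus )
      call acc_set_device_num( gpu_id, acc_device_nvidia )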

In 19.1, we’ll be adding a new API call, “acc_set_host_only()”, which will work around this issue. The caveat is that the program can’t then set the device type to a GPU later in the program.

-Mat