Unified Memory Problem

It appears that when I compile with -stdpar=gpu (which turns on unified memory), my !$acc set device_num(igpu) is being ignored.

  • Miko

Hi Miko,

Do you have a reproducing example? I tried to recreate the issue with the following code, but it’s working as expected.

% cat test.F90
program main
   real, allocatable, dimension(:) :: arr
   integer i

!$acc set device_num(2)

   allocate(arr(1024))
   do concurrent (i=1:1024)
      arr(i) = 1.0
   enddo
   print *, arr(1:10)

end program main
% setenv NV_ACC_NOTIFY 1
% nvfortran test.F90 -stdpar=gpu -V21.11 ; a.out
launch CUDA kernel  file=test.F90 function=main line=10 device=2 threadid=1 num_gangs=8 num_workers=1 vector_length=128 grid=8 block=128
    1.000000        1.000000        1.000000        1.000000
    1.000000        1.000000        1.000000        1.000000
    1.000000        1.000000


Here is the code I am working with: GitHub - predsci/POT3D (High Performance Potential Field Solver).

If I turn on unified memory with -gpu=managed, the code runs incorrectly across GPUs.

For example, if I run with “mpiexec -np 2”, it spawns two processes on GPU 0 and one process on GPU 1 (three in total instead of two), instead of one process on each GPU.

  • Miko

Ok, so “set device” is working fine; it’s just that you’re seeing an extra CUDA context being created?

This is because you’re allocating data prior to setting the device. Since the data is managed, a CUDA context is created on the default device (0) for each rank. To fix this, move the set device directive before any allocation, for example just after the call to ‘init_mpi’.

Note that using the global rank id to set the device number will be problematic if you run the code over multiple nodes. I’d suggest you use the local rank id rather than the global id, as shown below. The one caveat is that this requires OpenACC API calls, so the code needs to use the “openacc” module. I’ve added preprocessing macros for portability along with the flags “-Mpreprocess -acc”. Note that renaming the file from ‘pot3d.f’ to ‘pot3d.F’ (upper-case ‘F’ suffix) will have all compilers enable preprocessing by default.


#ifdef _OPENACC
      use openacc
#endif
      implicit none
      integer :: ierr,igpu

c ****** Initialize MPI.
      call init_mpi

c ****** Set the GPU device number based on the local rank and the
c ****** number of available devices.
#ifdef _OPENACC
      igpu = mod(iprocsh,acc_get_num_devices(acc_get_device_type()))
!$acc set device_num(igpu)
#endif

In the code, iprocsh is the rank number from an MPI shared communicator, so it works across multiple nodes (since it is a local rank number) without the need for the API calls.
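For readers who want the same scheme in their own codes, here is a hedged sketch (the routine and variable names are illustrative, not POT3D’s actual source) of how a shared-communicator local rank like iprocsh can be obtained:

```fortran
c ****** Sketch: get the per-node (local) rank from an MPI
c ****** shared-memory communicator.
      subroutine get_local_rank (iprocsh)
      use mpi
      implicit none
      integer :: iprocsh,comm_shared,ierr
c
c ****** Split MPI_COMM_WORLD into one communicator per node.
      call MPI_Comm_split_type (MPI_COMM_WORLD,MPI_COMM_TYPE_SHARED,
     &                          0,MPI_INFO_NULL,comm_shared,ierr)
c
c ****** The rank within that communicator is the local rank.
      call MPI_Comm_rank (comm_shared,iprocsh,ierr)
      call MPI_Comm_free (comm_shared,ierr)
c
      return
      end
```

Ranks on the same node get contiguous local ids starting at 0, which is exactly what the mod-based device assignment above needs.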

Another question I have though is:

What is the behavior of !$acc set device_num(i) if i is not a valid device number?

For example, let’s say I have 4 GPUs on a system, but run the code with 8 MPI ranks. The shared-communicator ranks will be 0->7, but the valid device numbers are only 0->3. What happens with !$acc set device_num(5)?

On a test I did with 1 GPU and 2 MPI ranks, it looks like the device number defaulted to device 0 and did indeed correctly oversubscribe the GPU (I got the right answer). Will it do this (set to 0) for any invalid device number, or will it do a mod with the total number of devices (thus distributing the ranks evenly)?


– Ron

Hi Ron,

In the above code snippet, I used a “mod” operation to effectively round-robin the device assignment. If more ranks are used than there are available devices, then multiple ranks will be assigned to the same GPUs.

The OpenACC standard doesn’t specify what happens when set device_num is called with a GPU id that doesn’t exist. Our OpenACC runtime will round-robin (e.g. with one GPU and two ranks, setting device_num to 1 or 2 will have both ranks use GPU 0). Though I can’t guarantee other implementations wouldn’t give an error, so you may want to try GNU and/or Cray before deciding what you should do. Using the mod operation with the number of devices should work no matter the compiler.
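To make the mod-based assignment concrete, here is a small standalone sketch (illustrative only, not part of POT3D) showing how local ranks map onto devices:

```fortran
program roundrobin
   implicit none
   integer :: rank
   integer, parameter :: ndev = 4   ! e.g. 4 GPUs per node

!  8 local ranks (0-7) wrap onto devices 0-3:
!  ranks 0-3 -> devices 0-3, ranks 4-7 -> devices 0-3 again.
   do rank = 0, 7
      print *, 'rank', rank, ' -> device', mod(rank, ndev)
   enddo

end program roundrobin
```

Because mod(rank, ndev) is always in 0..ndev-1, the device id passed to set device_num is guaranteed valid regardless of how the runtime treats out-of-range ids.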



OK that makes sense.

We provide instructions with our code to run it with the number of MPI ranks per node equal to the number of GPUs per node, so that device selection based on the MPI shared-communicator rank works correctly.

We want to avoid using the API for compatibility purposes. We do not currently preprocess the code, so adding IFDEFs would require a good number of build-script changes across the multiple packages where the code resides.

Is there any directive-only way to select the device in the way you show?
Could one be added in the future?

– Ron

The problem is that the code needs to call acc_get_num_devices, and there isn’t a way to return a value from a directive.

Feel free to send a request to the OpenACC technical committee. They may be able to standardize the behavior of set device.


I am trying to use Nsight Systems to profile the POT3D code, but I am having problems with the managed-memory version. I use the following command:

nsys profile --stats=true mpiexec -np 4 ./pot3d

I find that the profiling works correctly when I do not have managed memory turned on, but when I compile with managed memory and try to do an Nsight profile, the run ends with an “Aborted (core dumped)” error. The code runs and I get the right answer, but the profile output is nowhere to be found.

Any help would be appreciated.

— Miko

Sorry you’re having issues Miko, but unfortunately I don’t know what’s wrong. I just tried with the POT3D version I have and it seemed to work correctly.

You can try updating Nsight Systems (I used 2021.5) to see if it’s something they’ve fixed (NVIDIA Nsight Systems | NVIDIA Developer).


I just want to clarify that this only happens when the code is compiled with “-gpu=managed”. I currently have “NVIDIA Nsight Systems version 2021.4.1.73-08591f7”.


The version of Nsight Systems you are using has a known bug: it will crash at the end, as you reported.

Get NsightSystems-linux-public-2021.5.1.118-f89f9cd.run from https://developer.download.nvidia.com/devtools/nsight-systems/.


I updated my Nsight version, and it did the trick.

Thank you.
