It appears that when I compile with -stdpar=gpu (which turns on unified memory), my !$acc set device_num(igpu) is being ignored.
- Miko
Hi Miko,
Do you have a reproducing example? I tried to recreate the issue with the following code, but it's working as expected.
% cat test.F90
program main
  real, allocatable, dimension(:) :: arr
  integer i, devnum
  allocate(arr(1024))
  !$acc set device_num(2)
  do concurrent (i=1:1024)
    arr(i) = 1.0
  enddo
  print *, arr(1:10)
  deallocate(arr)
end program main
% setenv NV_ACC_NOTIFY 1
% nvfortran test.F90 -stdpar=gpu -V21.11 ; a.out
launch CUDA kernel file=test.F90 function=main line=10 device=2 threadid=1 num_gangs=8 num_workers=1 vector_length=128 grid=8 block=128
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000
-Mat
Here is the code I am working with: GitHub - predsci/POT3D: POT3D: High Performance Potential Field Solver.
If I turn on unified memory with -gpu=managed, the code runs incorrectly across GPUs.
For example, if I run with "mpiexec -np 2", it will spawn two processes on GPU 0 and 1 process on GPU 1 (3 total processes instead of 2), instead of spawning one process on each.
Ok, so "set device" is working fine, it's just that you're seeing the extra CUDA context being created?
This is because you're allocating data prior to setting the device. Since the data is managed, a CUDA context is created on the default device, 0, for each rank. To fix, move the set device directive prior to allocation, such as just after the call to "init_mpi".
Note that using the global rank id to set the device number will be problematic if you run the code over multiple nodes. I'd suggest you use the local rank id rather than the global id, as shown below. The one caveat is that this requires OpenACC API calls, so the code needs to include the "openacc" module. I've added preprocessing macros for portability and the flags "-Mpreprocess -acc". Note that changing the name of the file from "pot3d.f" to "pot3d.F" (upper-case "F" suffix) will have all compilers enable preprocessing by default.
-Mat
#ifdef _OPENACC
use openacc
#endif
c
c-----------------------------------------------------------------------
c
implicit none
c
c-----------------------------------------------------------------------
c
integer :: ierr,i
c
c-----------------------------------------------------------------------
c
c ****** Initialize MPI.
c
call init_mpi
c
c ****** Set the GPU device number based on rank and gpn.
c
#ifdef _OPENACC
igpu = mod(iprocsh,acc_get_num_devices(acc_get_device_type()))
!$acc set device_num(igpu)
#endif
Hi,
In the code, iprocsh is the rank number from an MPI shared communicator, so it works with multiple nodes (as it is a local rank number) without the need for the API calls.
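For readers wondering how a node-local rank like iprocsh can be obtained, here is a minimal MPI-3 sketch using a shared-memory communicator. The names (comm_shared, iprocsh) are illustrative and not necessarily POT3D's actual code:

```fortran
! Sketch: derive a node-local rank via an MPI-3 shared-memory
! communicator. Variable names are illustrative, not from POT3D.
program local_rank
  use mpi
  implicit none
  integer :: ierr, comm_shared, iprocsh
  call MPI_Init(ierr)
  ! Split MPI_COMM_WORLD into one communicator per shared-memory node
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, &
                           0, MPI_INFO_NULL, comm_shared, ierr)
  ! The rank within that communicator is the node-local rank,
  ! suitable for per-node device selection
  call MPI_Comm_rank(comm_shared, iprocsh, ierr)
  print *, 'node-local rank:', iprocsh
  call MPI_Finalize(ierr)
end program local_rank
```

Because each node gets its own communicator, the local rank always starts at 0 on every node, which is why it maps cleanly onto per-node GPU ids without any OpenACC API calls.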
Another question I have though is: what is the behavior of !$acc set device_num(i) if i is not a valid device number?
For example, let's say I have 4 GPUs on a system, but run the code with 8 MPI ranks. The shared communicator ranks will be 0->7, but the valid device numbers are only 0->3. What happens with !$acc set device_num(5)?
On a test I did with 1 GPU and 2 MPI ranks, it looks like the device number defaulted to device 0 and did indeed correctly oversubscribe the GPU (I got the right answer). Will it do this (set to 0) for any invalid device number, or will it do a mod with the total number of devices (thus distributing the ranks evenly)?
Thanks!
- Ron
Hi Ron,
In the above code snippet, I used a "mod" operation to effectively round-robin the device assignment. If more ranks are used than there are available devices, then multiple ranks will be assigned to the same GPUs.
The OpenACC standard doesn't specify what happens when set device_num is called with a GPU id that doesn't exist. Our OpenACC runtime will round-robin (i.e. with one GPU and 2 ranks, setting device num=1 and 2 will have both ranks use GPU 0). Though I can't guarantee other implementations wouldn't give an error, so you may want to try GNU and/or Cray before deciding what you should do. Using the mod operation with the number of devices should work no matter the compiler.
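As a concrete illustration of the mod mapping (a standalone sketch, not part of POT3D; the device count of 4 is assumed), ranks 0-7 map to devices 0,1,2,3,0,1,2,3:

```fortran
! Sketch: round-robin rank-to-device mapping via mod.
! In real code, ndev would come from acc_get_num_devices.
program roundrobin
  implicit none
  integer :: rank
  integer, parameter :: ndev = 4   ! assumed: 4 GPUs on the node
  do rank = 0, 7
     ! mod(rank, ndev) cycles 0,1,2,3,0,1,2,3 for ranks 0-7
     print '(a,i0,a,i0)', 'rank ', rank, ' -> device ', mod(rank, ndev)
  end do
end program roundrobin
```

Since mod never produces a value outside 0..ndev-1, this mapping stays valid on any implementation, regardless of how a given runtime treats out-of-range device numbers.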
-Mat
Hi,
OK that makes sense.
We provide instructions with our code to run it with the number of MPI ranks per node equal to the number of GPUs per node, so that device selection based on the MPI shared-communicator rank works correctly.
We want to avoid using the API for compatibility reasons: we do not currently preprocess the code, so adding IFDEFs would require a good number of build-script changes in the multiple packages where the code resides.
Is there any directive-only way to select the device in the way you show?
Could one be added in the future?
â Ron
The problem is that the code needs to call acc_get_num_devices, and there isn't a way to return a value from a pragma.
Feel free to send a request to the OpenACC technical committee. They may be able to standardize the behavior of set device.
Hello,
I am trying to use nsight to try to profile the pot3d code, but I am having problems with the managed memory version. I use the following command:
nsys profile --stats=true mpiexec -np 4 ./pot3d
I find that profiling works correctly when managed memory is not turned on, but when I compile with managed memory and try to do an Nsight profile, the code fails with "Aborted (core dumped)". The code runs and I get the right answer, but the profile output is nowhere to be found.
Any help would be appreciated.
- Miko
Sorry you're having issues Miko, but unfortunately I don't know what's wrong. I just tried with the POT3D version I have and it seemed to work correctly.
You can try updating Nsight Systems (I used 2021.5) to see if it's something they've fixed (NVIDIA Nsight Systems | NVIDIA Developer)?
Hi,
I just want to clarify that this only happens when the code is compiled with "-gpu=managed". I currently have "NVIDIA Nsight Systems version 2021.4.1.73-08591f7".
- Miko
The version of Nsight you are using has a known bug, it will crash at the end as you reported.
Get NsightSystems-linux-public-2021.5.1.118-f89f9cd.run from https://developer.download.nvidia.com/devtools/nsight-systems/.
I updated my Nsight version, and it did the trick.
Thank you.