Performance Issue / End of Program Dump using Stdpar

I was exploring the use of “do concurrent” on the GPU and encountered a couple of strange code behaviors.
I attached the sample code; notice that there are no “do concurrent” statements.

example05f_f.txt (2.1 KB)

nvfortran -g -stdpar=gpu -acc=gpu -cuda -Minfo=all -o example5f.exe example05f.f

When I compile with -stdpar=gpu (by accident):

  1. Code runs much slower than without this compilation flag
  2. I get a table dump at the end (but no dump when excluding this compilation flag) (no idea what the table dump is telling me)
  3. I get similar behavior when swapping out “acc parallel loop” with actual “do concurrent(1:n)”

Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 8.9, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x424740 device:0x76c52b200200 size:288 presentcount:2+1 line:-1 name:_process_memory_16
deleted block device:0x76c52aefa000 size:512 threadid=1
FATAL ERROR: data in update host clause was not found on device 1: name=a(:)
file:/home/don/OpenACC/example05f.f simple line:37

Hi Don,

It looks like a problem with using both the “pinned” attribute and CUDA Unified Memory. Since Fortran STDPAR doesn’t have a notion of managing data, CUDA Unified Memory is enabled by default. However, the program now has conflicting directions about which memory the data should be allocated in. I suspect it’s getting allocated in pinned memory, but UM is having to copy the data back and forth between pinned and unified memory, causing the slow-down.

I’m not sure it’s valid to use pinned with UM, but just in case, I added an issue report (TPR#36637). Even if it’s not valid, we should document it.

I’d recommend you remove “pinned” anyway and instead use the flag “-gpu=mem:separate:pinnedalloc”. “separate” says to disable UM, so the program will handle data movement (via the directives in this case), while “pinnedalloc” will allocate the host array in pinned memory.
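To make the suggestion concrete, here is a minimal sketch of what that setup could look like. It is not the attached example05f.f: the program name, array size, and loop body are hypothetical, and the build line in the comment simply reuses the flag mentioned above.

```fortran
! Hypothetical sketch, NOT the attached example05f.f.
! Assumed build line (nvfortran 24.5 or later):
!   nvfortran -acc=gpu -gpu=mem:separate:pinnedalloc -Minfo=all sketch_acc.f90
program sketch_acc
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  ! Plain allocatable: no "pinned" attribute in the source.
  ! With pinnedalloc, the compiler flag requests pinned host memory instead.
  real, allocatable :: a(:)

  allocate(a(n))

  ! With mem:separate, the program manages device data itself.
  !$acc enter data create(a)

  !$acc parallel loop present(a)
  do i = 1, n
     a(i) = 2.0 * real(i)
  end do

  ! Copy the results back to the host before using them there.
  !$acc update host(a)
  print *, a(1), a(n)

  !$acc exit data delete(a)
  deallocate(a)
end program sketch_acc
```

The point of the sketch is that the source says nothing about where the host array lives; that choice moves to the command line, so it cannot conflict with the memory mode.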

If you do move to using DO CONCURRENT and remove the OpenACC data directives, then you will need to remove this flag and instead rely on UM.
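For comparison, a DO CONCURRENT version of the same kind of loop might look like the sketch below. Again, the program name, array size, and loop body are hypothetical; the only assumption about the build is -stdpar=gpu with no “mem” sub-option, so that UM handles the data.

```fortran
! Hypothetical sketch, NOT the attached example05f.f.
! Assumed build line:
!   nvfortran -stdpar=gpu -Minfo=all sketch_dc.f90
program sketch_dc
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  ! No "pinned" attribute and no OpenACC data directives:
  ! data movement is left to CUDA Unified Memory.
  real, allocatable :: a(:)

  allocate(a(n))

  ! Each iteration is independent, so the loop can be offloaded.
  do concurrent (i = 1:n)
     a(i) = 2.0 * real(i)
  end do

  ! The host reads the array directly; there is no update directive.
  print *, a(1), a(n)

  deallocate(a)
end program sketch_dc
```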

-Mat

I was using the pinned attribute because I was seeing a very long lag before output was printed, owing to the time needed to transfer arrays back to the host; pinned memory is faster, and managed memory seems even faster.

Can you expound on the “-gpu=mem:separate:pinnedalloc” flag? I am using nvfortran 24.3 and am having issues with the specific compiler options mentioned. In lieu of separate, I tried nounified, but I am unsure whether it has the same effect. Thanks.

Can you expound on the "-gpu=mem:separate:pinnedalloc” ? I am using nvfortran 24.3

Yes, with the addition of full unified memory, the number of different memory-control options was getting to be a bit too much, so in 24.5 they simplified and combined them under a single “mem” option. See: HPC Compilers User's Guide Version 24.9 for ARM, OpenPower, x86

For earlier releases, the equivalent of this set of options would be “-gpu=pinned -gpu=nomanaged”.