Is unified memory (-gpu=managed) supported for OpenMP offloading (-mp=gpu)?

I am experimenting with the OpenMP target feature under nvfortran (23.7-0 64-bit target on x86-64 Linux -tp zen3). Now I am wondering whether the -gpu=managed option is compatible with -mp=gpu. I know it works well with OpenACC, but not sure about OpenMP offloading. Thanks!

Now I am wondering whether the -gpu=managed option is compatible with -mp=gpu.

Yes. All the sub-options to -gpu are shared across the programming models, with managed being the default for standard language parallelism and opt-in for OpenACC and OpenMP.

Thank you. That is good to know.
I do see -gpu=managed as the default when selecting -stdpar, just as you say.

These questions came up in the context of comparing OpenACC vs. OpenMP vs. standard-language DO CONCURRENT offloading approaches. I have a simple stencil calculation inside a 200x400 double loop. For some reason I cannot get OpenMP offloading results that are anywhere close to what I get with OpenACC or DO CONCURRENT. I am new to OpenMP offloading, but I figured that for a simple loop like this the translation from OpenACC should be straightforward… but apparently I am missing something.

The baseline performance of the compute-intensive code I am investigating, executed on a single EPYC Milan core, is ~111s (with nvfortran -O2 -fast).

When instrumenting the two most expensive double loops with OpenACC, and offloading to a single A100 device, the time goes down to ~28s. This is without managed memory. Setting -gpu=managed, the time actually increases to ~90s, which I don’t fully understand either. However, I am not worried about this too much, because this is a very artificial test. In the fully instrumented OpenACC code I get very comparable (and much better) performance, either by manually ensuring data locality on the device via acc data directives or by using the managed memory option.

Just focusing on the offloading of the two isolated double loops though, I also tested with -stdpar=gpu, and I find that with the default managed memory enabled, I get ~91s. When manually disabling managed memory via -gpu=nomanaged I find ~30s, confirming that those loops, when offloaded via DO CONCURRENT, give performance comparable to OpenACC. So far so good.
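For readers who want something concrete to compare against, here is a minimal sketch of what such an isolated double loop might look like with DO CONCURRENT. The program name, array names, and the four-point averaging stencil are assumptions made for illustration (the actual loops were not posted); only the 200x400 extent and the compiler flags in the comment come from the description above.

```fortran
! Hypothetical file name; flags are the ones mentioned in the post:
!   nvfortran -stdpar=gpu -gpu=nomanaged stencil_dc.f90
program stencil_dc
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: nx = 200, ny = 400   ! loop extents from the post
  real(real64), allocatable :: a(:,:), b(:,:)
  integer :: i, j

  allocate(a(nx,ny), b(nx,ny))
  call random_number(a)
  b = a                                      ! boundary values are carried over unchanged

  ! Both loop indices go into one DO CONCURRENT header, so the whole
  ! 2-D iteration space is exposed to the compiler at once.
  do concurrent (j = 2:ny-1, i = 2:nx-1)
     b(i,j) = 0.25_real64 * (a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1))
  end do

  print *, 'max change in interior:', maxval(abs(b - a))
end program stencil_dc
```

Dropping -gpu=nomanaged from the compile line should give the default managed-memory behavior for -stdpar described above.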

Now when trying with OpenMP for the exact same two double loops, I observe much worse performance! Without managed memory: ~190s; with managed memory: ~188s. Basically the same, and much worse than a single CPU core, OpenACC, or DO CONCURRENT offloading.

I realize that without actual code it is going to be difficult to diagnose. And if there is interest, I can certainly provide a lot more details. But first I wanted to see if there are maybe some high level red-flags already with what I am describing so far. Thank you again!!

Hi gjt,

Having an example would be helpful, otherwise I’m just guessing.

It is unclear why OpenACC performance differs with and without managed memory. Typically the runtime should be about the same. Also, DO CONCURRENT uses managed memory by default, so I would expect the same issue to appear there, but it does not.

Given that a 200x400 stencil operation is fairly small, I’m assuming you’re running many iterations?

If so, then my best guess is that something in the code is triggering the data to be copied back and forth between host and device at each iteration, while you’re not copying the data when manually managing the data. Maybe you’re printing something on the host?
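To make the "manually managing the data" case concrete, here is a minimal sketch, assuming a hypothetical four-point stencil, made-up array names, and a short time loop. A target data region wrapped around the iteration loop keeps the arrays resident on the device, so the target constructs inside it should not trigger host/device transfers at every iteration.

```fortran
program stencil_data_region
  use iso_fortran_env, only: real64
  implicit none
  ! nx, ny follow the post; nsteps is an arbitrary small value for illustration.
  integer, parameter :: nx = 200, ny = 400, nsteps = 100
  real(real64), allocatable :: a(:,:), b(:,:)
  integer :: i, j, step

  allocate(a(nx,ny), b(nx,ny))
  a = 0.0_real64
  a(nx/2, ny/2) = 1.0_real64
  b = a

  ! Map once, outside the time loop. Arrays already present on the device
  ! are not copied again by the target constructs inside the region.
  !$omp target data map(tofrom: a) map(to: b)
  do step = 1, nsteps
     !$omp target teams loop collapse(2)
     do j = 2, ny-1
        do i = 2, nx-1
           b(i,j) = 0.25_real64 * (a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1))
        end do
     end do
     !$omp target teams loop collapse(2)
     do j = 2, ny-1
        do i = 2, nx-1
           a(i,j) = b(i,j)
        end do
     end do
  end do
  !$omp end target data
  ! a is copied back to the host here, at the end of the data region.

  print *, 'sum(a) =', sum(a)
end program stencil_data_region
```

Moving a print of a(…) inside the time loop would require a target update from(a) at that point, which is the kind of per-iteration traffic the guess above refers to.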

For OpenMP, are you using “target teams distribute parallel do” or “target teams loop”?

“loop” is more OpenACC like in that it’s more restrictive on the constructs allowed inside it. This then allows the compiler to make better compile time decisions often leading to better optimized code.

Are you collapsing the loops? With “distribute”, the compiler is only going to do what you tell it to do. With “loop” and OpenACC, the compiler can usually do more analysis and might be finding more opportunities such as auto-collapsing or implicitly parallelizing inner loops.
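As a rough illustration of the two spellings, the sketch below applies each directive to the same loop nest. The module, subroutine names, and the stencil body are hypothetical; collapse(2) is written explicitly in both so that the comparison does not depend on what the compiler decides to do on its own.

```fortran
module stencil_variants
  use iso_fortran_env, only: real64
  implicit none
contains

  ! Prescriptive form: the compiler does what the directive spells out.
  subroutine smooth_distribute(a, b, nx, ny)
    integer, intent(in) :: nx, ny
    real(real64), intent(in) :: a(nx,ny)
    real(real64), intent(inout) :: b(nx,ny)
    integer :: i, j
    !$omp target teams distribute parallel do collapse(2) map(to: a) map(tofrom: b)
    do j = 2, ny-1
       do i = 2, nx-1
          b(i,j) = 0.25_real64 * (a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1))
       end do
    end do
  end subroutine smooth_distribute

  ! Descriptive form: "loop" leaves the scheduling decisions to the compiler.
  subroutine smooth_loop(a, b, nx, ny)
    integer, intent(in) :: nx, ny
    real(real64), intent(in) :: a(nx,ny)
    real(real64), intent(inout) :: b(nx,ny)
    integer :: i, j
    !$omp target teams loop collapse(2) map(to: a) map(tofrom: b)
    do j = 2, ny-1
       do i = 2, nx-1
          b(i,j) = 0.25_real64 * (a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1))
       end do
    end do
  end subroutine smooth_loop

end module stencil_variants

program compare_variants
  use iso_fortran_env, only: real64
  use stencil_variants
  implicit none
  integer, parameter :: nx = 200, ny = 400
  real(real64), allocatable :: a(:,:), b1(:,:), b2(:,:)

  allocate(a(nx,ny), b1(nx,ny), b2(nx,ny))
  call random_number(a)
  b1 = a
  b2 = a
  call smooth_distribute(a, b1, nx, ny)
  call smooth_loop(a, b2, nx, ny)
  print *, 'max difference between variants:', maxval(abs(b1 - b2))
end program compare_variants
```

Compiling with -mp=gpu -Minfo=mp should print what the compiler generated for each loop, which is a quick way to see whether the two forms end up scheduled differently.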

-Mat

Hi Mat,
My DO CONCURRENT results are consistent with the OpenACC results. I just had to explicitly turn off managed memory via -gpu=nomanaged in order to get the same behavior. Plus, if I turn managed memory on explicitly for OpenACC, then I get the same performance as DO CONCURRENT gives by default. So these results are completely consistent.

I do run many (18,000) iterations of the 200x400 stencil operation as you suspected. I do expect quite a penalty due to data movements for this experiment, just that I was expecting it to be comparable between all three offloading techniques, but found that OpenMP seemed an outlier… but I think I just found the issue with that… and now I see also very comparable results for the OpenMP case:

It boils down to the fact that I was using the Cray ftn compiler wrapper, and apparently when using ftn to link, it works even when not specifying -mp=gpu during linking! It even produced code that was offloaded to the GPU (I monitor with nvidia-smi dmon). Interestingly though, the code was slow and never got the GPU above the base clock, while with OpenACC and DO CONCURRENT I was getting code that ran at the boost clock (and was also just a lot faster in the timings observed).

Anyway, I finally ran some tests where I compiled and linked using the nvfortran front-end directly, and it complained when I did not specify -mp=gpu during linking of object files that were compiled with that option! That was when I noticed I needed it for linking, and when I added it, I ended up with code whose performance was practically identical to OpenACC or DO CONCURRENT! I then switched back to ftn, and was able to produce the same results as long as I specified -mp=gpu at both compile and link time.
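For anyone hitting something similar, a small sanity check along these lines may help when compiling and linking in separate steps. The file and program names are hypothetical, and the build commands in the comments simply repeat -mp=gpu on both steps as described above. Note that this only reports whether a target region runs on a device; it would not by itself have revealed the slow-but-offloaded case from this thread.

```fortran
! Two-step build with -mp=gpu on BOTH the compile and the link line
! (hypothetical file name):
!   nvfortran -mp=gpu -c check_offload.f90
!   nvfortran -mp=gpu check_offload.o -o check_offload
program check_offload
  use omp_lib
  implicit none
  logical :: on_host

  on_host = .true.
  print *, 'devices visible to OpenMP:', omp_get_num_devices()

  ! omp_is_initial_device() returns .false. when the region runs on a device.
  !$omp target map(tofrom: on_host)
  on_host = omp_is_initial_device()
  !$omp end target

  if (on_host) then
     print *, 'target region ran on the host'
  else
     print *, 'target region ran on a device'
  end if
end program check_offload
```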

So mystery solved. And yes, I was initially using target teams distribute parallel do, but replaced it with target teams loop. Thank you for that tip!

-Gerhard

Yes, the -gpu=managed option is compatible with -mp=gpu in nvfortran when using OpenMP offloading for GPUs. This option enables managed memory allocation for GPU arrays and works alongside OpenMP GPU directives to facilitate GPU memory management. You can use it in conjunction with OpenMP GPU directives to manage GPU memory in your code.