The OpenACC specification (version 1.0 and later) says that when the acc_on_device routine is called with a compile-time constant argument, it evaluates at compile time to a constant.
However, the NVHPC OpenACC compiler (nvc++ V22.11) does not convert the acc_on_device routine with a compile-time constant argument to a constant at compile time.
Engineering took a look but unfortunately it’s not possible to use “acc_on_device” within a constexpr. The problem has to do with phasing. Constexpr is evaluated by the front-end, but “acc_on_device” can’t be evaluated until back-end compilation, when code is generated for the device, the host, or both.
“acc_on_device” is a compile time constant and should work as expected if removed from the constexpr.
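To make that concrete, here is a minimal sketch of the plain `if` form (no constexpr). The names `where_am_i` and `run_on_default_device` are made up for illustration, and the `#else` branch is only a stand-in so the file also builds with a compiler that has OpenACC disabled; it is not part of the OpenACC API.

```cpp
#ifdef _OPENACC
#include <openacc.h>
#else
// Stand-in for builds without OpenACC (an assumption of this sketch):
// everything runs on the host, so the host query always answers "yes".
enum acc_device_t { acc_device_host, acc_device_not_host };
static inline int acc_on_device(acc_device_t dev) {
    return dev == acc_device_host;
}
#endif

// Returns 1 when executing on the host and 0 when executing on a device.
// A plain `if` is used; the back-end is expected to fold the condition
// separately for the host copy and the device copy of this routine.
#pragma acc routine seq
int where_am_i() {
    if (acc_on_device(acc_device_host)) {
        return 1;  // host path
    } else {
        return 0;  // device path
    }
}

// Calls where_am_i() from inside a compute region on the default device.
int run_on_default_device() {
    int result = -1;
    #pragma acc parallel num_gangs(1) copy(result)
    {
        result = where_am_i();
    }
    return result;
}
```

Calling `where_am_i()` directly from host code takes the host path. What `run_on_default_device()` returns depends on the target the file was built for (for example `-acc=gpu` versus `-acc=host`).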
Thank you for the explanation.
Then, is there any nvc++ option that forces the compiler to remove the not-taken path of an if-else statement at compile time? In the above example, the device version of foo() will not execute the if-path (printf("host\n");) since acc_on_device(acc_device_host) in the device version of foo() will be evaluated to false at compile time.
In the back-end, “acc_on_device” does get turned into either true or false depending on whether it’s generating device or host code. Then dead code elimination will remove either the true or false part of the conditional. No additional flags required.
For example, here’s the generated CUDA code. As you can see, there’s no if condition used:
Note that the old CUDA code generator (i.e. -gpu=nollvm,keep) isn’t supported any longer. It’s just convenient to show this. You can see the same thing in the generated LLVM code as well (i.e. -gpu=keep), it’s just harder to read.
I verified the same behavior on a local test too; thanks.
The reason I asked about the behavior of if constexpr(acc_on_device(acc_device_host)) … is that I wanted the compiler front-end to eliminate the if-path code before the device-specific back-end generates the device code.
For example, in the example below, hostfunc() is defined in an external library that works only on the host, and thus the OpenACC back-end will not be able to handle it properly.
Is there a way for nvc++ to eliminate certain code in the front-end pass depending on whether it will be executed on the host or device?
(Based on your previous explanation, it seems not, since the nvc++ front-end is oblivious to the device-specific behaviors in the back-end.)
It still does this, but just does it in the back-end. You shouldn’t need constexpr.
Granted, this is a simple case, but I re-wrote your example to use a library for the host call and it works fine. Did you try this with your real code? If you tried and encountered an issue, can you post a reproducing example?
However, the execution of the compiled binary fails with the following messages:
Current file: /home/f6l/readme/SCRIPT/Examples/OpenACCExample/atomic_fetch_add.cpp
function: _Z10testAtomicIjEvT_
line: 34
This file was compiled: -acc=gpu -gpu=cc90 -acc=host or -acc=multicore
Rebuild this file with -gpu=cc90 to use NVIDIA Tesla GPU 0
No error occurs when compiled with -O0 or -O1 option.
(I tested with NVHPC 24.5 and 24.9)
How can I fix this issue so that I can compile and run the example with -O2 or -O3?
In fact, this example is a simplified version of a real application, which is more complex and heavily templated with multiple-nested template functions. The original application fails during compilation regardless of optimization level (from -O0 to -O3), but I don’t know how to create a simplified reproducer.
I was able to recreate the issue here and filed problem report #37598. I’ve asked engineering to take a look.
My best guess is that at -O2, dead code elimination is getting applied, so the else case is getting removed. However, since the compiler has implicitly offloaded the “host_atomic_fetch_add” routine, this is causing a problem with the device code generation.
In this particular case, given that host_atomic_fetch_add can be offloaded, there isn’t a need to have two versions of this routine. You can have the single routine be used on both the host and device. It doesn’t really solve the issue, but it could be a workaround.
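As a rough illustration of the single-routine idea, the sketch below uses one fetch-and-add for both host and device, built on `#pragma acc atomic capture`. The names `shared_fetch_add` and `count_with_atomics` are hypothetical and are not taken from the original example. If the file is compiled without OpenACC, the pragmas are ignored and the update is an ordinary, non-atomic add.

```cpp
// One fetch-and-add routine shared by host and device code.
// Returns the value stored at *addr before val was added.
#pragma acc routine seq
unsigned int shared_fetch_add(unsigned int* addr, unsigned int val) {
    unsigned int old;
    // Capture the old value and update the target as one atomic operation.
    #pragma acc atomic capture
    {
        old = *addr;
        *addr += val;
    }
    return old;
}

// Increments a shared counter n times from a parallel loop and returns it.
unsigned int count_with_atomics(int n) {
    unsigned int counter = 0;
    #pragma acc parallel loop copy(counter)
    for (int i = 0; i < n; ++i) {
        shared_fetch_add(&counter, 1u);
    }
    return counter;
}
```

Because there is no `acc_on_device` branch, this form does not depend on the back-end removing a host-only path, which is the part that appears to go wrong at -O2. It does not cover cases where the host and device really need different APIs.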
Thank you for this prompt response.
For now, the original application uses the single version approach for both the host and the device, but the single routine approach has several corner cases not covered. That’s why I want to find a solution to wrap both the host API and the device API.
By the way, one interesting behavior that I found is that if I change the atomic_fetch_add() implementation in the above example from