acc_on_device routine with a compile-time constant argument is not evaluated to a constant at compile time

The OpenACC specification (version 1.0 and later) says that if the acc_on_device routine has a compile-time constant argument, it evaluates at compile time to a constant.
However, the NVHPC OpenACC compiler (nvc++ V22.11) does not evaluate a call to acc_on_device with a compile-time constant argument to a constant at compile time.

The example below fails when compiled by NVHPC:

#include <openacc.h>
#include <stdio.h>

#pragma acc routine seq
void foo() {
  if constexpr(acc_on_device(acc_device_host)) 
    printf("host\n");
  else
    printf("not host\n");
}

int main() {
  #pragma acc parallel num_gangs(1)
  {
    foo();
  }
  foo();
  return 0;
}

Thanks for the report, Seyong. I recreated the issue here and have filed a report (TPR #34142). We’ll have engineering investigate.

-Mat

Hi Seyong,

Engineering took a look, but unfortunately it’s not possible to use “acc_on_device” within a constexpr. The problem is one of phasing: constexpr is evaluated by the front-end, but “acc_on_device” can’t be evaluated until back-end compilation, when code is generated for the device, the host, or both.

“acc_on_device” is still a compile-time constant and should work as expected if removed from the constexpr, i.e. used in a plain “if” rather than “if constexpr”.
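The phasing point can be illustrated in standard C++ without any OpenACC at all. This is only a minimal sketch: on_host_stub() is a hypothetical stand-in for acc_on_device(acc_device_host), and the other names are made up for illustration. It shows that “if constexpr” needs a value the front-end already knows, while a plain “if” accepts a condition that is only resolved later.

```cpp
// Hypothetical stand-in for acc_on_device(acc_device_host). It is an
// assumption for illustration only: like the real routine under nvc++,
// it is not something the front-end can treat as a constant expression.
inline bool on_host_stub() { return true; }

// A value the front-end does know, which is what "if constexpr" requires.
constexpr bool kKnownToFrontEnd = true;

const char* with_plain_if() {
  // if constexpr (on_host_stub())  // rejected: not a constant expression
  if (on_host_stub())               // accepted: resolved after the front-end
    return "host";
  return "not host";
}

const char* with_if_constexpr() {
  if constexpr (kKnownToFrontEnd)   // accepted: constant in the front-end
    return "host";
  else
    return "not host";
}
```

The plain “if” version is the form to use with the real acc_on_device call. The sketch needs C++17 for “if constexpr”.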

-Mat

Thank you for the explanation.
Then, is there any nvc++ option that forces the compiler to remove the not-taken path of an if-else statement at compile time? In the above example, the device version of foo() will not execute the if-path (printf("host\n");) since acc_on_device(acc_device_host) in the device version of foo() will be evaluated to false at compile time.

In the back-end, “acc_on_device” does get turned into either true or false depending on whether it’s generating device or host code. Dead code elimination then removes either the true or the false part of the conditional. No additional flags are required.

For example, here’s the generated CUDA code. As you can see, there’s no if condition used:

% cat test.cpp
#include <openacc.h>
#include <stdio.h>

#pragma acc routine seq
void foo() {
  if (acc_on_device(acc_device_host))
    printf("host\n");
  else
    printf("not host\n");
}

int main() {
  #pragma acc parallel num_gangs(1)
  {
    foo();
  }
  foo();
  return 0;
}
% nvc++ -acc test.cpp -gpu=nollvm,keep
% cat test.n001.gpu
#include "cuda_runtime.h"
#include "nvhpc_cuda_runtime.h"
#include "test.n001.h"
extern "C" __device__ void
_Z3foov(
)
{
printf((const char*)"not host\n"); /* lilix:6 */
}
extern "C" __global__ __launch_bounds__(1) void
_8test_cpp_main_14_gpu(
)
{
_Z3foov(); /* lilix:2 */
}

Note that the old CUDA code generator (i.e. -gpu=nollvm,keep) isn’t supported any longer. It’s just convenient for showing this. You can see the same thing in the generated LLVM code as well (i.e. -gpu=keep); it’s just harder to read.

I verified the same behavior on a local test too; thanks.

The reason I asked about the behavior of if constexpr(acc_on_device(acc_device_host)) … is that I wanted the compiler front-end to eliminate the if-path code before the device-specific back-end generates the device code.
In the example below, hostfunc() is defined in an external library that works only on the host, and thus the OpenACC back-end will not be able to handle it properly.

#pragma acc routine seq
void foo() {
  if (acc_on_device(acc_device_host)) 
    hostfunc();
  else
    devicefunc();
}

Is there a way for nvc++ to eliminate certain code in the front-end pass depending on whether it will be executed on the host or device?
(Based on your previous explanation, it seems not, since the nvc++ front-end is unaware of the device-specific behavior in the back-end.)

It still does this, but just does it in the back-end. You shouldn’t need constexpr.

Granted, this is a simple case, but I re-wrote your example to use a library for the host call and it works fine. Did you try this with your real code? If you tried and encountered an issue, can you post a reproducing example?

Here’s the simple test:

% cat test.cpp
#include <openacc.h>
#include <stdio.h>


#pragma acc routine seq
void foo_dev() {
    printf("not host\n");
}

void foo_host();

#pragma acc routine seq
void foo() {
  if (acc_on_device(acc_device_host))
     foo_host();
  else
     foo_dev();
}

int main() {
  #pragma acc parallel num_gangs(1)
  {
    foo();
  }
  foo();
  return 0;
}
% cat test2.cpp
#include <openacc.h>
#include <stdio.h>

void foo_host() {
    printf("host\n");
}

% nvc++ -c test2.cpp -fpic
% nvc++ -shared -o libtest2.so test2.o
% nvc++ -acc test.cpp -L./ -ltest2
% a.out
not host
host

Let’s revisit this issue; here is a more complicated example:
atomic_fetch_add.cpp.txt (1.4 KB)

I can compile this example using the following command without a compile-time error:

nvc++ -acc -Minfo=all -O3 atomic_fetch_add.cpp -o atomic_fetch_add

However, the execution of the compiled binary fails with the following messages:

Current file: /home/f6l/readme/SCRIPT/Examples/OpenACCExample/atomic_fetch_add.cpp
function: _Z10testAtomicIjEvT_
line: 34
This file was compiled: -acc=gpu -gpu=cc90 -acc=host or -acc=multicore
Rebuild this file with -gpu=cc90 to use NVIDIA Tesla GPU 0

No error occurs when compiled with the -O0 or -O1 option.
(I tested with NVHPC 24.5 and 24.9.)
How can I fix this issue so that I can compile and run the example with -O2 or -O3?

In fact, this example is a simplified version of a real application, which is more complex and heavily templated, with multiply nested template functions. The original application fails during compilation regardless of optimization level (from -O0 to -O3), but I don’t know how to create a simplified reproducer for that failure.

Thanks for the follow-up, Seyong.

I was able to recreate the issue here and filed problem report #37598. I’ll have engineering take a look.

My best guess is that at -O2, dead code elimination is getting applied, so the else case is getting removed. However, since the compiler has implicitly offloaded the “host_atomic_fetch_add” routine, this is causing a problem with the device code generation.

In this particular case, given that host_atomic_fetch_add can be offloaded, there isn’t a need to have two versions of this routine. You can have the single routine be used on both the host and the device. That doesn’t really solve the issue, but it could be a workaround.
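As a rough illustration of the single-routine idea, here is a minimal sketch that is not taken from the attached file. The name fetch_add_both is hypothetical, and the sketch assumes the routine only needs an integer fetch-and-add, which it expresses with the OpenACC “atomic capture” construct so that one definition serves both host and device.

```cpp
// Hypothetical single-version helper (the name is not from the attachment).
// Returns the value *dest held before the addition.
#pragma acc routine seq
unsigned int fetch_add_both(unsigned int* dest, unsigned int value) {
  unsigned int old;
  // Capture the old value and update *dest as one atomic operation.
  // If OpenACC is not enabled, the pragma is ignored and this becomes a
  // plain, non-atomic read-modify-write.
  #pragma acc atomic capture
  {
    old = *dest;
    *dest += value;
  }
  return old;
}
```

Because there is no acc_on_device branch, there is no host-only or device-only path for dead code elimination to remove. Whether this covers what the real application needs is a separate question.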

-Mat

Thank you for this prompt response.
For now, the original application uses the single-version approach for both the host and the device, but that approach leaves several corner cases uncovered. That’s why I want to find a solution that wraps both the host API and the device API.
By the way, one interesting behavior that I found is that if I change the atomic_fetch_add() implementation in the above example from

    if( acc_on_device(acc_device_not_host) ) { 
        return device_atomic_fetch_add(dest, value);
    } else {
        return host_atomic_fetch_add(dest, value);
    }   

to

    if( acc_on_device(acc_device_not_host) ) { 
        return device_atomic_fetch_add(dest, value);
    }
    if( acc_on_device(acc_device_host) ) {
        return host_atomic_fetch_add(dest, value);
    }
    return NULL;

no error occurred, but this approach does not work for the original application.

Hi Seyong,

FYI, TPR #37598 was fixed in our 25.11 release.

-Mat