OpenMP offload with -gpu=nordc doesn't launch kernels on GPU

Hello,

We are using OpenACC in our application and are evaluating the OpenMP offload features (NVHPC v21.9). I see that when the -gpu=nordc option is used with OpenMP offload, kernels are not launched on the GPU; the same option works fine with OpenACC. Here is a simple C example:

$ cat bug.c
#include <stdio.h>

int main(void) {
  int N = 100;
  float x[100];
  float y[100];

  for(int i=0; i<N; i++) {   // initialize the input
      x[i] = i;
  }

  // #pragma acc parallel loop copyin(x[0:N]) copyout(y[0:N])
  #pragma omp target teams distribute parallel for map(to: x[0:N]) map(from: y[0:N])
  for(int i = 0; i < N; i++){
      y[i] = 3*x[i];   // y is only copied out, so assign rather than accumulate
  }

  printf("%g\n", y[5]);   // expect 15
  return 0;
}

and compile as:

$ nvc -mp=gpu -gpu=nordc -Minfo=accel,mp bug.c
main:
     10, #omp target teams distribute parallel for
         10, Generating Tesla and Multicore code
             Generating "nvkernel_main_F1L10_1" GPU kernel
         16, Loop parallelized across teams and threads(128), schedule(static)
     10, Generating map(from:y[:N])
         Generating map(to:x[:N])

But when I execute it, I don't see any GPU kernel launch:

$ nvprof ./a.out
==15021== NVPROF is profiling process 15021, command: ./a.out
0
==15021== Profiling application: ./a.out
==15021== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   54.23%  2.4640us         1  2.4640us  2.4640us  2.4640us  [CUDA memcpy DtoH]
                   45.77%  2.0800us         1  2.0800us  2.0800us  2.0800us  [CUDA memcpy HtoD]
      API calls:   81.07%  204.67ms         2  102.34ms     315ns  204.67ms  cuDevicePrimaryCtxRetain
                   10.27%  25.937ms         1  25.937ms  25.937ms  25.937ms  cuMemHostAlloc
                    8.03%  20.276ms         1  20.276ms  20.276ms  20.276ms  cuMemAllocManaged
                    0.37%  939.25us         1  939.25us  939.25us  939.25us  cuMemAllocHost
                    0.08%  205.15us         3  68.382us  6.0140us  191.10us  cuMemAlloc
                    0.06%  155.46us        45  3.4540us     125ns  144.85us  cuDeviceGetAttribute
                    0.05%  133.04us         4  33.260us  28.169us  45.077us  cuDeviceGetName
                    0.01%  32.675us         1  32.675us  32.675us  32.675us  cuMemcpyHtoDAsync
                    0.01%  24.818us         1  24.818us  24.818us  24.818us  cuMemcpyDtoHAsync
                    0.01%  17.864us         2  8.9320us  6.9090us  10.955us  cuStreamSynchronize
                    0.00%  11.425us         4  2.8560us  1.0110us  7.3710us  cuDeviceGetPCIBusId
                    0.00%  11.071us         1  11.071us  11.071us  11.071us  cuStreamCreate
                    0.00%  10.965us         2  5.4820us  3.9420us  7.0230us  cuPointerGetAttributes
                    0.00%  7.9440us         1  7.9440us  7.9440us  7.9440us  cuEventRecord
                    0.00%  6.3070us         2  3.1530us     923ns  5.3840us  cuEventCreate
                    0.00%  4.6700us         3  1.5560us     392ns  3.3220us  cuCtxSetCurrent
                    0.00%  4.5810us        12     381ns     116ns  2.5750us  cuDeviceGet
                    0.00%  3.9870us         1  3.9870us  3.9870us  3.9870us  cuEventSynchronize
                    0.00%  1.7530us         1  1.7530us  1.7530us  1.7530us  cuCtxGetCurrent
                    0.00%  1.3970us         4     349ns     184ns     601ns  cuDeviceGetCount
                    0.00%     709ns         1     709ns     709ns     709ns  cuInit
                    0.00%     706ns         4     176ns     123ns     320ns  cuDeviceComputeCapability
                    0.00%     131ns         1     131ns     131ns     131ns  cuDriverGetVersion
 OpenACC (excl):   99.13%  26.128ms         1  26.128ms  26.128ms  26.128ms  acc_enter_data@bug.c:10
                    0.25%  67.000us         1  67.000us  67.000us  67.000us  acc_exit_data@bug.c:10
                    0.25%  66.977us         1  66.977us  66.977us  66.977us  acc_enqueue_download@bug.c:18
                    0.17%  43.732us         1  43.732us  43.732us  43.732us  acc_enqueue_upload@bug.c:10
                    0.09%  22.474us         1  22.474us  22.474us  22.474us  acc_wait@bug.c:18
                    0.06%  15.976us         1  15.976us  15.976us  15.976us  acc_wait@bug.c:10
                    0.05%  12.263us         1  12.263us  12.263us  12.263us  acc_device_init@bug.c:10
                    0.00%       0ns         2       0ns       0ns       0ns  acc_delete@bug.c:18
                    0.00%       0ns         2       0ns       0ns       0ns  acc_create@bug.c:10
                    0.00%       0ns         2       0ns       0ns       0ns  acc_alloc@bug.c:10

==15021== Unified Memory profiling result:
Total CPU Page faults: 1
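
(As a sanity check, a small probe along these lines should tell the failure modes apart: with map(tofrom:), it prints -1 if the target region body never executes at all, 1 if the region falls back to the host, and 0 if it really runs on the device. Given that the trace above shows the data transfers but no kernel, I would expect -1 here with -gpu=nordc.)

$ cat probe.c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int where = -1;                       // stays -1 if the region body never runs
    #pragma omp target map(tofrom: where)
    {
        where = omp_is_initial_device();  // 1 = host fallback, 0 = device
    }
    printf("where = %d\n", where);
    return 0;
}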

Without -gpu=nordc, I see the expected output:

$ nvc -mp=gpu -Minfo=accel,mp bug.c
main:
     10, #omp target teams distribute parallel for
         10, Generating Tesla and Multicore code
             Generating "nvkernel_main_F1L10_1" GPU kernel
         16, Loop parallelized across teams and threads(128), schedule(static)
     10, Generating map(from:y[:N])
         Generating map(to:x[:N])

$ nvprof ./a.out
==18506== NVPROF is profiling process 18506, command: ./a.out
15
==18506== Profiling application: ./a.out
==18506== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   43.81%  4.1920us         1  4.1920us  4.1920us  4.1920us  nvkernel_main_F1L10_1
                   33.44%  3.2000us         2  1.6000us  1.3760us  1.8240us  [CUDA memcpy HtoD]
....

With OpenACC, -gpu=nordc works:

$ nvc -acc -gpu=nordc -Minfo=accel,mp bug.c
main:
     10, Generating copyin(x[:N]) [if not already present]
         Generating copyout(y[:N]) [if not already present]
         Generating Tesla code
         16, #pragma acc loop gang, vector(96) /* blockIdx.x threadIdx.x */

$ nvprof ./a.out
==19080== NVPROF is profiling process 19080, command: ./a.out
15
==19080== Profiling application: ./a.out
==19080== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   55.39%  4.9280us         1  4.9280us  4.9280us  4.9280us  main_10_gpu
                   24.11%  2.1450us         1  2.1450us  2.1450us  2.1450us  [CUDA memcpy DtoH]
                   20.50%  1.8240us         1  1.8240us  1.8240us  1.8240us  [CUDA memcpy HtoD]
...

Am I missing something? Or does some other flag or option need to be used together with -mp=gpu -gpu=nordc?

This is a known limitation. I asked engineering about it; they had been working on it, but it got pushed down the priority list. They'll see if they can get it addressed.
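
In the meantime, the workaround is to keep RDC enabled (the default) when compiling with -mp=gpu, i.e. drop -gpu=nordc, as in your second example above, which launches the kernel as expected:

$ nvc -mp=gpu -Minfo=accel,mp bug.c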