OpenMP offload with -gpu=nordc doesn't launch kernels on GPU

Hello,

We are using OpenACC in our application and are evaluating the OpenMP offload features (NVHPC v21.9). I see that when the -gpu=nordc option is used with OpenMP offload, kernels are not launched on the GPU; the same option works fine with OpenACC. Here is a simple C example:

$ cat bug.c
#include <stdio.h>

int main(void) {
  int N = 100;
  float x[100];
  float y[100];

  for(int i=0; i<N; i++) {   // initialize the input
      x[i] = i;
  }

  // #pragma acc parallel loop copyin(x[0:N]) copyout(y[0:N])
  #pragma omp target teams distribute parallel for map(to: x[0:N]) map(from: y[0:N])
  for(int i = 0; i < N; i++){
      y[i] = 3*x[i];   // y is only copied out, so assign rather than accumulate
  }

  printf("%g\n", y[5]);   // expect 15
  return 0;
}

and compile as:

$ nvc -mp=gpu -gpu=nordc -Minfo=accel,mp bug.c
main:
     10, #omp target teams distribute parallel for
         10, Generating Tesla and Multicore code
             Generating "nvkernel_main_F1L10_1" GPU kernel
         16, Loop parallelized across teams and threads(128), schedule(static)
     10, Generating map(from:y[:N])
         Generating map(to:x[:N])

But when I execute it, I don't see any GPU kernel launch:

$ nvprof ./a.out
==15021== NVPROF is profiling process 15021, command: ./a.out
0
==15021== Profiling application: ./a.out
==15021== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   54.23%  2.4640us         1  2.4640us  2.4640us  2.4640us  [CUDA memcpy DtoH]
                   45.77%  2.0800us         1  2.0800us  2.0800us  2.0800us  [CUDA memcpy HtoD]
      API calls:   81.07%  204.67ms         2  102.34ms     315ns  204.67ms  cuDevicePrimaryCtxRetain
                   10.27%  25.937ms         1  25.937ms  25.937ms  25.937ms  cuMemHostAlloc
                    8.03%  20.276ms         1  20.276ms  20.276ms  20.276ms  cuMemAllocManaged
                    0.37%  939.25us         1  939.25us  939.25us  939.25us  cuMemAllocHost
                    0.08%  205.15us         3  68.382us  6.0140us  191.10us  cuMemAlloc
                    0.06%  155.46us        45  3.4540us     125ns  144.85us  cuDeviceGetAttribute
                    0.05%  133.04us         4  33.260us  28.169us  45.077us  cuDeviceGetName
                    0.01%  32.675us         1  32.675us  32.675us  32.675us  cuMemcpyHtoDAsync
                    0.01%  24.818us         1  24.818us  24.818us  24.818us  cuMemcpyDtoHAsync
                    0.01%  17.864us         2  8.9320us  6.9090us  10.955us  cuStreamSynchronize
                    0.00%  11.425us         4  2.8560us  1.0110us  7.3710us  cuDeviceGetPCIBusId
                    0.00%  11.071us         1  11.071us  11.071us  11.071us  cuStreamCreate
                    0.00%  10.965us         2  5.4820us  3.9420us  7.0230us  cuPointerGetAttributes
                    0.00%  7.9440us         1  7.9440us  7.9440us  7.9440us  cuEventRecord
                    0.00%  6.3070us         2  3.1530us     923ns  5.3840us  cuEventCreate
                    0.00%  4.6700us         3  1.5560us     392ns  3.3220us  cuCtxSetCurrent
                    0.00%  4.5810us        12     381ns     116ns  2.5750us  cuDeviceGet
                    0.00%  3.9870us         1  3.9870us  3.9870us  3.9870us  cuEventSynchronize
                    0.00%  1.7530us         1  1.7530us  1.7530us  1.7530us  cuCtxGetCurrent
                    0.00%  1.3970us         4     349ns     184ns     601ns  cuDeviceGetCount
                    0.00%     709ns         1     709ns     709ns     709ns  cuInit
                    0.00%     706ns         4     176ns     123ns     320ns  cuDeviceComputeCapability
                    0.00%     131ns         1     131ns     131ns     131ns  cuDriverGetVersion
 OpenACC (excl):   99.13%  26.128ms         1  26.128ms  26.128ms  26.128ms  acc_enter_data@bug.c:10
                    0.25%  67.000us         1  67.000us  67.000us  67.000us  acc_exit_data@bug.c:10
                    0.25%  66.977us         1  66.977us  66.977us  66.977us  acc_enqueue_download@bug.c:18
                    0.17%  43.732us         1  43.732us  43.732us  43.732us  acc_enqueue_upload@bug.c:10
                    0.09%  22.474us         1  22.474us  22.474us  22.474us  acc_wait@bug.c:18
                    0.06%  15.976us         1  15.976us  15.976us  15.976us  acc_wait@bug.c:10
                    0.05%  12.263us         1  12.263us  12.263us  12.263us  acc_device_init@bug.c:10
                    0.00%       0ns         2       0ns       0ns       0ns  acc_delete@bug.c:18
                    0.00%       0ns         2       0ns       0ns       0ns  acc_create@bug.c:10
                    0.00%       0ns         2       0ns       0ns       0ns  acc_alloc@bug.c:10

==15021== Unified Memory profiling result:
Total CPU Page faults: 1
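
(As a sanity check, a small probe along these lines should tell the failure modes apart: with map(tofrom:), it prints -1 if the target region body never executes at all, 1 if the region falls back to the host, and 0 if it really runs on the device. Given that the trace above shows the data transfers but no kernel, I would expect -1 here with -gpu=nordc.)

$ cat probe.c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int where = -1;                       // stays -1 if the region body never runs
    #pragma omp target map(tofrom: where)
    {
        where = omp_is_initial_device();  // 1 = host fallback, 0 = device
    }
    printf("where = %d\n", where);
    return 0;
}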

Without -gpu=nordc, I see the expected output:

$ nvc -mp=gpu -Minfo=accel,mp bug.c
main:
     10, #omp target teams distribute parallel for
         10, Generating Tesla and Multicore code
             Generating "nvkernel_main_F1L10_1" GPU kernel
         16, Loop parallelized across teams and threads(128), schedule(static)
     10, Generating map(from:y[:N])
         Generating map(to:x[:N])

$ nvprof ./a.out
==18506== NVPROF is profiling process 18506, command: ./a.out
15
==18506== Profiling application: ./a.out
==18506== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   43.81%  4.1920us         1  4.1920us  4.1920us  4.1920us  nvkernel_main_F1L10_1
                   33.44%  3.2000us         2  1.6000us  1.3760us  1.8240us  [CUDA memcpy HtoD]
....

With OpenACC, -gpu=nordc works:

$ nvc -acc -gpu=nordc -Minfo=accel,mp bug.c
main:
     10, Generating copyin(x[:N]) [if not already present]
         Generating copyout(y[:N]) [if not already present]
         Generating Tesla code
         16, #pragma acc loop gang, vector(96) /* blockIdx.x threadIdx.x */

$ nvprof ./a.out
==19080== NVPROF is profiling process 19080, command: ./a.out
15
==19080== Profiling application: ./a.out
==19080== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   55.39%  4.9280us         1  4.9280us  4.9280us  4.9280us  main_10_gpu
                   24.11%  2.1450us         1  2.1450us  2.1450us  2.1450us  [CUDA memcpy DtoH]
                   20.50%  1.8240us         1  1.8240us  1.8240us  1.8240us  [CUDA memcpy HtoD]
...

Am I missing something? Or does some other flag or option need to be used together with -mp=gpu -gpu=nordc?

This is a known limitation. I asked engineering about it; they had been working on it, but it got pushed down the priority list. They'll see if they can get it addressed.
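
In the meantime, the workaround is to keep RDC enabled (the default) when compiling with -mp=gpu, i.e. drop -gpu=nordc, as in your second example above, which launches the kernel as expected:

$ nvc -mp=gpu -Minfo=accel,mp bug.c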