Nvc: omp parallel for in declare target subroutine does not work

I have the latest version (25.5) of the NVIDIA HPC SDK installed:

$ nvc --version

nvc 25.5-0 64-bit target on x86-64 Linux -tp alderlake 
NVIDIA Compilers and Tools
Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

I believe the following example is valid OpenMP code:

#include <stdio.h>
#include <stdlib.h>


#pragma omp begin declare target
void kernel(int cellsperdim, float* gridarr) {
    int k, j, i;
    int ip;

    #pragma omp parallel for collapse(2) private(i, j, k, ip)
    for (i = 0; i < cellsperdim; i++) {
        for (j = 0; j < cellsperdim; j++) {
            for (k = 0; k < cellsperdim; k++) {
                ip = i * cellsperdim * cellsperdim + j * cellsperdim + k;
                gridarr[ip] = (float)i + 1.0f;
            }
        }
    }
}
#pragma omp end declare target


int main() {
    int ngrids = 2;
    int cellsperdim = 4;
    int cellspergrid = cellsperdim*cellsperdim*cellsperdim;
    int iouter, ip;
    float *arr;

    // Allocate and initialize array
    arr = (float *)malloc(ngrids * cellspergrid * sizeof(float));
    for (ip = 0; ip < ngrids*cellspergrid; ip++) {
        arr[ip] = -1.0f;
    }

    // Offload to device using OpenMP target teams
    #pragma omp target teams distribute private(ip) map(tofrom: arr[0:ngrids * cellspergrid])
    for (iouter = 0; iouter < ngrids; iouter++) {
        ip = iouter * cellspergrid;
        kernel(cellsperdim, &arr[ip]);
    }

    // Print results
    for (ip = 0; ip < ngrids*cellspergrid; ip++) {
        printf("%.1f  ", arr[ip]);
    }
    printf("\n");

    // Free allocated memory
    free(arr);

    return 0;
}

This program works with gcc, clang, and Cray cc, both with and without offloading. nvc compiles the program without issues (nvc -mp=gpu program.c), and running it without offloading gives the expected results:

$ OMP_TARGET_OFFLOAD=disabled ./a.out 
1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0

but with offloading it fails:

$ OMP_TARGET_OFFLOAD=mandatory ./a.out 
-1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0  -1.0

Removing the pragma omp parallel for in the kernel makes the program print the correct output with offloading as well (but that defeats the point of the program).

I am using an Nvidia RTX 4080 Super, except with the Cray compiler where I am on a different, shared HPC system with different hardware.

Am I missing something important here or is this a bug in the compiler? Thanks in advance for any hints.

Hi hakostra,

I don’t think you can use “parallel” on the orphaned loop. I’m not sure if it’s specifically against the OpenMP standard, but it creates a new parallel region. That’s not an issue when the parallel region is visible within the outer loop, but inside a subroutine the compiler doesn’t have that visibility, so it must create an implicit parallel region as part of the outer loop. With “parallel” on the orphaned loop, it would then need nested parallelism. The fix is to use just “for” to distribute the work across the parallel threads.
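
For reference, the kernel with that change applied:

#pragma omp begin declare target
void kernel(int cellsperdim, float* gridarr) {
    int k, j, i;
    int ip;
    // "for" only distributes the iterations over the threads that already
    // exist in the team; no new parallel region is created in the subroutine.
    #pragma omp for collapse(2) private(i, j, k, ip)
    for (i = 0; i < cellsperdim; i++) {
        for (j = 0; j < cellsperdim; j++) {
            for (k = 0; k < cellsperdim; k++) {
                ip = i * cellsperdim * cellsperdim + j * cellsperdim + k;
                gridarr[ip] = (float)i + 1.0f;
            }
        }
    }
}
#pragma omp end declare target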

I tested your code with both gcc 15.1 and nvc 25.5, and both fail: nvc doesn’t compute the loop and gcc gets a runtime error. If I remove “parallel”, both run correctly and give valid answers.

Example:

% diff test.c test2.c
9,10c9
<
<     #pragma omp parallel for collapse(2) private(i, j, k, ip)
---
>     #pragma omp for collapse(2) private(i, j, k, ip)
% gcc -fopenmp -fopenmp -foffload=nvptx-none -foffload-options="-march=sm_80" test.c; a.out

libgomp: Link error log ptxas application ptx input, line 73; error   : Illegal operand type to instruction 'ld'
ptxas application ptx input, line 96; error   : Illegal operand type to instruction 'ld'
ptxas application ptx input, line 73; error   : Unknown symbol '__stack_chk_guard'
ptxas application ptx input, line 96; error   : Unknown symbol '__stack_chk_guard'
ptxas fatal   : Ptx assembly aborted due to errors


libgomp: cuLinkAddData (ptx_code) error: a PTX JIT compilation failed

libgomp: Cannot map target functions or variables (expected 1 + 0 + 1, have 4294967295)
% gcc -fopenmp -fopenmp -foffload=nvptx-none -foffload-options="-march=sm_80" test2.c ; a.out
1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0
% nvc -mp=gpu test2.c ; a.out                                                          
1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.0

-Mat

Hi and thanks for the very rapid feedback.

I am unsure if I completely understand the concepts involved here…

My use case is a solver that operates on a huge number of independent Cartesian grids, modelled by the outer loop in the main program. There can be thousands of grids, and this is meant to be the outer level of parallelism, mapping to OpenMP teams, with each team computing a number of grids (omp target teams distribute).

The second level of parallelism is to distribute the (k, j, i) index space of the grids between the threads within each team.

My understanding was that omp parallel was required to create the parallel threads that omp for then utilizes. In your updated example there is no parallel anywhere - is this still using more than just the outer level of parallelism?

An alternative would of course be to have everything in the same function, but I am afraid that will be very complicated to manage.

Thanks for the help!

I found that the Intel compiler documentation has a section on something they call “orphaned pragmas” (second header from the top). The example is probably close to what I need, but there is an omp parallel in the scope above; see my question on the usage of parallel in my previous post.
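
Their pattern, as I read it, is roughly the following (my own minimal CPU-only sketch of the idea, not their exact code):

void do_work(int n, float *a) {
    int i;

    // Orphaned worksharing construct: it binds to whatever parallel
    // region is active in the caller.
    #pragma omp for
    for (i = 0; i < n; i++) {
        a[i] = (float)i;
    }
}

void caller(int n, float *a) {
    // The parallel region sits one scope above the orphaned "for".
    #pragma omp parallel
    do_work(n, a);
}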

I would put this in the category of requirements of an implementation that supports offload to GPUs. The Intel example is for CPU multicore.

Under the hood, an OpenMP “team” maps to a CUDA block and a “thread” maps to a CUDA thread within the block. The schedule, i.e. the number of blocks and threads per block, is defined when launching a kernel. The “target” region defines the beginning and end of the kernel, so the schedule is set at the start of the target region.
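
As an illustration only (the values here are arbitrary, not a recommendation), the launch configuration can be capped with the standard num_teams and thread_limit clauses on the construct that opens the target region:

// Request at most ngrids teams (CUDA blocks) and at most 128 threads per
// team; the runtime may still choose fewer.
#pragma omp target teams distribute num_teams(ngrids) thread_limit(128) \
        private(ip) map(tofrom: arr[0:ngrids * cellspergrid])
for (iouter = 0; iouter < ngrids; iouter++) {
    ip = iouter * cellspergrid;
    kernel(cellsperdim, &arr[ip]);
}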

“parallel” basically tells when to create the threads within a team. But here it appears inside a subroutine, at a point where the threads have already been created. So it’s essentially asking each thread to spawn more threads, which can’t be done on the GPU unless nested parallelism is used, i.e. when one kernel launches another kernel from the device. We don’t support nested parallelism since we haven’t found a compelling use case, nor is it likely the behavior you’d want.

It works when just “for” is used, since that only says how to distribute the work amongst the threads, which are already created.

Now, what actually works better with our compiler is to use the “loop” directive with the “bind” clause rather than “distribute” and “parallel for”. While “loop” is more restrictive than “distribute”, it allows the compiler to make more decisions at compile time rather than at runtime.

Something like the following:

#include <stdio.h>
#include <stdlib.h>


#pragma omp begin declare target
void kernel(int cellsperdim, float* gridarr) {
    int k, j, i;
    int ip;
    #pragma omp loop bind(parallel) collapse(2) private(i, j, k, ip)
    for (i = 0; i < cellsperdim; i++) {
        for (j = 0; j < cellsperdim; j++) {
            for (k = 0; k < cellsperdim; k++) {
                ip = i * cellsperdim * cellsperdim + j * cellsperdim + k;
                gridarr[ip] = (float)i + 1.0f;
            }
        }
    }
}
#pragma omp end declare target


int main() {
    int ngrids = 2;
    int cellsperdim = 4;
    int cellspergrid = cellsperdim*cellsperdim*cellsperdim;
    int iouter, ip;
    float *arr;

    // Allocate and initialize array
    arr = (float *)malloc(ngrids * cellspergrid * sizeof(float));
    for (ip = 0; ip < ngrids*cellspergrid; ip++) {
        arr[ip] = -1.0f;
    }

    // Offload to device using OpenMP target teams
    #pragma omp target teams loop bind(teams) private(ip) map(tofrom: arr[0:ngrids * cellspergrid])
    for (iouter = 0; iouter < ngrids; iouter++) {
        ip = iouter * cellspergrid;
        kernel(cellsperdim, &arr[ip]);
    }

    // Print results
    for (ip = 0; ip < ngrids*cellspergrid; ip++) {
        printf("%.1f  ", arr[ip]);
    }
    printf("\n");

    // Free allocated memory
    free(arr);

    return 0;
}

Thank you very much for your help and advice. We will experiment with variations around this and see how it works on various systems.

Hi, I have tried to run the example with the suggested omp loop bind(teams) in the main function and omp loop bind(parallel) in the kernel function. Running it with nvc 25.5 gives the expected correct result on my machine. I have increased ngrids and cellsperdim a bit, and Nsight shows me reasonable grid and block dimensions for the kernel, so everything seems correct.
I don’t really understand the bind(…) clause in detail, but I thought why not try bind(thread) in the kernel. This also works, but with nvc the BlockXYZ is now (1 1 1), so to my understanding we are offloading to the teams but each team has only one thread, which has to work through the full triple loop defined in the kernel function (please correct me if I am wrong).
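
For reference, the only change I made for this experiment is the pragma inside kernel():

// bind(thread): with nvc, the encountering thread appears to execute the
// whole collapsed nest by itself.
#pragma omp loop bind(thread) collapse(2) private(i, j, k, ip)
for (i = 0; i < cellsperdim; i++) {
    for (j = 0; j < cellsperdim; j++) {
        for (k = 0; k < cellsperdim; k++) {
            ip = i * cellsperdim * cellsperdim + j * cellsperdim + k;
            gridarr[ip] = (float)i + 1.0f;
        }
    }
}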

As of now everything is more or less as expected. But if I try to run this code with gcc-13, using omp loop bind(parallel) in the kernel makes the program hang indefinitely. Worse, if I run the program with omp loop bind(thread) in the kernel, I get the expected result AND I can see in Nsight that the block extents are no longer (1 1 1) but now have a much larger extent of (32 8 1), so my understanding of bind(thread) from above does not hold for gcc-13.

Additionally I tried to compile the code using Cray cc, but the compiler crashes badly with: Aborted (core dumped).
Compiling an equivalent example in Fortran with Cray ftn works, but the results are wrong when using omp loop bind(parallel) in the kernel. The program only executes one loop iteration in each team.
When I compile this Fortran code with omp loop bind(thread) in the kernel, I get correct results again, and the block extents are also not (1 1 1) but much bigger.

So my conclusion is that nvc seems to have the correct behavior, if my understanding of the bind clause is correct, while all the other compilers I have tested behave correctly ONLY with omp loop bind(thread) in the kernel. With omp loop bind(parallel) in the kernel they either hang or produce wrong results.

@MatColgrove Can you elaborate on the bind(...) clause? I don’t really understand the description in the OpenMP spec, especially when loops are split across multiple functions like the example in this thread. And do you know whether the NVIDIA compilers are further along with the loop construct than the others? I don’t really know what to make of my results, since there is so much variation in behavior between compilers.

Hi j.fuchs and welcome,

I think what you’re seeing is differences in implementation decisions. For NVHPC, we map “teams” to CUDA blocks, “parallel” to the threads in each block, and “thread” to a single thread (i.e. serial, which is why you see only a single thread in this case).
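
As a small sketch of that mapping in a declare target routine like yours:

#pragma omp begin declare target
void team_work(int n, float *a) {
    int i;

    // bind(parallel): the iterations are shared among the threads of the
    // calling team, i.e. the threads of one CUDA block with NVHPC.
    #pragma omp loop bind(parallel)
    for (i = 0; i < n; i++) {
        a[i] += 1.0f;
    }

    // bind(thread) here would instead have the encountering thread run all
    // n iterations by itself, which is why you saw a block size of (1 1 1).
}
#pragma omp end declare target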

Now, I’m not an expert in GNU’s implementation so I may be off a bit, but from what I can tell, they also map “teams” to blocks, but map “simd” to the threads in each block. So they may be mapping “bind(thread)” to “simd”. I’m not sure what they do with “parallel”, nor how they define a sequential loop.

I have very little experience with Cray. Though I thought they did the same thing as we do for “target teams distribute parallel for”, and don’t use “simd” on the device. But maybe they follow GNU for “loop”?

From a user perspective, it’s not great that each compiler implements things differently, but it’s the way it is, sorry.

I will say that this is a bit of a corner case, in that “bind” is only really necessary in these types of orphaned loops. If this loop were in the main body of the target region, you could just use “loop” without a bind and each compiler should be able to map it correctly.
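
For example, something along these lines (a sketch only, untested; your kernel body is inlined and the index variables renamed):

#pragma omp target teams loop map(tofrom: arr[0:ngrids * cellspergrid])
for (iouter = 0; iouter < ngrids; iouter++) {
    int offset = iouter * cellspergrid;
    // The inner nest is now lexically visible inside the target region,
    // so a plain "loop" without a bind clause is sufficient.
    #pragma omp loop collapse(2)
    for (int i = 0; i < cellsperdim; i++) {
        for (int j = 0; j < cellsperdim; j++) {
            for (int k = 0; k < cellsperdim; k++) {
                int idx = offset + i * cellsperdim * cellsperdim
                        + j * cellsperdim + k;
                arr[idx] = (float)i + 1.0f;
            }
        }
    }
}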

-Mat

Thank you for the quick answer, that clears up my confusion a bit :)