I have the latest version (25.5) of the NVIDIA HPC SDK installed:
$ nvc --version
nvc 25.5-0 64-bit target on x86-64 Linux -tp alderlake
NVIDIA Compilers and Tools
Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
I believe the following example is valid OpenMP code:
#include <stdio.h>
#include <stdlib.h>

#pragma omp begin declare target
void kernel(int cellsperdim, float* gridarr) {
    int k, j, i;
    int ip;
    #pragma omp parallel for collapse(2) private(i, j, k, ip)
    for (i = 0; i < cellsperdim; i++) {
        for (j = 0; j < cellsperdim; j++) {
            for (k = 0; k < cellsperdim; k++) {
                ip = i * cellsperdim * cellsperdim + j * cellsperdim + k;
                gridarr[ip] = (float)i + 1.0f;
            }
        }
    }
}
#pragma omp end declare target

int main() {
    int ngrids = 2;
    int cellsperdim = 4;
    int cellspergrid = cellsperdim*cellsperdim*cellsperdim;
    int iouter, ip;
    float *arr;

    // Allocate and initialize array
    arr = (float *)malloc(ngrids * cellspergrid * sizeof(float));
    for (ip = 0; ip < ngrids*cellspergrid; ip++) {
        arr[ip] = -1.0f;
    }

    // Offload to device using OpenMP target teams
    #pragma omp target teams distribute private(ip) map(tofrom: arr[0:ngrids * cellspergrid])
    for (iouter = 0; iouter < ngrids; iouter++) {
        ip = iouter * cellspergrid;
        kernel(cellsperdim, &arr[ip]);
    }

    // Print results
    for (ip = 0; ip < ngrids*cellspergrid; ip++) {
        printf("%.1f ", arr[ip]);
    }
    printf("\n");

    // Free allocated memory
    free(arr);
    return 0;
}
This program works with gcc, clang, and Cray cc, each with and without offloading. nvc compiles the program without issues (nvc -mp=gpu program.c), and running it without offloading gives the expected results:
$ OMP_TARGET_OFFLOAD=disabled ./a.out
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
But with offloading it fails, and the array still holds its initial values:
$ OMP_TARGET_OFFLOAD=mandatory ./a.out
-1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0
Removing the #pragma omp parallel for from the kernel makes the program print the correct output with offloading as well (but that defeats the point of the program).
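For anyone reproducing this, a small check like the following could replace eyeballing the printed output. It is only a sketch: count_mismatches is a hypothetical helper that is not part of the program above, and it assumes the same array layout (ngrids consecutive grids, each indexed as ip = i*cellsperdim*cellsperdim + j*cellsperdim + k).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the original program): counts the
   entries of arr that differ from the value kernel() is expected to
   write. Within one grid, ip = i*c*c + j*c + k, so i = ip / (c*c). */
size_t count_mismatches(const float *arr, int ngrids, int cellsperdim) {
    assert(cellsperdim > 0);  /* guards the division below */
    int cellsperplane = cellsperdim * cellsperdim;
    int cellspergrid = cellsperplane * cellsperdim;
    size_t bad = 0;
    for (int g = 0; g < ngrids; g++) {
        for (int ip = 0; ip < cellspergrid; ip++) {
            /* kernel() stores (float)i + 1.0f at every cell of plane i */
            float expected = (float)(ip / cellsperplane) + 1.0f;
            if (arr[g * cellspergrid + ip] != expected) {
                bad++;
            }
        }
    }
    return bad;
}
```

Calling count_mismatches(arr, ngrids, cellsperdim) right after the target region should return 0 when the kernel ran correctly. For the failing offload run, where every entry is still -1.0, it would return 128.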
I am using an NVIDIA RTX 4080 Super, except for the Cray compiler tests, which I ran on a different, shared HPC system with different hardware.
Am I missing something important here or is this a bug in the compiler? Thanks in advance for any hints.