NVHPC 21.11 and beyond: bug with pointer-to-pointer offloading

Hello all,

It seems there is a difference between NVHPC 21.9 and 21.11 in how pointer-to-pointer offloading is handled in a multi-GPU, multi-threaded context. Using the minimal reproducer below, distilled from the error in our larger project, we observe successful GPU memory allocation and subsequent deallocation across two devices when compiled with 21.9, but the program hangs with 21.11 (and with every newer NVHPC version we tried). Both threads were assigned devices, yet thread 1 kept launching its kernel on GPU 0: the job showed repeated kernel launches from thread 1 to device 0, printed no error messages, and was eventually timed out. This points to a regression in OpenACC runtime behavior for multi-threaded double** offloading after 21.9.

#include <cstdio>
#include <omp.h>
#include <openacc.h>

#define iN 1000
#define gs3 2000
#define nomp 2

int main() {
    printf(" Starting: double** with OpenMP (2 GPUs)\n");

    double** valS1 = new double*[iN];
    for (int i = 0; i < iN; ++i)
        valS1[i] = new double[gs3];

    #pragma omp parallel num_threads(nomp)
    {
        int tid = omp_get_thread_num();
        acc_set_device_num(tid, acc_device_nvidia);
        printf("Thread %d: assigned to GPU %d\n", tid, tid);

        printf("Thread %d: entering data...\n", tid);
        #pragma acc enter data create(valS1[0:iN][0:gs3])

        printf("Thread %d: running kernel...\n", tid);
        #pragma acc parallel loop present(valS1[0:iN][0:gs3])
        for (int i = 0; i < iN; ++i)
            for (int j = 0; j < gs3; ++j)
                valS1[i][j] = 1.0;

        printf("Thread %d: exiting data...\n", tid);
        #pragma acc exit data delete(valS1[0:iN][0:gs3])

        printf("Thread %d: done.\n", tid);
    }

    for (int i = 0; i < iN; ++i)
        delete[] valS1[i];
    delete[] valS1;

    printf("Done (but this likely hangs with 21.11!).\n");
    return 0;
}

Any insight to this issue would be greatly appreciated!

As a side note, this issue is also a continuation of a previous forum posted by a member of my group:
https://forums.developer.nvidia.com/t/unable-to-get-code-to-run-on-multiple-gpus/320032

Hi ndmeier,

I remember that issue and am grateful you were able to distill it down to this simple reproducer.

It appears to be hanging in the event synchronization, which I believe was first added in 21.11. I filed a problem report, TPR #37249, and sent it to engineering.

The workaround is to avoid the default synchronization by using the “async” clause, giving each thread its own async queue.

% cat test.cpp
#include <cstdio>
#include <openacc.h>
#include <omp.h>

#define iN 1000
#define gs3 2000
#define nomp 2

int main() {

    printf(" Starting: double** with OpenMP (2 GPUs)\n");

    double** valS1 = new double*[iN];
    for (int i = 0; i < iN; ++i)
        valS1[i] = new double[gs3];

    #pragma omp parallel num_threads(nomp)
    {
        int tid = omp_get_thread_num();

        acc_set_device_num(tid, acc_device_nvidia);
        printf("Thread %d: assigned to GPU %d\n", tid, tid);

        printf("Thread %d: entering data...\n", tid);
        #pragma acc enter data create(valS1[0:iN][0:gs3]) async(tid)

        printf("Thread %d: running kernel...\n", tid);
        #pragma acc parallel loop present(valS1[0:iN][0:gs3]) async(tid)
        for (int i = 0; i < iN; ++i) {
            for (int j = 0; j < gs3; ++j) {
              valS1[i][j] = 1.0;
            }
        }
        #pragma acc wait
        printf("Thread %d: exiting data...\n", tid);
        #pragma acc exit data delete(valS1[0:iN][0:gs3]) async(tid)
        #pragma acc wait(tid)

        printf("Thread %d: done.\n", tid);
    }

    for (int i = 0; i < iN; ++i)
        delete[] valS1[i];
    delete[] valS1;

    printf("Done (but this likely hangs with 21.11!).\n");
    return 0;
}
% nvc++ -acc -mp test.cpp -V25.3; a.out
 Starting: double** with OpenMP (2 GPUs)
Thread 1: assigned to GPU 1
Thread 1: entering data...
Thread 0: assigned to GPU 0
Thread 0: entering data...
Thread 1: running kernel...
Thread 1: exiting data...
Thread 0: running kernel...
Thread 0: exiting data...
Thread 1: done.
Thread 0: done.
Done (but this likely hangs with 21.11!).

-Mat

Hi Mat,
I appreciate your quick response! We had tried that as an initial solution, and while it fixes the sample code, our main project computes fairly complex integrals on multiple GPUs, and changing the default synchronization in that manner unfortunately isn’t an ideal long-term solution. If you are aware of any other workarounds, or if engineering has more information in the coming days, we would love to know.

Again, thank you very much for your help,
-Nate

Hi Nate,

Good news. Engineering was able to fix the hang in our 25.5 release, and I’m able to run the reproducing example successfully. Please let us know if you encounter any other issues when running the full code.

-Mat

% nvc++ -acc -mp test.cpp -Minfo -V25.5 ; a.out
main:
     18, #omp parallel
     27, Generating enter data create(valS1[:1000][:2000])
         Generating present(valS1[:1000][:2000])
         Generating NVIDIA GPU code
         29, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         30, #pragma acc loop seq
     30, Complex loop carried dependence of __nv_main_F1L18_1_3->->,valS1->,valS1->-> prevents parallelization
     39, Generating exit data delete(valS1[:1000][:2000])
 Starting: double** with OpenMP (2 GPUs)
Thread 0: assigned to GPU 0
Thread 0: entering data...
Thread 1: assigned to GPU 1
Thread 1: entering data...
Thread 0: running kernel...
Thread 0: exiting data...
Thread 0: done.
Thread 1: running kernel...
Thread 1: exiting data...
Thread 1: done.
Done (but this likely hangs with 21.11!).

Thank you, our group appreciates you and your team resolving this issue!

-Alex