NVHPC 21.11 and beyond: bug with pointer-to-pointer offloading

Hello all,

It seems there is a difference between NVHPC 21.9 and 21.11 in how pointer-to-pointer offloading is handled in a multi-GPU, multi-threaded context. Using the minimal reproducer below, distilled from the error in our larger project, we observe successful GPU memory allocation and subsequent deallocation across two devices when compiled with 21.9, but the program hangs with 21.11 (and with every newer NVHPC version we tried). Both threads were assigned devices, yet thread 1 kept launching its kernel on GPU 0: the job showed repeated kernel launches from thread 1 to device 0, printed no error messages, and was eventually timed out. This points to a regression in OpenACC runtime behavior for multi-threaded double** offloading after 21.9.

#include <cstdio>
#include <omp.h>
#include <openacc.h>

#define iN 1000
#define gs3 2000
#define nomp 2

int main() {
    printf(" Starting: double** with OpenMP (2 GPUs)\n");

    double** valS1 = new double*[iN];
    for (int i = 0; i < iN; ++i)
        valS1[i] = new double[gs3];

    #pragma omp parallel num_threads(nomp)
    {
        int tid = omp_get_thread_num();
        acc_set_device_num(tid, acc_device_nvidia);
        printf("Thread %d: assigned to GPU %d\n", tid, tid);

        printf("Thread %d: entering data...\n", tid);
        #pragma acc enter data create(valS1[0:iN][0:gs3])

        printf("Thread %d: running kernel...\n", tid);
        #pragma acc parallel loop present(valS1[0:iN][0:gs3])
        for (int i = 0; i < iN; ++i)
            for (int j = 0; j < gs3; ++j)
                valS1[i][j] = 1.0;

        printf("Thread %d: exiting data...\n", tid);
        #pragma acc exit data delete(valS1[0:iN][0:gs3])

        printf("Thread %d: done.\n", tid);
    }

    for (int i = 0; i < iN; ++i)
        delete[] valS1[i];
    delete[] valS1;

    printf("Done (but this likely hangs with 21.11!).\n");
    return 0;
}

Any insight to this issue would be greatly appreciated!

As a side note, this issue is also a continuation of a previous forum posted by a member of my group:
https://forums.developer.nvidia.com/t/unable-to-get-code-to-run-on-multiple-gpus/320032

Hi ndmeier,

I remember that issue and am grateful you were able to distill it down to this simple reproducer.

It appears to be hanging in the event synchronization, which I believe was first added in 21.11. I filed a problem report, TPR #37249, and sent it to engineering.

The workaround is to avoid the default synchronization by using the “async” clause, giving each thread its own async queue.

% cat test.cpp
#include <cstdio>
#include <openacc.h>
#include <omp.h>

#define iN 1000
#define gs3 2000
#define nomp 2

int main() {

    printf(" Starting: double** with OpenMP (2 GPUs)\n");

    double** valS1 = new double*[iN];
    for (int i = 0; i < iN; ++i)
        valS1[i] = new double[gs3];

    #pragma omp parallel num_threads(nomp)
    {
        int tid = omp_get_thread_num();

        acc_set_device_num(tid, acc_device_nvidia);
        printf("Thread %d: assigned to GPU %d\n", tid, tid);

        printf("Thread %d: entering data...\n", tid);
        #pragma acc enter data create(valS1[0:iN][0:gs3]) async(tid)

        printf("Thread %d: running kernel...\n", tid);
        #pragma acc parallel loop present(valS1[0:iN][0:gs3]) async(tid)
        for (int i = 0; i < iN; ++i) {
            for (int j = 0; j < gs3; ++j) {
              valS1[i][j] = 1.0;
            }
        }
        #pragma acc wait
        printf("Thread %d: exiting data...\n", tid);
        #pragma acc exit data delete(valS1[0:iN][0:gs3]) async(tid)
        #pragma acc wait(tid)

        printf("Thread %d: done.\n", tid);
    }

    for (int i = 0; i < iN; ++i)
        delete[] valS1[i];
    delete[] valS1;

    printf("Done (but this likely hangs with 21.11!).\n");
    return 0;
}
% nvc++ -acc -mp test.cpp -V25.3; a.out
 Starting: double** with OpenMP (2 GPUs)
Thread 1: assigned to GPU 1
Thread 1: entering data...
Thread 0: assigned to GPU 0
Thread 0: entering data...
Thread 1: running kernel...
Thread 1: exiting data...
Thread 0: running kernel...
Thread 0: exiting data...
Thread 1: done.
Thread 0: done.
Done (but this likely hangs with 21.11!).

-Mat

Hi Mat,
I appreciate your quick response! We had tried that as an initial solution, and while it fixes the sample code, our main project computes fairly complex integrals on multiple GPUs, and changing the default synchronization in that manner unfortunately isn’t an ideal long-term solution. If you are aware of any other workarounds, or if engineering has more information in the coming days, we would love to know.

Again, thank you very much for your help,
-Nate

Hi Nate,

Good news. Engineering was able to fix the hang in our 25.5 release, and I’m able to run the reproducing example successfully. Please let us know if you encounter any other issues when running the full code.

-Mat

% nvc++ -acc -mp test.cpp -Minfo -V25.5 ; a.out
main:
     18, #omp parallel
     27, Generating enter data create(valS1[:1000][:2000])
         Generating present(valS1[:1000][:2000])
         Generating NVIDIA GPU code
         29, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         30, #pragma acc loop seq
     30, Complex loop carried dependence of __nv_main_F1L18_1_3->->,valS1->,valS1->-> prevents parallelization
     39, Generating exit data delete(valS1[:1000][:2000])
 Starting: double** with OpenMP (2 GPUs)
Thread 0: assigned to GPU 0
Thread 0: entering data...
Thread 1: assigned to GPU 1
Thread 1: entering data...
Thread 0: running kernel...
Thread 0: exiting data...
Thread 0: done.
Thread 1: running kernel...
Thread 1: exiting data...
Thread 1: done.
Done (but this likely hangs with 21.11!).

Thank you, our group appreciates you and your team resolving this issue!

-Alex