CUDA_ERROR_ILLEGAL_ADDRESS with OpenMP "distribute parallel for"

I have the following example:

#include <omp.h>

#include <iostream>
#include <algorithm>

int main() {
    // Initialize
    int npts = 4096;
    float* PA = (float*)malloc(npts * sizeof(float));
    for (std::size_t i = 0; i < npts; i++) {
        PA[i] = 0.0f;
    }

    #pragma omp target teams map(to: npts) reduction(+:PA[0:npts])
    {
        int iteam = omp_get_team_num();
        if (iteam == 0) {
            int nteams = omp_get_num_teams();
            printf("Running %d teams\n", nteams);
        }

        // Without this pragma line the program works
        #pragma omp distribute parallel for
        for (std::size_t i = 0; i < npts; ++i) {
            PA[i] = 1.0f;
        }
    }

    // Print content of 10 first elements of array
    std::cout << std::endl;
    std::cout << "PA: " << std::endl;
    for (std::size_t i = 0; i < std::min(npts, 10); i++) {
        std::cout << PA[i] << std::endl;
    }
    std::cout << std::endl;

    return 0;
}

It compiles (nvc++ -g -mp=gpu example.cpp) just fine, but when executed I get the following error:

Running 80 teams
Failing in Thread:1
Accelerator Fatal Error: call to cuMemcpyDtoHAsync returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
 File: /path/to/example.cpp
 Function: main:6
 Line: 29

I am new to OpenMP device/GPU programming, so maybe I am going about this the wrong way… Let me try to explain the reasoning behind the example:

I have an algorithm which in practice is a big reduction on an array (here: PA). Each array element receives contributions from a huge number of places, so it is effectively a reduction (PA[i] += something).

I am trying to set this up with the outer omp target teams construct, which seems to start 80 teams on my GPU. Then I want each of these 80 “teams” to distribute some work (computing the elements “PA[i]”) among a number of threads with the inner omp distribute parallel for, and when this is done there should be a big reduction of the individual “PA” from each team onto the global host array “PA”.

The global reduction to the host seems to work: when the inner omp distribute parallel for is removed, each team sets 1.0 in the PA array and the global array ends up holding 80.0 at the end of the program (with 80 teams). However, when using omp distribute parallel for I get the error above.

I had the impression that each thread of a team (inside the omp distribute parallel for) could access that team's memory. Is this wrong?

I tried adding map(tofrom: PA) to the outer omp target teams, but that did not change anything.

Can any of you help me understand this problem? Thanks.

My compiler version:

$ nvc++ --version

nvc++ 24.3-0 64-bit target on x86-64 Linux -tp alderlake 
NVIDIA Compilers and Tools
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Hi hakostra,

You’re likely encountering a known issue with reductions on dynamically sized arrays across multiple OpenMP levels. It would be OK (with caveats) if it were a single “target teams distribute parallel for reduction…”, but that wouldn’t work with your algorithm.
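
For reference, here is roughly what that single-level version would look like. This is just a sketch to illustrate the combined construct; the map clause and the loop body are placeholders, not your actual algorithm:

#include <cstdio>
#include <cstdlib>

int main() {
    const int npts = 4096;
    float* PA = (float*)malloc(npts * sizeof(float));
    for (int i = 0; i < npts; ++i) PA[i] = 0.0f;

    // Single combined construct: only one level of parallelism, so the
    // compiler/runtime can handle the array-section reduction directly.
    #pragma omp target teams distribute parallel for map(tofrom: PA[0:npts]) reduction(+:PA[0:npts])
    for (int i = 0; i < npts; ++i) {
        PA[i] += 1.0f;   // placeholder for the real per-element contribution
    }

    printf("PA[0] = %g\n", PA[0]);
    free(PA);
    return 0;
}

The caveats about dynamically sized arrays discussed below still apply to this form.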

Instead, you can try either making “PA” a fixed-size array or using atomics instead of a reduction. For example:

#include <omp.h>

#include <iostream>
#include <algorithm>

int main() {
    // Initialize
    const int npts = 4096;
#ifdef USE_FIXED_SIZE
    float PA[npts];
#else
    float* PA = (float*)malloc(npts * sizeof(float));
#endif
    for (std::size_t i = 0; i < npts; i++) {
        PA[i] = 0.0f;
    }

#ifdef USE_ATOMIC
    #pragma omp target teams map(to: npts) map(tofrom:PA[0:npts])
#else
    #pragma omp target teams map(to: npts) reduction(+:PA[0:npts])
#endif
    {
        int iteam = omp_get_team_num();
        if (iteam == 0) {
            int nteams = omp_get_num_teams();
            printf("Running %d teams\n", nteams);
        }

        // Without this pragma line the program works
        #pragma omp distribute parallel for
        for (std::size_t i = 0; i < npts; ++i) {
#ifdef USE_ATOMIC
    #pragma omp atomic write
#endif
            PA[i] = 1.0f;
        }
    }

    // Print content of 10 first elements of array
    std::cout << std::endl;
    std::cout << "PA: " << std::endl;
    for (std::size_t i = 0; i < std::min(npts, 10); i++) {
        std::cout << PA[i] << std::endl;
    }
    std::cout << std::endl;

    return 0;
}

Example runs:

% nvc++ -w -mp=gpu -Minfo test.cpp -DUSE_FIXED_SIZE ; a.out
main:
      3, include "iostream"
          39, include "ostream"
               38, include "ios"
                    44, include "basic_ios.h"
                         23, #omp target teams
                             23, Generating "nvkernel_main_F237L23_2" GPU kernel
                                 Generating reduction(+:PA)
                             32, Team private (i) located in CUDA shared memory
                                 #omp distribute parallel for
                               32, Loop parallelized across teams and threads, schedule(static)
     23, Generating map(to:npts)
         Generating map(tofrom:PA[:])
Running 114 teams

PA:
1
1
1
1
1
1
1
1
1
1

% nvc++ -w -mp=gpu -Minfo test.cpp -DUSE_ATOMIC ; a.out
main:
      3, include "iostream"
          39, include "ostream"
               38, include "ios"
                    44, include "basic_ios.h"
                         23, #omp target teams
                             23, Generating "nvkernel_main_F237L23_2" GPU kernel
                             32, Team private (i) located in CUDA shared memory
                                 #omp distribute parallel for
                               32, Loop parallelized across teams and threads, schedule(static)
     23, Generating map(to:npts)
         Generating map(tofrom:PA[:4096])
Running 114 teams

PA:
1
1
1
1
1
1
1
1
1
1

With dynamic arrays, the runtime needs to allocate, on the device, each private array used to hold the partial reductions. With multiple levels, the compiler can’t determine the size of these arrays. If you were using loops where the size is determined at compile time, you’d see the same error reported HERE. With distribute, though, the region is outlined and passed to the runtime, where the unknown bounds cause the runtime error.

Even with a single level, though, array reductions on dynamically sized arrays are problematic. The device heap is relatively small, so it’s easy to overflow, especially with larger arrays and many threads. You can increase the heap size (via the env var NV_ACC_CUDA_HEAPSIZE), but performance will suffer given that device allocation is serialized. In general, I try to avoid using them.
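
For example, a run with a larger heap might look like the following; the size shown is only illustrative, so pick a value based on your array size and thread count:

% env NV_ACC_CUDA_HEAPSIZE=64MB a.out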

Another option is to switch to OpenACC, since it allows for more up-front compiler analysis, so the partial reduction arrays can be allocated before the kernel launch rather than during it.

For example:

% cat testacc.cpp
#include <iostream>
#include <algorithm>

int main() {
    // Initialize
    const int npts = 4096;
    float* PA = (float*)malloc(npts * sizeof(float));
    for (std::size_t i = 0; i < npts; i++) {
        PA[i] = 0.0f;
    }

    #pragma acc parallel reduction(+:PA[0:npts])
    {
        #pragma acc loop gang vector
        for (std::size_t i = 0; i < npts; ++i) {
            PA[i] += 1.0f;
        }
    }

    // Print content of 10 first elements of array
    std::cout << std::endl;
    std::cout << "PA: " << std::endl;
    for (std::size_t i = 0; i < std::min(npts, 10); i++) {
        std::cout << PA[i] << std::endl;
    }
    std::cout << std::endl;

    return 0;
}
% nvc++ -acc testacc.cpp -Minfo=accel ; a.out
main:
     13, Generating NVIDIA GPU code
         15, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(+:PA[:4096])
     13, Local memory used for PA
         Generating implicit copy(PA[:4096]) [if not already present]

PA:
1
1
1
1
1
1
1
1
1
1

Hi, and thanks for the comments. Your hint that the reduction was the problem was the key that enabled me to rewrite the code and make it work.

I chose a slightly different strategy than your suggestions, though, and since these forums are a great source of knowledge, I’ll explain what I did in the end:

On the host, I found the maximum number of teams that would start, and I allocated float* PA_d = (float*)malloc(npts*nteams_max * sizeof(float));. This array was then mapped to the GPU with map(tofrom: PA_d[0:nteams_max*npts]).

Inside the “omp target teams” I can create a local pointer that maps to a unique part of the PA_d array:

int iteam = omp_get_team_num();
float* PA = PA_d + iteam*npts;

Now each team can write into PA[…], and the results are transferred back to the host, which then performs the final reduction over the partial results from each team.
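
In case it is useful for others, here is a condensed sketch of this approach. How I determine nteams_max and what the inner loop actually computes are simplified placeholders here:

#include <omp.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const int npts = 4096;
    const int nteams_max = 128;   // placeholder: upper bound on the team count, found on the host

    // One partial array per team, stored contiguously
    float* PA_d = (float*)malloc((size_t)npts * nteams_max * sizeof(float));
    for (long i = 0; i < (long)npts * nteams_max; ++i) PA_d[i] = 0.0f;

    #pragma omp target teams num_teams(nteams_max) map(tofrom: PA_d[0:npts*nteams_max])
    {
        int iteam = omp_get_team_num();
        float* PA = PA_d + (long)iteam * npts;   // this team's private slice of PA_d

        // Split this team's work among its threads (placeholder loop body)
        #pragma omp parallel for
        for (int i = 0; i < npts; ++i) {
            PA[i] += 1.0f;
        }
    }

    // The host takes care of the final reduction over the per-team partials
    float* PA = (float*)calloc(npts, sizeof(float));
    for (int t = 0; t < nteams_max; ++t)
        for (int i = 0; i < npts; ++i)
            PA[i] += PA_d[(long)t * npts + i];

    printf("PA[0] = %g\n", PA[0]);
    free(PA);
    free(PA_d);
    return 0;
}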

Thanks for your help.