NVC++-W-0155-Invalid accelerator data region: branching into or out of region is not allowed

I have the following code, taken from the book “OpenACC for Programmers”, for Jacobi iteration.


width and height are macros. A[width+2][height+2], B[width+2][height+2] and tolerance are global variables.
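The relevant part of the advanced function looks roughly like this (a sketch: the stencil body and the tolerance value are my reconstruction; the full file is attached below):

#include <cmath>  // fmax, fabs

double A[width + 2][height + 2];
double B[width + 2][height + 2];
double tolerance = 0.01;

void jacobi_iteration_openacc_advanced() {
  double error = 2 * tolerance;
  size_t iteration = 0;
#pragma acc data copy(A), create(B)
  while (error > tolerance) {  // data region attached directly to the while loop
    // stencil update: B from A
#pragma acc kernels
    for (size_t iter_x = 1; iter_x < width + 1; iter_x++)
      for (size_t iter_y = 1; iter_y < height + 1; iter_y++)
        B[iter_x][iter_y] = 0.25 * (A[iter_x - 1][iter_y] + A[iter_x + 1][iter_y] +
                                    A[iter_x][iter_y - 1] + A[iter_x][iter_y + 1]);

    // error calculation and copy back: A from B
    error = 0;
#pragma acc kernels
    for (size_t iter_x = 1; iter_x < width + 1; iter_x++)
      for (size_t iter_y = 1; iter_y < height + 1; iter_y++) {
        error = fmax(fabs(B[iter_x][iter_y] - A[iter_x][iter_y]), error);
        A[iter_x][iter_y] = B[iter_x][iter_y];
      }
    iteration++;
  }
}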
Then I get the following error: NVC++-W-0155-Invalid accelerator data region: branching into or out of region is not allowed.
Plus, the code runs very slowly compared to the sequential code, which takes 18 seconds. According to the book, and from what I understood, the openacc_advanced function should take only about 1.0 second, but it is taking 1006 seconds.
I am attaching my CMakeLists.txt.

Full file
jacobi_iteration_sequential.cc (3.9 KB)

FYI, I also get a “libcupti.so not found” error (I have export PGI_ACC_TIME=1 set).
Thanks for the help in advance.

Hi seshu.basava70

error NVC++-W-0155-Invalid accelerator data region: branching into or out of region is not allowed

Simple fix. Add a structured block for the data region so it’s not associated with the while loop. Something like:

#pragma acc data copy(A), create(B)
{   // << add
  while (error > tolerance) {
...
    iteration++;
  }
}   // << add

This should fix the performance problem as well since the code will no longer copy the arrays at each compute region and instead only copy before the while loop.
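Schematically, the whole iteration then looks like this (a sketch with the loop bodies elided; the only per-iteration host/device traffic left is the scalar “error” for the reduction):

#pragma acc data copy(A), create(B)
{                        // A copied to the device once, B created device-only
  while (error > tolerance) {
#pragma acc kernels
    // ... stencil update of B from A ...

    error = 0;
#pragma acc kernels
    // ... error calculation, copy B back into A ...

    iteration++;
  }
}                        // A copied back to the host once, here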

FYI, I also get a “libcupti.so not found” error (I have export PGI_ACC_TIME=1 set).

libcupti is the device profiler library. NV_ACC_TIME (or the older PGI_ACC_TIME) will still work without it, but the timing will be measured from the host rather than the device.

To get libcupti, you’ll need to set the environment variable LD_LIBRARY_PATH to include the directory where it’s found. The exact directory will depend on which CUDA version you’re using.

For example, on my system which has a CUDA 12.2 driver, I’d set:

LD_LIBRARY_PATH=/opt/nv/Linux_x86_64/23.7/cuda/12.2/extras/CUPTI/lib64/

Adjust the path for your install and system.
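For instance, the full sequence would be something like this (the path is illustrative, and a.out is your executable):

% export LD_LIBRARY_PATH=/opt/nv/Linux_x86_64/23.7/cuda/12.2/extras/CUPTI/lib64/
% PGI_ACC_TIME=1 ./a.out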

Hope this helps,
Mat

Hi Mat,
Thanks for the reply. I did try this fix of adding a block. It removed the nvc++ compiler warning, but it is not speeding things up. It still takes the same time as the basic version of the OpenACC code without the #pragma acc data copy(A), create(B).

Hmm, seems OK to me. On my Skylake + V100, the host serial version takes 11 seconds, and the OpenACC version less than 1 second on the V100:

% nvc++ -Ofast jacobi_iteration_sequential.cc -acc -V23.7; a.out
 --- Iteration : 0 Error : 50 ----
 --- Iteration : 500 Error : 0.0706345 ----
 --- Iteration : 1000 Error : 0.0350626 ----
 --- Iteration : 1500 Error : 0.0232336 ----
 --- Iteration : 2000 Error : 0.0173339 ----
 --- Iteration : 2500 Error : 0.0138071 ----
 --- Iteration : 3000 Error : 0.0114576 ----
End of jacobi sequential function
 --- Iteration : 3428 Error : 0.00999902 ----
 ------ Sequential code success !! ------
Time elapsed : 11 seconds
 --- Iteration : 0 Error : 50 ----
 --- Iteration : 500 Error : 0.0706345 ----
 --- Iteration : 1000 Error : 0.0350626 ----
 --- Iteration : 1500 Error : 0.0232336 ----
 --- Iteration : 2000 Error : 0.0173339 ----
 --- Iteration : 2500 Error : 0.0138071 ----
 --- Iteration : 3000 Error : 0.0114576 ----
End of jacobi sequential function
 --- Iteration : 3428 Error : 0.00999902 ----
 ------ OpenACC advanced code success !! ------
Time elapsed : 0 seconds


This is my output. I didn’t add -Ofast to the compiler, but is that necessary?

Nothing changed even with nvc++ -Ofast jacobi_iteration_sequential.cc -acc -V23.7
I have the following installed:


(screenshot of the installed NVHPC and CUDA components)

I have an RTX 3070 GPU and AMD Ryzen 5 2600x Six-Core Processor.
Thanks for the help in advance.

No, -Ofast may help a bit but that’s not the problem. The loop at line 137 is running serially on the device.

Though given the second kernel is at line 105 in the copy you posted above, I’m assuming you’ve modified the code. If so, what did you do?

Try adding the flag “-Minfo=accel” to see the compiler feedback messages. This should give clues as to why the second kernel isn’t getting parallelized.

-Mat

jacobi_iteration_sequential.cc (5.3 KB)

CMakeLists.txt (324 Bytes)

I just added a basic OpenACC kernels version without the #pragma acc data copy directives.
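Roughly, a sketch of what I mean (loop bodies elided; the implicit-copy comments reflect what -Minfo reports):

while (error > tolerance) {
#pragma acc kernels   // implicit copyin(A), copyout(B) every iteration
  { /* stencil update of B from A */ }

  error = 0;
#pragma acc kernels   // implicit copy(A), copyin(B), copy(error) every iteration
  { /* error calculation, copy B back into A */ }

  iteration++;
}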
This is the output of make (screenshot attached).

Hmm, I can’t explain the disconnect, but the code parallelizes for me:

% nvc++ jacobi_iteration_sequential1.cc -acc -V23.7 -Minfo=accel
jacobi_iteration_openacc_basic():
     91, Generating implicit copyout(B[1:1000][1:2000]) [if not already present]
         Generating implicit copyin(A[:][:]) [if not already present]
     94, Loop is parallelizable
     95, Loop is parallelizable
         Generating NVIDIA GPU code
         94, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
         95,   /* blockIdx.x threadIdx.x auto-collapsed */
    103, Generating implicit copy(A[1:1000][1:2000]) [if not already present]
         Generating implicit copyin(B[1:1000][1:2000]) [if not already present]
         Generating implicit copy(error) [if not already present]
    105, Loop is parallelizable
    106, Loop is parallelizable
         Generating NVIDIA GPU code
        105, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
             Generating implicit reduction(max:error)
        106,   /* blockIdx.x threadIdx.x auto-collapsed */
jacobi_iteration_openacc_advanced():
    124, Generating copy(A[:][:]) [if not already present]
         Generating create(B[:][:]) [if not already present]
    128, Loop is parallelizable
    129, Loop is parallelizable
         Generating NVIDIA GPU code
        128, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
        129,   /* blockIdx.x threadIdx.x auto-collapsed */
    137, Generating implicit copy(error) [if not already present]
    139, Loop is parallelizable
    140, Loop is parallelizable
         Generating NVIDIA GPU code
        139, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
             Generating implicit reduction(max:error)
        140,   /* blockIdx.x threadIdx.x auto-collapsed */

The issue you’re seeing is due to the compiler not being able to add the implicit max reduction for “error”. You should be able to fix this by explicitly adding the reduction.

      // error calculation
      error = 0;
#pragma acc kernels loop collapse(2) reduction(max:error)
      for (size_t iter_x = 1; iter_x < width + 1; iter_x++) {
        for (size_t iter_y = 1; iter_y < height + 1; iter_y++) {
          error = fmax(fabs(B[iter_x][iter_y] - A[iter_x][iter_y]), error);
          A[iter_x][iter_y] = B[iter_x][iter_y];
        }
      }

Okay, it worked :). Thanks a ton. But my question is: why did it work for you without the max reduction?
I was actually planning to write another kernel with both the data copy and the max reduction, to compare it (fully_parallel_kernel) against the data-copy-only version (advanced_kernel).
I have also seen other GitHub repos with just the data copy and no max reduction; they all seem to report computation times of 1 second or less.

Again, I’m not sure. My only theory would be that if the system or C++ STL (we use the g++ STL in order to support interoperability with g++) presented “fmax” in a way the compiler doesn’t recognize, it would be treated as just another function call.

What OS are you using and what g++ version?

g++ version 11.4.0
OS Ubuntu 22.04

Thanks. I was able to reproduce the problem.

Looks to be a language-level issue with C++17. Try adding “-std=c++11” so the C version of fmax is used and the compiler can detect it.

Thank you very much for the explanation. Just to make it clear: with C++11 the correct fmax is used, and I could get the 1-second computation time even after removing the explicit reduction(max) clause.
But if I am using the default C++17, can I use std::max and get the same speed without the reduction(max) clause?

For this code, where the error loop is rewritten with std::max and std::abs (a sketch of what I tried is below):
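      // error loop using std::max / std::abs (sketch; needs <algorithm> and <cmath>)
      error = 0;
#pragma acc kernels
      for (size_t iter_x = 1; iter_x < width + 1; iter_x++) {
        for (size_t iter_y = 1; iter_y < height + 1; iter_y++) {
          error = std::max(std::abs(B[iter_x][iter_y] - A[iter_x][iter_y]), error);
          A[iter_x][iter_y] = B[iter_x][iter_y];
        }
      }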


With -std=c++17, I got compiler feedback clearly stating that the loop cannot be parallelized.
Is there a way that I can use std::max and std::abs and recreate the 1-second speed using just
#pragma acc kernels, and not #pragma acc kernels loop collapse(2) reduction(max: error)?

std::max is a function call, so it’s not something the compiler can recognize in order to implicitly create the reduction.

Keep in mind that implicit reduction is a feature of NVHPC and not part of the OpenACC standard. If you want to be standard compliant and be able to use other compilers, it may be better to explicitly add the reduction clause.

Got it!! Thanks a ton for all the help. I learned a lot.
All the best and have a nice day!!