I have the following code, taken from the book “OpenACC for Programmers”, for Jacobi iteration.
width and height are macros; A[width+2][height+2], B[width+2][height+2], and tolerance are global variables.
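For reference, the declarations are roughly along these lines (the sizes and tolerance shown here are only illustrative, not my exact settings):

#define width 1000                    // illustrative value
#define height 2000                   // illustrative value
double A[width + 2][height + 2];      // solution grid with a 1-cell halo
double B[width + 2][height + 2];      // scratch grid for updated values
double tolerance = 0.01;              // convergence threshold (example value)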
Then I get the following error:
NVC++-W-0155-Invalid accelerator data region: branching into or out of region is not allowed
Also, the code runs very slowly compared to the sequential version, which takes 18 seconds. According to the book, and from what I understood, the openacc_advanced function should take only about 1 second, but it is taking 1006 seconds.
I am attaching my CMakeLists.txt.
Full file
jacobi_iteration_sequential.cc (3.9 KB)
FYI, I also get a “libcupti.so not found” warning (I have export PGI_ACC_TIME=1 set).
Thanks for the help in advance.
Hi seshu.basava70
error NVC++-W-0155-Invalid accelerator data region: branching into or out of region is not allowed
Simple fix. Add a structured block for the data region so it’s not associated with the while loop. Something like:
#pragma acc data copy(A), create(B)
{                                   // <<< add
  while (error > tolerance) {
    ...
    iteration++;
  }
}                                   // <<< add
This should fix the performance problem as well, since the code will no longer copy the arrays at each compute region but only once, before the while loop.
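Put together, the advanced version would look roughly like this. It's only a sketch: the stencil body here is the usual 4-point Jacobi average, so keep your own loop bodies; the point is the placement of the data region around the whole while loop, with the two compute regions inside it.

#pragma acc data copy(A), create(B)
{ // structured block: the data region now spans the whole iteration loop
  while (error > tolerance) {
    // update step: compute B from A
    #pragma acc kernels
    for (size_t ix = 1; ix < width + 1; ix++)
      for (size_t iy = 1; iy < height + 1; iy++)
        B[ix][iy] = 0.25 * (A[ix - 1][iy] + A[ix + 1][iy] +
                            A[ix][iy - 1] + A[ix][iy + 1]);

    // error calculation and copy-back of B into A
    error = 0;
    #pragma acc kernels
    for (size_t ix = 1; ix < width + 1; ix++)
      for (size_t iy = 1; iy < height + 1; iy++) {
        error = fmax(fabs(B[ix][iy] - A[ix][iy]), error);
        A[ix][iy] = B[ix][iy];
      }

    iteration++;
  }
} // A is copied back to the host only once, here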
FYI, I also get a “libcupti.so not found” warning (I have export PGI_ACC_TIME=1 set).
libcupti is the device profiler library. NV_ACC_TIME (or the older PGI_ACC_TIME) will still work without it, but the timing will be measured from the host rather than from the device.
To get libcupti, you’ll need to set the environment variable LD_LIBRARY_PATH to include the directory where it’s found. The exact directory will depend on which CUDA version you’re using.
For example, on my system which has a CUDA 12.2 driver, I’d set:
LD_LIBRARY_PATH=/opt/nv/Linux_x86_64/23.7/cuda/12.2/extras/CUPTI/lib64/
Adjust the path for your install and system.
Hope this helps,
Mat
Hi Mat,
Thanks for the reply. I did try this fix of adding a block. This removed the nvc++ compiler warning, but it is not speeding things up. It still takes the same time as the basic version of the OpenACC code without the #pragma acc data copy(A), create(B).
Hmm, seems ok to me. On my Skylake + V100 system, host serial is 11 seconds, and the OpenACC version takes less than 1 second on the V100:
% nvc++ -Ofast jacobi_iteration_sequential.cc -acc -V23.7; a.out
--- Iteration : 0 Error : 50 ----
--- Iteration : 500 Error : 0.0706345 ----
--- Iteration : 1000 Error : 0.0350626 ----
--- Iteration : 1500 Error : 0.0232336 ----
--- Iteration : 2000 Error : 0.0173339 ----
--- Iteration : 2500 Error : 0.0138071 ----
--- Iteration : 3000 Error : 0.0114576 ----
End of jacobi sequential function
--- Iteration : 3428 Error : 0.00999902 ----
------ Sequential code success !! ------
Time elapsed : 11 seconds
--- Iteration : 0 Error : 50 ----
--- Iteration : 500 Error : 0.0706345 ----
--- Iteration : 1000 Error : 0.0350626 ----
--- Iteration : 1500 Error : 0.0232336 ----
--- Iteration : 2000 Error : 0.0173339 ----
--- Iteration : 2500 Error : 0.0138071 ----
--- Iteration : 3000 Error : 0.0114576 ----
End of jacobi sequential function
--- Iteration : 3428 Error : 0.00999902 ----
------ OpenACC advanced code success !! ------
Time elapsed : 0 seconds
This is my output. I didn’t add -Ofast to the compiler, but is that necessary?
Nothing changed even with nvc++ -Ofast jacobi_iteration_sequential.cc -acc -V23.7
I have the following installed:
I have an RTX 3070 GPU and AMD Ryzen 5 2600x Six-Core Processor.
Thanks for the help in advance.
No, -Ofast may help a bit but that’s not the problem. The loop at line 137 is running serially on the device.
Though, given that the second kernel is at line 105 in the copy you posted above, I’m assuming you’ve modified the code? If so, what did you do?
Try adding the flag “-Minfo=accel” to see the compiler feedback messages. This should give clues as to why the second kernel isn’t getting parallelized.
-Mat
jacobi_iteration_sequential.cc (5.3 KB)
CMakeLists.txt (324 Bytes)
I just added a basic OpenACC kernel without the #pragma acc data copy commands.
This is the output of make:
Hmm, I can’t explain the disconnect, but the code parallelizes for me:
% nvc++ jacobi_iteration_sequential1.cc -acc -V23.7 -Minfo=accel
jacobi_iteration_openacc_basic():
91, Generating implicit copyout(B[1:1000][1:2000]) [if not already present]
Generating implicit copyin(A[:][:]) [if not already present]
94, Loop is parallelizable
95, Loop is parallelizable
Generating NVIDIA GPU code
94, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
95, /* blockIdx.x threadIdx.x auto-collapsed */
103, Generating implicit copy(A[1:1000][1:2000]) [if not already present]
Generating implicit copyin(B[1:1000][1:2000]) [if not already present]
Generating implicit copy(error) [if not already present]
105, Loop is parallelizable
106, Loop is parallelizable
Generating NVIDIA GPU code
105, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
Generating implicit reduction(max:error)
106, /* blockIdx.x threadIdx.x auto-collapsed */
jacobi_iteration_openacc_advanced():
124, Generating copy(A[:][:]) [if not already present]
Generating create(B[:][:]) [if not already present]
128, Loop is parallelizable
129, Loop is parallelizable
Generating NVIDIA GPU code
128, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
129, /* blockIdx.x threadIdx.x auto-collapsed */
137, Generating implicit copy(error) [if not already present]
139, Loop is parallelizable
140, Loop is parallelizable
Generating NVIDIA GPU code
139, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
Generating implicit reduction(max:error)
140, /* blockIdx.x threadIdx.x auto-collapsed */
The issue you’re seeing is due to the compiler not being able to add the implicit max reduction for “error”. You should be able to fix this by explicitly adding the reduction.
// error calculation
error = 0;
#pragma acc kernels loop collapse(2) reduction(max:error)
for (size_t iter_x = 1; iter_x < width + 1; iter_x++) {
  for (size_t iter_y = 1; iter_y < height + 1; iter_y++) {
    error = fmax(fabs(B[iter_x][iter_y] - A[iter_x][iter_y]), error);
    A[iter_x][iter_y] = B[iter_x][iter_y];
  }
}
Okay, it worked :). Thanks a ton. But my question is: why did it work for you without the max reduction?
I was actually planning to make another kernel with data copy and max reduction, to compare that (fully parallel_kernel) against just data copy (advanced_kernel). I have seen other GitHub pages as well with just data copy and no max reduction. They all seem to report just 1 second or less of computation time.
Again, I’m not sure. My only theory would be that if the system or C++ STL (we use the g++ STL in order to support interoperability with g++) presents “fmax” in a way the compiler doesn’t recognize, then it would be treated as just another function call.
What OS are you using and what g++ version?
g++ version 11.4.0
OS Ubuntu 22.04
Thanks. I was able to reproduce the problem.
Looks to be a language-level issue with C++17. Try adding “-std=c++11” so the C version of fmax is used and the compiler can detect it.
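For example, something like this (adjust the file name and flags to your setup):

% nvc++ -Ofast -acc -Minfo=accel -std=c++11 jacobi_iteration_sequential.cc ; a.out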
Thank you very much for the explanation. Just to make it clear: C++11 uses the correct fmax, and I could get the 1-second computation time after removing the explicit max reduction from the kernel. But if I am using the default C++17, can I use std::max and get the same speed without the explicit max reduction in the kernel?
For this code, with -std=c++17, I got output clearly stating that it cannot be parallelized.
Is there a way that I can use std::max and std::abs and recreate the 1-second speed by using
#pragma acc kernels
and not
#pragma acc kernels loop collapse(2) reduction(max: error)
?
std::max is a function call, so it’s not something the compiler can recognize in order to implicitly create the reduction.
Keep in mind that implicit reduction is a feature of NVHPC and not part of the OpenACC standard. If you want to be standard compliant and be able to use other compilers, it may be better to explicitly add the reduction clause.
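If you do want to stay on the default C++17 and keep std::max and std::fabs, the explicit clause should let you do that. This is just a sketch of how I’d write it (it needs <algorithm> and <cmath>), not something I’ve timed on your code:

// error calculation with an explicit, standard-compliant reduction clause
error = 0;
#pragma acc kernels loop collapse(2) reduction(max:error)
for (size_t iter_x = 1; iter_x < width + 1; iter_x++) {
  for (size_t iter_y = 1; iter_y < height + 1; iter_y++) {
    error = std::max(std::fabs(B[iter_x][iter_y] - A[iter_x][iter_y]), error);
    A[iter_x][iter_y] = B[iter_x][iter_y];
  }
}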
Got it!! Thanks a ton for all the help. I learned a lot.
All the best and have a nice day!!