I have the following code, taken from the book “OpenACC for Programmers”, for Jacobi iteration.
width and height are macros; A[width+2][height+2], B[width+2][height+2], and tolerance are global variables.
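For reference, the declarations are roughly along these lines (the sizes and tolerance shown here are only illustrative, not my exact settings):

#define width 1000                    // illustrative value
#define height 2000                   // illustrative value
double A[width + 2][height + 2];      // solution grid with a 1-cell halo
double B[width + 2][height + 2];      // scratch grid for updated values
double tolerance = 0.01;              // convergence threshold (example value)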
Then I get the following error:
NVC++-W-0155-Invalid accelerator data region: branching into or out of region is not allowed
Also, the code runs very slowly compared to the sequential version, which takes 18 seconds. According to the book, and from what I understood, the openacc_advanced function should take only about 1 second, but it is taking 1006 seconds.
I am attaching my CMakeLists.txt.
Full file
jacobi_iteration_sequential.cc (3.9 KB)
FYI, I also get a “libcupti.so not found” warning (I have export PGI_ACC_TIME=1 set).
Thanks for the help in advance.
Hi seshu.basava70
error NVC++-W-0155-Invalid accelerator data region: branching into or out of region is not allowed
Simple fix. Add a structured block for the data region so it’s not associated with the while loop. Something like:
#pragma acc data copy(A), create(B)
{                                   // <<< add
  while (error > tolerance) {
    ...
    iteration++;
  }
}                                   // <<< add
This should fix the performance problem as well, since the code will no longer copy the arrays at each compute region but only once, before the while loop.
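Put together, the advanced version would look roughly like this. It's only a sketch: the stencil body here is the usual 4-point Jacobi average, so keep your own loop bodies; the point is the placement of the data region around the whole while loop, with the two compute regions inside it.

#pragma acc data copy(A), create(B)
{ // structured block: the data region now spans the whole iteration loop
  while (error > tolerance) {
    // update step: compute B from A
    #pragma acc kernels
    for (size_t ix = 1; ix < width + 1; ix++)
      for (size_t iy = 1; iy < height + 1; iy++)
        B[ix][iy] = 0.25 * (A[ix - 1][iy] + A[ix + 1][iy] +
                            A[ix][iy - 1] + A[ix][iy + 1]);

    // error calculation and copy-back of B into A
    error = 0;
    #pragma acc kernels
    for (size_t ix = 1; ix < width + 1; ix++)
      for (size_t iy = 1; iy < height + 1; iy++) {
        error = fmax(fabs(B[ix][iy] - A[ix][iy]), error);
        A[ix][iy] = B[ix][iy];
      }

    iteration++;
  }
} // A is copied back to the host only once, here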
FYI, I also get a “libcupti.so not found” warning (I have export PGI_ACC_TIME=1 set).
libcupti is the device profiler library. NV_ACC_TIME (or the older PGI_ACC_TIME) will still work without it, but the timing will be measured from the host rather than from the device.
To get libcupti, you’ll need to set the environment variable LD_LIBRARY_PATH to include the directory where it’s found. The exact directory will depend on which CUDA version you’re using.
For example, on my system which has a CUDA 12.2 driver, I’d set:
LD_LIBRARY_PATH=/opt/nv/Linux_x86_64/23.7/cuda/12.2/extras/CUPTI/lib64/
Adjust the path for your install and system.
Hope this helps,
Mat
Hi Mat,
Thanks for the reply. I did try this fix of adding a block. This removed the nvc++ compiler warning, but it is not speeding things up. It still takes the same time as the basic version of the OpenACC code without the #pragma acc data copy(A), create(B).
Hmm, seems ok to me. On my Skylake + V100 system, host serial is 11 seconds, and the OpenACC version takes less than 1 second on the V100:
% nvc++ -Ofast jacobi_iteration_sequential.cc -acc -V23.7; a.out
--- Iteration : 0 Error : 50 ----
--- Iteration : 500 Error : 0.0706345 ----
--- Iteration : 1000 Error : 0.0350626 ----
--- Iteration : 1500 Error : 0.0232336 ----
--- Iteration : 2000 Error : 0.0173339 ----
--- Iteration : 2500 Error : 0.0138071 ----
--- Iteration : 3000 Error : 0.0114576 ----
End of jacobi sequential function
--- Iteration : 3428 Error : 0.00999902 ----
------ Sequential code success !! ------
Time elapsed : 11 seconds
--- Iteration : 0 Error : 50 ----
--- Iteration : 500 Error : 0.0706345 ----
--- Iteration : 1000 Error : 0.0350626 ----
--- Iteration : 1500 Error : 0.0232336 ----
--- Iteration : 2000 Error : 0.0173339 ----
--- Iteration : 2500 Error : 0.0138071 ----
--- Iteration : 3000 Error : 0.0114576 ----
End of jacobi sequential function
--- Iteration : 3428 Error : 0.00999902 ----
------ OpenACC advanced code success !! ------
Time elapsed : 0 seconds
This is my output. I didn’t add -Ofast to the compiler, but is that necessary?
Nothing changed even with nvc++ -Ofast jacobi_iteration_sequential.cc -acc -V23.7
I have the following installed:
I have an RTX 3070 GPU and AMD Ryzen 5 2600x Six-Core Processor.
Thanks for the help in advance.
No, -Ofast may help a bit but that’s not the problem. The loop at line 137 is running serially on the device.
Though, given that the second kernel is at line 105 in the copy you posted above, I’m assuming you’ve modified the code? If so, what did you do?
Try adding the flag “-Minfo=accel” to see the compiler feedback messages. This should give clues as to why the second kernel isn’t getting parallelized.
-Mat
jacobi_iteration_sequential.cc (5.3 KB)
CMakeLists.txt (324 Bytes)
I just added a basic OpenACC kernel without the #pragma acc data copy commands.
This is the output of make:
Hmm, I can’t explain the disconnect, but the code parallelizes for me:
% nvc++ jacobi_iteration_sequential1.cc -acc -V23.7 -Minfo=accel
jacobi_iteration_openacc_basic():
91, Generating implicit copyout(B[1:1000][1:2000]) [if not already present]
Generating implicit copyin(A[:][:]) [if not already present]
94, Loop is parallelizable
95, Loop is parallelizable
Generating NVIDIA GPU code
94, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
95, /* blockIdx.x threadIdx.x auto-collapsed */
103, Generating implicit copy(A[1:1000][1:2000]) [if not already present]
Generating implicit copyin(B[1:1000][1:2000]) [if not already present]
Generating implicit copy(error) [if not already present]
105, Loop is parallelizable
106, Loop is parallelizable
Generating NVIDIA GPU code
105, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
Generating implicit reduction(max:error)
106, /* blockIdx.x threadIdx.x auto-collapsed */
jacobi_iteration_openacc_advanced():
124, Generating copy(A[:][:]) [if not already present]
Generating create(B[:][:]) [if not already present]
128, Loop is parallelizable
129, Loop is parallelizable
Generating NVIDIA GPU code
128, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
129, /* blockIdx.x threadIdx.x auto-collapsed */
137, Generating implicit copy(error) [if not already present]
139, Loop is parallelizable
140, Loop is parallelizable
Generating NVIDIA GPU code
139, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
Generating implicit reduction(max:error)
140, /* blockIdx.x threadIdx.x auto-collapsed */
The issue you’re seeing is due to the compiler not being able to add the implicit max reduction for “error”. You should be able to fix this by explicitly adding the reduction.
// error calculation
error = 0;
#pragma acc kernels loop collapse(2) reduction(max:error)
for (size_t iter_x = 1; iter_x < width + 1; iter_x++) {
  for (size_t iter_y = 1; iter_y < height + 1; iter_y++) {
    error = fmax(fabs(B[iter_x][iter_y] - A[iter_x][iter_y]), error);
    A[iter_x][iter_y] = B[iter_x][iter_y];
  }
}
Okay, it worked :). Thanks a ton. But my question is: why did it work for you without the max reduction?
I was actually planning to make another kernel with data copy and max reduction, to compare that (fully parallel_kernel) against just data copy (advanced_kernel). I have seen other GitHub pages as well with just data copy and no max reduction. They all seem to report just 1 second or less of computation time.
Again, I’m not sure. My only theory would be that if the system or C++ STL (we use the g++ STL in order to support interoperability with g++) presents “fmax” in a way the compiler doesn’t recognize, then it would be treated as just another function call.
What OS are you using and what g++ version?
g++ version 11.4.0
OS Ubuntu 22.04
Thanks. I was able to reproduce the problem.
Looks to be a language-level issue with C++17. Try adding “-std=c++11” so the C version of fmax is used and the compiler can detect it.
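For example, something like this (adjust the file name and flags to your setup):

% nvc++ -Ofast -acc -Minfo=accel -std=c++11 jacobi_iteration_sequential.cc ; a.out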
Thank you very much for the explanation. Just to make it clear: C++11 uses the correct fmax, and I could get the 1-second computation time after removing the explicit max reduction from the kernel. But if I am using the default C++17, can I use std::max and get the same speed without the explicit max reduction in the kernel?
For this code, with -std=c++17, I got output clearly stating that it cannot be parallelized.
Is there a way that I can use std::max and std::abs and recreate the 1-second speed by using
#pragma acc kernels
and not
#pragma acc kernels loop collapse(2) reduction(max: error)
?
std::max is a function call, so it’s not something the compiler can recognize in order to implicitly create the reduction.
Keep in mind that implicit reduction is a feature of NVHPC and not part of the OpenACC standard. If you want to be standard compliant and be able to use other compilers, it may be better to explicitly add the reduction clause.
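If you do want to stay on the default C++17 and keep std::max and std::fabs, the explicit clause should let you do that. This is just a sketch of how I’d write it (it needs <algorithm> and <cmath>), not something I’ve timed on your code:

// error calculation with an explicit, standard-compliant reduction clause
error = 0;
#pragma acc kernels loop collapse(2) reduction(max:error)
for (size_t iter_x = 1; iter_x < width + 1; iter_x++) {
  for (size_t iter_y = 1; iter_y < height + 1; iter_y++) {
    error = std::max(std::fabs(B[iter_x][iter_y] - A[iter_x][iter_y]), error);
    A[iter_x][iter_y] = B[iter_x][iter_y];
  }
}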
Got it!! Thanks a ton for all the help. I learned a lot.
All the best and have a nice day!!