The program fails on GPU

I have code in source.cpp. It works properly on the CPU, but on the GPU it fails at iteration #23 (1st column in the output file). The difference between the CPU and GPU output begins at iteration #14 (last value). I use the following compile line for the CPU:

cmake . -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc
-DCMAKE_CXX_FLAGS="-march=native -mtune=native -O3 -ipo16 -mcmodel=large
-fopenmp -g" -DCMAKE_CXX_STANDARD=17 -DACC=OFF -DCUDA=OFF

and for GPU:

cmake . -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++
-DCMAKE_CXX_FLAGS="-acc=gpu -Minfo=acc -tp=haswell -Minline
-mcmodel=medium -cuda -gpu=cc70" -DCMAKE_CXX_STANDARD=17 -DACC=ON -DCUDA=ON

The CPU is an Intel KNL with a Titan V GPU installed in the same system. The version of nvc++ is 21.9-0, 64-bit.
The correct output for the first 30 iterations on the CPU is in the file output.dat.
Please help me fix this issue.
SOURCE.zip (10.1 KB)

Hi Andrey,

The error I'm seeing is that the code hangs while performing iteration 23 in the compute region starting at line 1109. The easiest solution is to remove the two inner "loop vector" directives (lines 1126 and 1158), leaving the "vector" clause on the outer loop.
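For readers without the attachment, here is a minimal sketch of the change. The function, array names, loop bounds, and data clauses are all hypothetical, since the real code is only in source.cpp:

// The "vector" clause stays on the outer gang loop; the two inner
// triangular loops that previously carried "#pragma acc loop vector"
// now run sequentially within each thread.
void update(double *a, const double *b, int n)
{
    #pragma acc parallel loop gang vector copy(a[0:n*n]) copyin(b[0:n])
    for (int i = 0; i < n; ++i) {
        double cond = 0.0;
        for (int k = 0; k < 4; ++k) {         // sequential middle loop
            // was: #pragma acc loop vector (line 1126) -- removed
            for (int j = 0; j <= i; ++j)      // triangular bound
                cond += b[j] * a[i*n + j];
            if (cond > 0.0) {
                // was: #pragma acc loop vector (line 1158) -- removed
                for (int j = 0; j <= i; ++j)  // triangular bound
                    a[i*n + j] += cond * b[k];
            }
        }
    }
}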

My best guess as to what's happening is that not all threads are hitting one of the implicit barriers, so the code stalls. Most likely this is because the second vector loop sits inside an if statement whose condition is set in the other vector loop. The sequential loop between the outer gang and inner vector loops, the triangular vector loops, and the many shared gang variables appear to force the compiler to insert lots of implicit barriers, so you may be better off performance-wise just parallelizing the outer loop anyway.
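Roughly, the hazardous pattern looks like this (again a hypothetical reconstruction, not the actual code): each vector loop ends in a compiler-inserted barrier that every thread in the gang must reach, and a branch that depends on the first loop's result can keep some threads from ever arriving at the second loop's barrier.

void update_hanging(double *a, const double *b, int n)
{
    #pragma acc parallel loop gang copy(a[0:n*n]) copyin(b[0:n])
    for (int i = 0; i < n; ++i) {
        double cond = 0.0;                  // gang-shared variable
        for (int k = 0; k < 4; ++k) {       // sequential loop inside the gang
            #pragma acc loop vector reduction(+:cond)
            for (int j = 0; j <= i; ++j)    // first triangular vector loop
                cond += b[j] * a[i*n + j];  // sets the branch condition
            // implicit barrier after the vector loop
            if (cond > 0.0) {               // condition computed just above
                #pragma acc loop vector     // second vector loop, inside the if
                for (int j = 0; j <= i; ++j)
                    a[i*n + j] += cond * b[k];
                // implicit barrier here -- the one not all threads may reach
            }
        }
    }
}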

-Mat

Thank you for the answer!