The output array is filled on CPU but not on GPU

My code is in
MinRepEx.zip (22.4 KB).
The main goal of this code is to fill the int array PKA_Tls[0:100]. The array is defined in body/T3AllocateData.h on line 61 and filled in tpt/include/T3ElasticEMIonIonImpl.h on line 117 using atomic operations.
The output is written to a file in main.cpp on lines 39-45.
The code works properly on CPU; the correct output file is in the project directory and is named pka_tpt3.dat. Its third column consists of non-zero values. On GPU, however, the third column is all zeros.
My machine has an Intel KNL CPU and a Titan V GPU.

I configure the build with

cmake . -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc
-DCMAKE_CXX_FLAGS="-march=native -mtune=native -O3 -ipo16
-mcmodel=large -fopenmp" -GNinja -DCMAKE_CXX_STANDARD=17
-DACC=OFF -DCUDA=OFF

on CPU and

cmake . -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++
-DCMAKE_CXX_FLAGS="-acc=gpu -Minfo=acc -tp=haswell -Minline
-mcmodel=medium -cuda -gpu=cc70" -GNinja -DCMAKE_CXX_STANDARD=17
-DACC=ON -DCUDA=ON

on GPU.
I also have g++ 10.2.1, NVIDIA driver version 470.57.02, and CUDA version 11.4 installed.
Could you please help me fix this issue?

Hi Andry,

The problem is with the “PKA_Tls” array. Since it’s a global array accessed directly from an OpenACC routine, it needs to be placed in a “declare” directive. Below are my diffs of your source with the corrections:

main.cpp:

--- main.org.cpp        2022-04-01 11:40:14.499537408 -0700
+++ main.cpp    2022-04-01 11:42:23.554083788 -0700
@@ -17,8 +17,7 @@
   MAX_ELEMENT=0;
 #ifdef OPENACC
 #pragma acc data create(ind01,ind23,arr1,arr2,arr3,csBorderDataFS) \
-  copyin(particles,d,ElasticEMIonIonProcess) \
-  copy(PKA_Tls[0:PKA_BINS_NUMBER])
+  copyin(particles,d,ElasticEMIonIonProcess)
   {
 #endif
     for(unsigned int step=1; step<2; ++step)
@@ -28,6 +27,7 @@
     }
 #ifdef OPENACC
   }
+#pragma acc update self(PKA_Tls[0:PKA_BINS_NUMBER])
 #endif

   auto end=std::chrono::steady_clock::now();

./body/include/T3AllocateData.h:

--- ./body/include/T3AllocateData.org.h 2022-04-01 11:41:00.640451776 -0700
+++ ./body/include/T3AllocateData.h     2022-04-01 11:41:38.565199929 -0700
@@ -59,6 +59,7 @@
   const int PKA_BINS_NUMBER = 100;
   const double PKA_Delta_Tls = (PKA_TlsMax-PKA_TlsMin)/PKA_BINS_NUMBER;
   int PKA_Tls[PKA_BINS_NUMBER]{0};
+#pragma acc declare copyin(PKA_BINS_NUMBER,PKA_Tls[:PKA_BINS_NUMBER])
 #endif//nSi
 }//end of namespace data.
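
Stripped of the project details, the pattern in the diffs above is roughly the following. This is only a minimal standalone sketch, not your code: hist, BINS, count_hit and fill_histogram are made-up names standing in for PKA_Tls, PKA_BINS_NUMBER and your routines, and I'm assuming a fixed-size global array. If OpenACC is not enabled, the pragmas are ignored and it simply runs on the host.

```cpp
// Sketch only: hypothetical names, not the project's code.
constexpr int BINS = 4;   // compile-time constant, stand-in for PKA_BINS_NUMBER
int hist[BINS] = {0};     // global array, stand-in for PKA_Tls

// Create a device copy of the global that device routines can reference by name.
#pragma acc declare copyin(hist)

// Device routine that names the global directly (it is not passed as an argument).
#pragma acc routine seq
void count_hit(int bin)
{
#pragma acc atomic update
    hist[bin] += 1;
}

// Record n hits on the device, then refresh the host copy of hist.
void fill_histogram(int n)
{
#pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        count_hit(i % BINS);

    // A declared global is not copied back automatically, so update the host explicitly.
#pragma acc update self(hist)
}
```

Calling fill_histogram(10) from main should leave the host copy of hist holding the counts. If the update self line is dropped, I would expect the GPU build to leave the host copy at zero, because only the device copy was incremented.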

I couldn’t get your cmake to work since I don’t have ninja installed, so instead I just compiled by hand:

% nvc++ main.cpp body/src/T3MaterialTable.cpp body/src/T3ParticleTable.cpp -Ibody/include -Itpt/include -mcmodel=medium -acc=gpu -fast -DOPENACC=1 -DCUDA=1 -w ; a.out
main.cpp:
body/src/T3MaterialTable.cpp:
body/src/T3ParticleTable.cpp:
time=945 ms, G=0, K=1000000, Ntop=0, SumDG=0
% diff pka_tpt3.dat pka_tpt3.ref.dat
%

Hope this helps,
Mat

Thank you for the answer. There is one thing I don't understand: the similar global array particles works properly without an acc declare copyin clause, but for PKA_Tls the clause is necessary. Why?

The code passes particles as an argument to the GetFS routines. PKA_BINS_NUMBER and PKA_Tls, on the other hand, are global variables accessed directly inside the routine, and therefore must be included in a declare directive.
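
As a rough illustration of the difference, here is a standalone sketch with made-up names (via_arg, via_name, bump_arg, bump_name, run_both); none of it is taken from your project. Both arrays are globals, but only the one that a routine refers to by name needs the declare directive. The other is referenced in the compute region itself, where the compiler can pass the device copy from the data clause into the routine.

```cpp
// Sketch only: hypothetical names, not the project's code.
constexpr int N = 8;

int via_arg[N]  = {0};   // global, but reaches the routine only as an argument
int via_name[N] = {0};   // global, named directly inside a device routine
#pragma acc declare copyin(via_name)

// Works on whatever pointer it is handed; it never names a global itself.
#pragma acc routine seq
void bump_arg(int* data, int i) { data[i] += 1; }

// Names the global directly, so a device copy of that symbol must already exist.
#pragma acc routine seq
void bump_name(int i) { via_name[i] += 1; }

void run_both()
{
    // The data clause manages via_arg; it is copied back at the closing brace.
#pragma acc data copy(via_arg[0:N])
    {
#pragma acc parallel loop
        for (int i = 0; i < N; ++i) {
            bump_arg(via_arg, i);   // device copy is passed in as the argument
            bump_name(i);           // resolved by name inside the routine
        }
    }
    // The declared global has to be brought back to the host explicitly.
#pragma acc update self(via_name)
}
```

After run_both() returns, every element of both arrays should be 1 on the host. Removing the declare line is the analogue of your original problem: bump_name would then have no device copy of via_name to write to.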
