Problem with simple loop structure

Hi,

I’m trying to accelerate my program for an NVIDIA GPU with compute capability 6.0.

In my CPU version I used OpenMP to accelerate the code.
The output of this code is the square array z_.
Every thread uses a private tmp array, and all tmp arrays are later reduced into the final z_.


#pragma omp parallel private(tmp)
{ 
      #pragma omp for
      for(int jz = 0; jz < z; jz++) 
      {
         for(int jy = 0; jy < y; jy++) 
         {
           for(int js = 0; js < s; js++)
           {
             for(int jk = 0; jk < s; jk++)
             {
               tmp[js+jk*s] = 0.0;
               for(int jx = 0; jx < x; jx++)
               {
                   int j = (jx+stmx) + mx*(jy+stm) + mxy*(jz+stm);
                   tmp[js+jk*s]+=x[j+m*jk]*y[j+m*js]; 
               }
             }
           }
         }
      }

   
      for(int jk = 0; jk < s; jk++)
      {
           for(int js = 0; js < s; js++)
           {
               #pragma omp atomic
              z_[js+jk*s] += tmp[js+jk*s];
           }
      }
}

My first try with OpenACC resulted in the following code; I simply replaced the OpenMP directives with OpenACC directives.
I checked the compiler output, which shows that exactly the loop I specified is parallelized; however, the output of the program is not correct.
I think there is a problem with the tmp array and the final reduction, but I’m not able to track it down.

What is the problem with this code?


#pragma acc data copyin(x[0:s*m],y[0:s*m]) copy(z_[s:s])
#pragma acc kernels 
{ 
      #pragma acc loop independent private(tmp)
      for(int jz = 0; jz < z; jz++) 
      {
         for(int jy = 0; jy < y; jy++) 
         {
           for(int js = 0; js < s; js++)
           {
             for(int jk = 0; jk < s; jk++)
             {
               tmp[js+jk*s] = 0.0;
               for(int jx = 0; jx < x; jx++)
               {
                   int j = (jx+stmx) + mx*(jy+stm) + mxy*(jz+stm);
                   tmp[js+jk*s]+=x[j+m*jk]*y[j+m*js]; 
               }
             }
           }
         }
      }

   
      for(int jk = 0; jk < s; jk++)
      {
        for(int js = 0; js < s; js++)
        {
              #pragma acc atomic update
              z_[js+jk*s] += tmp[js+jk*s];
        }
      }
}



56, Generating copyin(x[:s*m],y[:s*m])
         Generating copy(z_[s:s])
     57, Generating implicit copyin(tmp[:])
     61, Loop is parallelizable
     63, Loop carried reuse of tmp prevents parallelization
     65, Loop carried dependence of tmp prevents parallelization
         Loop carried dependence of tmp prevents vectorization
         Loop carried backward dependence of tmp prevents vectorization
     67, Complex loop carried dependence of tmp prevents parallelization
         Loop carried dependence of tmp prevents parallelization
         Loop carried backward dependence of tmp prevents vectorization
         Accelerator kernel generated
         Generating Tesla code
         61, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         63, #pragma acc loop seq
         65, #pragma acc loop seq
         67, #pragma acc loop seq
         70, #pragma acc loop seq
     70, Complex loop carried dependence of tmp prevents parallelization
         Loop carried reuse of tmp prevents parallelization
         Inner sequential loop scheduled on accelerator
     81, Loop carried dependence of z_ prevents parallelization
         Loop carried backward dependence of z_ prevents vectorization
     83, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         81, #pragma acc loop seq
         83, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Thank you for your help!

Hi Peter85,

A private variable is only available within the loop on which the clause appears. Hence the private “tmp” array in the first loop nest will not be used in the second loop nest.
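A minimal sketch of the scoping rule (generic names, not taken from your code):

#pragma acc kernels
{
   #pragma acc loop independent private(tmp)
   for(int i = 0; i < n; i++)
   {
      tmp[0] = 1.0;     /* each iteration gets its own private copy of tmp */
   }

   #pragma acc loop independent
   for(int i = 0; i < n; i++)
   {
      out[i] += tmp[0]; /* this is the original tmp again, not the private
                           copies from the loop above; their contents are gone */
   }
}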

Is this the real snippet of your code or just an example that you wrote?

I ask because it doesn’t seem correct. It seems to me that the bottom js and jk loops should be enclosed in the jz and jy loops; otherwise each jz and jy iteration will overwrite the values in tmp before they are written to “z_”.
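
Something along these lines is what I have in mind (a rough, untested sketch on top of your posted loops, keeping your variable names as posted):

#pragma acc kernels
{
   #pragma acc loop independent private(tmp)
   for(int jz = 0; jz < z; jz++)
   {
      for(int jy = 0; jy < y; jy++)
      {
         for(int js = 0; js < s; js++)
         {
            for(int jk = 0; jk < s; jk++)
            {
               tmp[js+jk*s] = 0.0;
               for(int jx = 0; jx < x; jx++)
               {
                  int j = (jx+stmx) + mx*(jy+stm) + mxy*(jz+stm);
                  tmp[js+jk*s] += x[j+m*jk]*y[j+m*js];
               }
            }
         }

         /* fold this (jz,jy) block into z_ while the private tmp still holds it */
         for(int jk = 0; jk < s; jk++)
         {
            for(int js = 0; js < s; js++)
            {
               #pragma acc atomic update
               z_[js+jk*s] += tmp[js+jk*s];
            }
         }
      }
   }
}

That way each private tmp only needs to hold the partial sums of a single (jz,jy) iteration before they are accumulated into z_.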

-Mat

Hi mkcolg,

Thank you for your answer!
Yes, you are right, the values are overwritten! In the original OpenMP code the array was set to 0 before the nested loop structure was entered; I introduced this bug while experimenting with OpenACC.
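
For reference, the original OpenMP version looked roughly like this (reconstructed from memory, so treat it as a sketch):

#pragma omp parallel private(tmp)
{
      /* zero the per-thread scratch array once, before the loop nest */
      for(int jk = 0; jk < s*s; jk++)
         tmp[jk] = 0.0;

      #pragma omp for
      for(int jz = 0; jz < z; jz++)
         for(int jy = 0; jy < y; jy++)
            for(int js = 0; js < s; js++)
               for(int jk = 0; jk < s; jk++)
                  for(int jx = 0; jx < x; jx++)
                  {
                     int j = (jx+stmx) + mx*(jy+stm) + mxy*(jz+stm);
                     tmp[js+jk*s] += x[j+m*jk]*y[j+m*js];
                  }

      /* each thread folds its private partial sums into z_ once at the end */
      for(int jk = 0; jk < s; jk++)
         for(int js = 0; js < s; js++)
         {
            #pragma omp atomic
            z_[js+jk*s] += tmp[js+jk*s];
         }
}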

Following your explanation, do I need to propagate the “tmp” array down to the deepest loop?

The array “tmp” is very small (144 elements, i.e. 144*sizeof(double) bytes); I was hoping that OpenACC would place it into the shared memory of the GPU.

#pragma acc data copyin(x[0:s*m],y[0:s*m]) copy(z_[s:s])
 #pragma acc kernels 
 { 

      #pragma acc loop independent private(tmp)
      for(int jk = 0; jk < s*s; jk++) 
      {
          tmp[jk] = 0.0;
      }

      #pragma acc loop independent private(tmp)
      for(int jz = 0; jz < z; jz++) 
      {
         #pragma acc loop private(tmp)
         for(int jy = 0; jy < y; jy++) 
         {
           #pragma acc loop private(tmp)
           for(int js = 0; js < s; js++)
           {
             #pragma acc loop private(tmp)
             for(int jk = 0; jk < s; jk++)
             {
               #pragma acc loop private(tmp)
               for(int jx = 0; jx < x; jx++)
               {
                   int j = (jx+stmx) + mx*(jy+stm) + mxy*(jz+stm);
                   tmp[js+jk*s]+=x[j+m*jk]*y[j+m*js]; 
               }
             }
           }
         }
      }
      for(int jk = 0; jk < s; jk++)
      {
        for(int js = 0; js < s; js++)
        {
         #pragma acc atomic update
         z_[js+jk*s] += tmp[js+jk*s];
        }
      }
}