Problem with simple loop structure

Hi,

I’m trying to accelerate my program for an NVIDIA GPU with compute capability 6.0.

In my CPU version I used OpenMP to accelerate the code.
The output of this code is the square array z_.
Every thread uses a private tmp array, and all tmp arrays are later reduced into the final z_.


#pragma omp parallel private(tmp)
{ 
      #pragma omp for
      for(int jz = 0; jz < z; jz++) 
      {
         for(int jy = 0; jy < y; jy++) 
         {
           for(int js = 0; js < s; js++)
           {
             for(int jk = 0; jk < s; jk++)
             {
               tmp[js+jk*s] = 0.0;
               for(int jx = 0; jx < x; jx++)
               {
                   int j = (jx+stmx) + mx*(jy+stm) + mxy*(jz+stm);
                   tmp[js+jk*s]+=x[j+m*jk]*y[j+m*js]; 
               }
             }
           }
         }
      }

   
      for(int jk = 0; jk < s; jk++)
      {
           for(int js = 0; js < s; js++)
           {
               #pragma omp atomic
              z_[js+jk*s] += tmp[js+jk*s];
           }
      }
}

My first try with OpenACC resulted in the following code; I simply replaced the OpenMP directives with OpenACC directives.
I checked the compiler output, which shows that exactly the loop I specified is parallelized; however, the output of the program is not correct.
I think there is a problem with the tmp array and the final reduction, but I’m not able to track it down.

What is the problem with this code?


#pragma acc data copyin(x[0:s*m],y[0:s*m]) copy(z_[s:s])
#pragma acc kernels 
{ 
      #pragma acc loop independent private(tmp)
      for(int jz = 0; jz < z; jz++) 
      {
         for(int jy = 0; jy < y; jy++) 
         {
           for(int js = 0; js < s; js++)
           {
             for(int jk = 0; jk < s; jk++)
             {
               tmp[js+jk*s] = 0.0;
               for(int jx = 0; jx < x; jx++)
               {
                   int j = (jx+stmx) + mx*(jy+stm) + mxy*(jz+stm);
                   tmp[js+jk*s]+=x[j+m*jk]*y[j+m*js]; 
               }
             }
           }
         }
      }

   
      for(int jk = 0; jk < s; jk++)
      {
        for(int js = 0; js < s; js++)
        {
              #pragma acc atomic update
              z_[js+jk*s] += tmp[js+jk*s];
        }
      }
}



56, Generating copyin(x[:s*m],y[:s*m])
         Generating copy(z_[s:s])
     57, Generating implicit copyin(tmp[:])
     61, Loop is parallelizable
     63, Loop carried reuse of tmp prevents parallelization
     65, Loop carried dependence of tmp prevents parallelization
         Loop carried dependence of tmp prevents vectorization
         Loop carried backward dependence of tmp prevents vectorization
     67, Complex loop carried dependence of tmp prevents parallelization
         Loop carried dependence of tmp prevents parallelization
         Loop carried backward dependence of tmp prevents vectorization
         Accelerator kernel generated
         Generating Tesla code
         61, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         63, #pragma acc loop seq
         65, #pragma acc loop seq
         67, #pragma acc loop seq
         70, #pragma acc loop seq
     70, Complex loop carried dependence of tmp prevents parallelization
         Loop carried reuse of tmp prevents parallelization
         Inner sequential loop scheduled on accelerator
     81, Loop carried dependence of z_ prevents parallelization
         Loop carried backward dependence of z_ prevents vectorization
     83, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         81, #pragma acc loop seq
         83, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Thank you for your help!

Hi Peter85,

A private variable is only available within the loop on which the clause appears. Hence the private “tmp” array in the first loop nest will not be used in the second loop nest.
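A minimal sketch of the scoping rule (generic names, not taken from your code):

#pragma acc kernels
{
   #pragma acc loop independent private(tmp)
   for(int i = 0; i < n; i++)
   {
      tmp[0] = 1.0;     /* each iteration gets its own private copy of tmp */
   }

   #pragma acc loop independent
   for(int i = 0; i < n; i++)
   {
      out[i] += tmp[0]; /* this is the original tmp again, not the private
                           copies from the loop above; their contents are gone */
   }
}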

Is this the real snippet of your code or just an example that you wrote?

I ask because it doesn’t seem correct. It seems to me that the bottom js and jk loops should be enclosed in the jz and jy loops; otherwise each jz and jy iteration will overwrite the values in tmp before they are written to “z_”.
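
Something along these lines is what I have in mind (a rough, untested sketch on top of your posted loops, keeping your variable names as posted):

#pragma acc kernels
{
   #pragma acc loop independent private(tmp)
   for(int jz = 0; jz < z; jz++)
   {
      for(int jy = 0; jy < y; jy++)
      {
         for(int js = 0; js < s; js++)
         {
            for(int jk = 0; jk < s; jk++)
            {
               tmp[js+jk*s] = 0.0;
               for(int jx = 0; jx < x; jx++)
               {
                  int j = (jx+stmx) + mx*(jy+stm) + mxy*(jz+stm);
                  tmp[js+jk*s] += x[j+m*jk]*y[j+m*js];
               }
            }
         }

         /* fold this (jz,jy) block into z_ while the private tmp still holds it */
         for(int jk = 0; jk < s; jk++)
         {
            for(int js = 0; js < s; js++)
            {
               #pragma acc atomic update
               z_[js+jk*s] += tmp[js+jk*s];
            }
         }
      }
   }
}

That way each private tmp only needs to hold the partial sums of a single (jz,jy) iteration before they are accumulated into z_.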

-Mat

Hi mkcolg,

Thank you for your answer!
Yes, you are right, the values are overwritten! In the original OpenMP code the array was set to 0 before the nested loop structure was entered; I introduced this bug while experimenting with OpenACC.
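
For reference, the original OpenMP version looked roughly like this (reconstructed from memory, so treat it as a sketch):

#pragma omp parallel private(tmp)
{
      /* zero the per-thread scratch array once, before the loop nest */
      for(int jk = 0; jk < s*s; jk++)
         tmp[jk] = 0.0;

      #pragma omp for
      for(int jz = 0; jz < z; jz++)
         for(int jy = 0; jy < y; jy++)
            for(int js = 0; js < s; js++)
               for(int jk = 0; jk < s; jk++)
                  for(int jx = 0; jx < x; jx++)
                  {
                     int j = (jx+stmx) + mx*(jy+stm) + mxy*(jz+stm);
                     tmp[js+jk*s] += x[j+m*jk]*y[j+m*js];
                  }

      /* each thread folds its private partial sums into z_ once at the end */
      for(int jk = 0; jk < s; jk++)
         for(int js = 0; js < s; js++)
         {
            #pragma omp atomic
            z_[js+jk*s] += tmp[js+jk*s];
         }
}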

Following your explanation, do I need to propagate the “tmp” array down to the deepest loop?

The array “tmp” is very small (144 elements, i.e. 144*sizeof(double) bytes); I was hoping that OpenACC would place it into the shared memory of the GPU.

#pragma acc data copyin(x[0:s*m],y[0:s*m]) copy(z_[s:s])
 #pragma acc kernels 
 { 

      #pragma acc loop independent private(tmp)
      for(int jk = 0; jk < s*s; jk++) 
      {
          tmp[jk] = 0.0;
      }

      #pragma acc loop independent private(tmp)
      for(int jz = 0; jz < z; jz++) 
      {
         #pragma acc loop private(tmp)
         for(int jy = 0; jy < y; jy++) 
         {
           #pragma acc loop private(tmp)
           for(int js = 0; js < s; js++)
           {
             #pragma acc loop private(tmp)
             for(int jk = 0; jk < s; jk++)
             {
               #pragma acc loop private(tmp)
               for(int jx = 0; jx < x; jx++)
               {
                   int j = (jx+stmx) + mx*(jy+stm) + mxy*(jz+stm);
                   tmp[js+jk*s]+=x[j+m*jk]*y[j+m*js]; 
               }
             }
           }
         }
      }
      for(int jk = 0; jk < s; jk++)
      {
        for(int js = 0; js < s; js++)
        {
         #pragma acc atomic update
         z_[js+jk*s] += tmp[js+jk*s];
        }
      }
}