Need advice for OpenACC directives

Hi all,
How can I make this portion of my C code run on the GPU using OpenACC?
Here is the code.

#pragma acc kernels loop
for (i=1; i<=n1; i++)
{
    for (j=n[i][1]+1; j<=n[i][1]+n[i][2]-1; j++)
    {
        if (k1[j] != 0)
        {
            i1 = k[j];
            i0 = 0;
            for (l=1; l<=6; l++)
            {
                for (l1=1; l1<=6; l1++)
                {
                    i0++;
                    r[i][l] += -a[j][i0]*z[i1][l1];
                }
            }
        }
    }
    for (j=n[i][1]+1; j<=n[i][1]+n[i][2]-1; j++)
    {
        if (k1[j] != 0)
        {
            i1 = k[j];
            for (l=1; l<=6; l++)
            {
                i0 = l;
                for (l1=1; l1<=6; l1++)
                {
                    r[i1][l] += -a[j][i0]*z[i][l1];
                    i0 += 6;
                }
            }
        }
    }
    i0 = 0;
    for (j=1; j<=6; j++)
    {
        s[j] = 0;
        for (l=1; l<=6; l++)
        {
            i0++;
            s[j] += v[i][i0]*r[i][l];
        }
        a1 = s[j];
        e1 += fabs(a1-z[i][j]);
        e2 += fabs(a1);
        z[i][j] = a1;
    }
}

When I try to compile it, I get the following messages from PGI.

PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages)
: Could not find allocated-variable index for symbol (d:\pgi\dfgpu.c: 2886)
di20:
   2887, Loop carried dependence due to exposed use of r[1:n1][1:6] prevents parallelization
         Loop carried dependence of a->->,r->->,s,v->->,z->-> prevents parallelization
         Complex loop carried dependence of a->->,r->->,s,v->->,z->-> prevents parallelization
         Loop carried backward dependence of a->->,r->->,s,v->->,z->-> prevents vectorization
         Accelerator restriction: scalar variable live-out from loop: a1,e1,e2,i,i0,i1,j,l,l1,r->->,s,z->->
         Scalar last value needed after loop for e1 at line 2962
         Scalar last value needed after loop for e2 at line 2962
   2896, Loop carried dependence due to exposed use of r[i1+1][1:i1+6] prevents parallelization
         Complex loop carried dependence of a->->,r->->,z->-> prevents parallelization
         Accelerator restriction: induction variable live-out from loop: i
         Accelerator restriction: scalar variable live-out from loop: i0,i1,j,l,l1,r->->
         Accelerator restriction: size of the GPU copy of k,k1 is unknown
   2901, Accelerator restriction: induction variable live-out from loop: j
   2906, Accelerator restriction: induction variable live-out from loop: j
   2908, Complex loop carried dependence of a->->,r->->,z->-> prevents parallelization
         Accelerator restriction: scalar variable live-out from loop: i0,l,l1,r->->
   2910, Loop carried dependence due to exposed use of r[i1+1][i3+1] prevents parallelization
         Complex loop carried dependence of a->->,r->->,z->-> prevents parallelization
         Accelerator restriction: scalar variable live-out from loop: i0,l1,r->->
         Accelerator restriction: size of the GPU copy of a,z is unknown
   2912, Accelerator restriction: induction variable live-out from loop: i0
   2917, Accelerator restriction: induction variable live-out from loop: i,i0,j,l,l1
   2920, Accelerator restriction: induction variable live-out from loop: l1
   2921, Accelerator restriction: induction variable live-out from loop: l
   2923, Accelerator restriction: induction variable live-out from loop: i,j
   2926, Loop carried dependence due to exposed use of r[i1+1][1:i1+6] prevents parallelization
         Complex loop carried dependence of a->->,r->->,z->-> prevents parallelization
         Accelerator restriction: induction variable live-out from loop: i
         Accelerator restriction: scalar variable live-out from loop: i0,i1,j,l,l1,r->->
         Accelerator restriction: size of the GPU copy of k,k1 is unknown
   2928, Accelerator restriction: induction variable live-out from loop: j
   2930, Accelerator restriction: induction variable live-out from loop: j
   2931, Complex loop carried dependence of a->->,r->->,z->-> prevents parallelization
         Accelerator restriction: scalar variable live-out from loop: i0,l,l1,r->->
   2933, Accelerator restriction: induction variable live-out from loop: l
   2934, Complex loop carried dependence of a->->,r->->,z->-> prevents parallelization
         Parallelization requires privatization of r->-> as well as last value
         Accelerator restriction: scalar variable live-out from loop: i0,l1,r->->
         Accelerator restriction: size of the GPU copy of a,r is unknown
   2936, Accelerator restriction: induction variable live-out from loop: i,i0,j,l,l1
   2937, Accelerator restriction: induction variable live-out from loop: i0
   2938, Accelerator restriction: induction variable live-out from loop: l1
   2939, Accelerator restriction: induction variable live-out from loop: l
   2941, Accelerator restriction: induction variable live-out from loop: i,j
   2945, Complex loop carried dependence of r->->,s,v->->,z->-> prevents parallelization
         Accelerator restriction: scalar variable live-out from loop: a1,e1,e2,i0,j,l,s,z->->
         Scalar last value needed after loop for e1 at line 2962
         Scalar last value needed after loop for e2 at line 2962
   2947, Accelerator restriction: induction variable live-out from loop: j
   2948, Complex loop carried dependence of r->->,s,v->-> prevents parallelization
         Parallelization requires privatization of s as well as last value
         Accelerator restriction: scalar variable live-out from loop: i0,l,s
         Accelerator restriction: size of the GPU copy of v is unknown
   2950, Accelerator restriction: induction variable live-out from loop: i0
   2951, Accelerator restriction: induction variable live-out from loop: i,i0,j,l
   2952, Accelerator restriction: induction variable live-out from loop: l
   2953, Accelerator restriction: induction variable live-out from loop: j
   2954, Accelerator restriction: induction variable live-out from loop: i,j
   2956, Accelerator restriction: induction variable live-out from loop: i,j
   2957, Accelerator restriction: induction variable live-out from loop: j
   2958, Accelerator restriction: induction variable live-out from loop: i
   2973, Accelerator restriction: induction variable live-out from loop: i,l
PGC/x86-64 Windows 16.5-0: compilation completed with severe errors

Actually, I want to parallelize only the “i” loop. Do you have any advice?
Thanks a lot.
Bin

Hi Bin,

For the loop dependency analysis, add the “independent” clause to your loop directive. Right now the compiler can’t tell if your pointers point to the same array or not. “independent” asserts to the compiler that it’s ok to go ahead and parallelize the code.

You’ll need to privatize the “s” array or make it a scalar, otherwise you’ll have a race condition (I see no reason why you’re using s as an array).

You’ll want a reduction clause for the “e1” and “e2” variables.

Finally, you’ll most likely need to add your arrays to a data clause. I doubt the compiler will be able to tell how much of the arrays to copy over.

Try something like the following. Please fill in the correct sizes of the arrays where I use a “?”.

#pragma acc kernels   loop independent \
   copyin(n[0:n1+1][0:3], k1[0:?], k[0:?], a[0:?][0:?]) \
   copy(z[0:?][0:?], r[0:n1+1][0:?])     \
   reduction(+:e1,e2) 
            for (i=1;         i<= n1;                i++) 
            {    
                  for (j=n[i][1]+1; j<= n[i][1]+n[i][2]-1; j++) 
                  { 
                        if (k1[j] != 0) 
                        { 
                            i1=k[j]; 
                            i0=0; 
                            for (l=1;   l<= 6;  l++) 
                            { 
                                  for (l1=1; l1<= 6; l1++) 
                                  { 
                                        i0++; 
                                        r[i][l] += -a[j][i0]*z[i1][l1]; 

                                  } 
                            }  
                        } 
                  } 
                  for (j=n[i][1]+1; j<= n[i][1]+n[i][2]-1; j++) 
                  { 
                        if (k1[j] != 0) 
                        {    
                            i1=k[j]; 
                            for (l=1;   l<= 6;  l++) 
                            { 
                                  i0=l; 
                                  for (l1=1; l1<= 6; l1++) 
                                  { 
                                       r[i1][l] += -a[j][i0]*z[i][l1]; 
                                        i0 +=6; 
                                  }  
                            }  
                        } 
                  }  
                  i0=0; 
                  for (j=1; j<= 6; j++) 
                  { 
                        s=0; 
                        for (l=1; l<= 6; l++) 
                        { 
                              i0++; 
                              s += v[i][i0]*r[i][l];                              
                        }  
                        a1 =  s; 
                        e1 += fabs(a1-z[i][j]); 
                        e2 += fabs(a1); 
                        z[i][j] = a1; 
                  }  
            }

Hi Mat,

I corrected my code according to your suggestion, but it still doesn't work.

I have no idea how to fix the “DistributeMultiplicate failure”.

Here are the code and the -Minfo messages. Do you know why this happens?

2888 #pragma acc data copyin(n[0:n1][0:3], k1[0:40*n1], k[0:40*n1], a[0:n3][0:36], v[0:n1][0:36]) copy(z[0:n1][0:6], r[0:n1][0:6])
2889 {
2890     #pragma acc kernels for independent reduction(+:e1,e2) private(i0, i1, l, l1)
2891     for (i=1; i<=n1; i++)
2892     {
2893         #pragma acc for independent
2894         for (j=n[i][1]+1; j<=n[i][1]+n[i][2]-1; j++)
2895         {
2896             if (k1[j] != 0)
2897             {
2898                 i1 = k[j];
2899                 i0 = 0;
2900                 #pragma acc for independent
2901                 for (l=1; l<=6; l++)
2902                 {
2903                     #pragma acc for independent
2904                     for (l1=1; l1<=6; l1++)
2905                     {
2906                         i0++;
2907                         r[i][l] += -a[j][i0]*z[i1][l1];
2908                     }
2909                 }
2910             }
2911         }
2912         #pragma acc for independent
2913         for (j=n[i][1]+1; j<=n[i][1]+n[i][2]-1; j++)
2914         {
2915             if (k1[j] != 0)
2916             {
2917                 i1 = k[j];
2918                 for (l=1; l<=6; l++)
2919                 {
2920                     i0 = l;
2921                     #pragma acc for independent
2922                     for (l1=1; l1<=6; l1++)
2923                     {
2924                         r[i1][l] += -a[j][i0]*z[i][l1];
2925                         i0 += 6;
2926                     }
2927                 }
2928             }
2929         }
2930         i0 = 0;
2931         #pragma acc for private(a1)
2932         for (j=1; j<=6; j++)
2933         {
2934             a1 = 0;
2935             #pragma acc for reduction(+:a1)
2936             for (l=1; l<=6; l++)
2937             {
2938                 i0++;
2939                 a1 += v[i][i0]*r[i][l];
2940             }  /*  l  */
2941             e1 += fabs(a1-z[i][j]);
2942             e2 += fabs(a1);
2943             z[i][j] = a1;
2944         }
2945     }
2946 }



PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages)
: DistributeMultiplicate failure (d:\pgi\dfgpu.c: 2890)
di20:
   2888, Generating copyin(a[:n3][:36],k1[:n1*40],k[:n1*40],n[:n1][:3])
         Generating copy(r[:n1][:6])
         Generating copyin(v[:n1][:36])
         Generating copy(z[:n1][:6])
   2891, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
       2891, #pragma acc loop gang /* blockIdx.x */
             Generating reduction(+:e1,e2)
       2894, #pragma acc loop seq
       2901, #pragma acc loop vector(32) /* threadIdx.x */
       2904, #pragma acc loop seq
       2913, #pragma acc loop seq
       2918, #pragma acc loop vector(32) /* threadIdx.x */
       2922, #pragma acc loop seq
       2932, #pragma acc loop seq
       2936, #pragma acc loop vector(32) /* threadIdx.x */
             Generating reduction(+:a1)
   2891, Accelerator restriction: scalar variable live-out from loop: z->->
   2894, Loop is parallelizable
   2901, Loop is parallelizable
   2904, Loop is parallelizable
   2913, Loop is parallelizable
   2918, Loop is parallelizable
   2922, Loop is parallelizable
   2932, Loop is parallelizable
         Scalar last value needed after loop for e1 at line 2951
         Scalar last value needed after loop for e2 at line 2951
   2936, Loop is parallelizable
PGC/x86-64 Windows 16.5-0: compilation completed with severe errors

Hi Bin

The “DistributeMultiplicate failure” looks like a compiler error. If you can send a reproducing example to PGI Customer Service (trs@pgroup.com), I would appreciate it.

However, the compiler error may be caused by some incorrect code since you need an “independent” in the directive at line 2931.

Other issues that I see:

  • A minor nit: use “loop” instead of “for”. The OpenACC subcommittee is considering making “for” and “do” synonyms for “loop”, but it's not part of the spec yet. PGI accepts “for” as an extension, but your code will be more portable if you use “loop”.
  • Unlike OpenMP, in OpenACC scalars are private by default. Except for a few cases, there’s no need to put them in a private clause.
  • Your code has a race condition at line 2924. Consider adding an atomic operation.
  • Line 2907 is also problematic, since parallelizing the “l1” loop would cause a race condition. I'd recommend not parallelizing these inner loops, at least to start, and only parallelizing the outermost “i” loop. Once you have things working, you can explore parallelizing the inner loops.

Try something like the following:

2888 #pragma acc data copyin(n[0:n1][0:3], k1[0:40*n1], k[0:40*n1], a[0:n3][0:36], v[0:n1][0:36]) copy(z[0:n1][0:6], r[0:n1][0:6])
2889 {
2890     #pragma acc kernels loop independent reduction(+:e1,e2)
2891     for (i=1; i<=n1; i++)
2892     {
2893         //#pragma acc loop independent
2894         for (j=n[i][1]+1; j<=n[i][1]+n[i][2]-1; j++)
2895         {
2896             if (k1[j] != 0)
2897             {
2898                 i1 = k[j];
2899                 i0 = 0;
2900
2901                 for (l=1; l<=6; l++)
2902                 {
2903
2904                     for (l1=1; l1<=6; l1++)
2905                     {
2906                         i0++;
2907                         r[i][l] += -a[j][i0]*z[i1][l1];
2908                     }
2909                 }
2910             }
2911         }
2912         //#pragma acc loop independent
2913         for (j=n[i][1]+1; j<=n[i][1]+n[i][2]-1; j++)
2914         {
2915             if (k1[j] != 0)
2916             {
2917                 i1 = k[j];
2918                 for (l=1; l<=6; l++)
2919                 {
2920                     i0 = l;
2921
2922                     for (l1=1; l1<=6; l1++)
2923                     {
                             #pragma acc atomic update
2924                         r[i1][l] += -a[j][i0]*z[i][l1];
2925                         i0 += 6;
2926                     }
2927                 }
2928             }
2929         }
2930         i0 = 0;
2931         //#pragma acc loop independent
2932         for (j=1; j<=6; j++)
2933         {
2934             a1 = 0;
2935
2936             for (l=1; l<=6; l++)
2937             {
2938                 i0++;
2939                 a1 += v[i][i0]*r[i][l];
2940             }  /*  l  */
2941             e1 += fabs(a1-z[i][j]);
2942             e2 += fabs(a1);
2943             z[i][j] = a1;
2944         }
2945     }
2946 }

Hi Mat,

I have sent an email with a reproducing example attached to trs@pgroup.com.

I understand scalars are private by default. But if I do not make the scalars private, the compiler gives messages like: 1018, Accelerator restriction: scalar variable live-out from loop: i0,i1,l,l1,r->->,z->->.

Please check the email. Thanks a lot.

Bin

Hi Mat

I am facing another problem. Do you know why I cannot use i0=(l-1)*6+l1; instead of i0++ in the following code when I compile with the -acc flag?

Actually, the two are equivalent, and I can compile the code with PGI successfully without -acc.

#pragma acc data copyin(n[0:n1][0:3], k1[0:40*n1], k[0:40*n1], a[0:n3][0:36], v[0:n1][0:36]) copy(z[0:n1][0:6], r[0:n1][0:6])
{
    #pragma acc kernels
    for (i=1; i<=n1; i++)
    {
        for (j=n[i][1]+1; j<=n[i][1]+n[i][2]-1; j++)
        {
            if (k1[j] != 0)
            {
                i1 = k[j];
                i0 = 0;
                for (l=1; l<=6; l++)
                {
                    for (l1=1; l1<=6; l1++)
                    {
                        i0++;
                        //i0=(l-1)*6+l1;
                        r[i][l] += -a[j][i0]*z[i1][l1];
                    }
                }
            }
        }

PGI gives this output:

d:\pgi\dfgpu.c:
Unimplemented opcode: 0
PGC-F-0000-Internal compiler error. Unimplemented opcode.       4 (d:\pgi\dfgpu.c)
PGC/x86-64 Windows 16.5-0: compilation aborted

Hi Bin,

You can use computed indices; however, the compiler cannot determine independence, so you'll need to either use the “parallel” directive or add “loop independent” to the parallelizable loops.

However, the error you’re seeing is a compiler issue. Can you please send a reproducing example to PGI Customer Service (trs@pgroup.com) so we can investigate?

  • Mat