Nested loops in C

I am trying to get the compiler to parallelize across two nested loops. This works as expected in Fortran, but in C the compiler (pgcc v10.6) reports that the inner loop is parallelizable yet only schedules the outer loop on the accelerator. I’d be grateful for any advice on how to do this. The simple example below illustrates the problem.
Code:

    20	#pragma accel region
    21	  {
    22	#pragma acc for parallel, vector(16)
    23	    for (i = 0; i<N; i++)
    24	      {
    25	#pragma acc for parallel, vector(16)
    26		for (j = 0; j<N; j++)
    27		  {
    28		    b[i][j] = 2.*a[i][j];
    29		  }
    30	      }
    31	  }//end accel region

Compilation:

[agray3@fermi0 nested]$ pgcc -ta=nvidia:cc20 -Minfo:accel nested.c
main:
     20, Generating copyout(b[0:255][0:255])
         Generating copyin(a[0:255][0:255])
         Generating compute capability 2.0 binary
     23, Loop is parallelizable
         Accelerator kernel generated
         23, #pragma acc for parallel, vector(16)
             CC 2.0 : 8 registers; 4 shared, 48 constant, 0 local memory bytes; 16 occupancy
     26, Loop is parallelizable
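
For checking whatever the accelerator produces, a host-only version of the same loop nest can serve as a reference. The following is only a sketch under stated assumptions: N = 256 is inferred from the 0:255 bounds in the -Minfo output, the arrays are assumed to be statically sized, and init_input, scale_host, and count_mismatches are hypothetical helper names that are not part of the original program.

```c
#include <stddef.h>

#define N 256 /* assumption: inferred from the b[0:255][0:255] bounds reported by -Minfo */

static float a[N][N];
static float b[N][N];

/* Hypothetical helper: fill a with distinct values that are exact in float
   (the largest is N*N - 1 = 65535). */
void init_input(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (float)(i * N + j);
}

/* The loop nest from the post, run on the host with no directives.
   In the accelerated build, the region and "acc for" directives wrap these loops. */
void scale_host(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[i][j] = 2.0f * a[i][j];
}

/* Hypothetical helper: count the elements where b differs from 2*a. */
size_t count_mismatches(void)
{
    size_t bad = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (b[i][j] != 2.0f * a[i][j])
                bad++;
    return bad;
}
```

Calling init_input(), then the loop nest, then count_mismatches() from main gives a quick pass/fail check; swapping scale_host() for the accelerated region lets the same check run against the accelerator result.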

Hi Alan,

I’m not too sure why the inner loop is not being scheduled. I’ve sent an example on to one of our compiler engineers to see if it’s a compiler issue or I’m missing something.

Thanks,
Mat

Hi Alan,

I heard back from our compiler engineer. It turns out that this is a known issue that he was planning on addressing for the 11.0 release. However, since several users have recently reported the same issue, we bumped up the priority and were able to add the fix in this month’s 10.9 release.

Thanks,
Mat

% cat test.c
int foo (int N, float ** b, float ** a) {

 int i, j;

#pragma accel region
 {
   for (i = 0; i<N; i++)
     {
       for (j = 0; j<N; j++)
         {
           b[i][j] = 2.*a[i][j];
         }
     }
 }//end accel region

 return 1;
}
% pgcc -c test.c -ta=nvidia -Minfo=accel -fast -Msafeptr -Mfcon -V10.9
foo:
      5, Generating copyout(b[0:N-1][0:N-1])
         Generating copyin(a[0:N-1][0:N-1])
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
      7, Loop is parallelizable
      9, Loop is parallelizable
         Accelerator kernel generated
          7, #pragma acc for parallel, vector(16)
          9, #pragma acc for parallel, vector(16)
             CC 1.0 : 6 registers; 24 shared, 40 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 6 registers; 24 shared, 40 constant, 0 local memory bytes; 100 occupancy
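
With the fix, -Minfo reports both loops scheduled as parallel, vector(16). As a rough mental model (my reading of the schedule, not something stated in this thread), each loop is strip-mined into chunks of 16 iterations: the chunks are spread across thread blocks and the 16 iterations inside a chunk across threads, giving 16x16 thread blocks. The host-only C sketch below emulates that decomposition sequentially. The names TILE and scale_tiled and the flat row-major layout are illustrative assumptions, not what the compiler actually generates.

```c
#include <stddef.h>

#define TILE 16 /* matches the vector(16) width shown in the -Minfo output */

/* Sequential emulation of a 2D "parallel, vector(16)" schedule.
   a and b are n*n arrays in row-major order (assumed layout). */
void scale_tiled(int n, float *b, const float *a)
{
    int tiles = (n + TILE - 1) / TILE; /* chunks per dimension, rounded up */

    for (int bi = 0; bi < tiles; bi++)                /* "parallel" part of the i loop */
        for (int bj = 0; bj < tiles; bj++)            /* "parallel" part of the j loop */
            for (int ti = 0; ti < TILE; ti++)         /* "vector" part of the i loop */
                for (int tj = 0; tj < TILE; tj++) {   /* "vector" part of the j loop */
                    int i = bi * TILE + ti;
                    int j = bj * TILE + tj;
                    if (i < n && j < n)               /* guard for partial tiles at the edges */
                        b[(size_t)i * n + j] = 2.0f * a[(size_t)i * n + j];
                }
}
```

The two outer loops stand in for the grid of blocks and the two inner loops for the threads within one block; on the GPU all four levels would run concurrently rather than in sequence.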