Issue with collapse clause when switching from pgc++ 18.4 to 18.10

Hi,

Due to an unrelated compiler issue, I had to modify a regular threefold loop nest as follows:

#pragma acc parallel loop independent collapse(3)
    for(int i = xmin; i < xmax; i += xstep) {
      int ii = i;
      for(int j = ymin; j < ymax; j += ystep) {
        int jj = j;
        for(int k = zmin; k < zmax; k += zstep) {
          int kk = k;
          /* some computation */
        }
      }
    }

pgc++ 18.4 seems to accept the collapse clause even though the loop nest is not tightly nested:

      10216, #pragma acc loop gang collapse(3) /* blockIdx.x */
      10218,   /* blockIdx.x collapsed */
      10220,   /* blockIdx.x collapsed */

whereas pgc++ 18.10 complains:

"test.C", line 10194: error: count in COLLAPSE
          clause does not match number of loops, expected 3 loops but found
          only 2
  #pragma acc parallel loop independent collapse(3) \
  ^

"test.C", line 10194: error: count in COLLAPSE
          clause does not match number of loops, expected 3 loops but found
          only 1
  #pragma acc parallel loop independent collapse(3) \

Note that the line numbers differ slightly: there is some commented-out code between the multi-line pragma and the actual loop nest. The 18.10 error message points at the line of the pragma, whereas 18.4 points at the beginning of the loop nest.

My question is whether 18.4 silently ignores the collapse clause and only pretends to collapse (even though this is not supported), while 18.10 merely makes explicit that it cannot work. Or do the two versions really implement this differently?
If 18.4 silently ignored it, i.e. no performance was gained, I would just remove the collapse clause. Otherwise, I would like 18.10 to support it as well.

Thanks,
LS

The OpenACC specification says that collapsed loops must be tightly nested (meaning no statements between the for statements or the trailing brackets). I believe there were some bugs in 18.4, and when fixing them we tightened our adherence to the spec.

You can try using the force option to collapse. Alternatively, manually move the outer assignments into the innermost loop (which is what force does for you).
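
A minimal sketch of the manual rewrite, assuming the copies ii, jj and kk are only needed inside the innermost body. The function name sum_nest, the reduction and the loop body are hypothetical, made up for illustration. With the assignments moved inward, nothing sits between the for statements, so a plain collapse(3) applies to a tightly nested loop nest. The force variant is, as far as I know, spelled collapse(force:3); treat that spelling as an assumption and check the documentation for your compiler release.

```cpp
// Hypothetical example: sum_nest and its body are illustrative only.
long sum_nest(int xmin, int xmax, int xstep,
              int ymin, int ymax, int ystep,
              int zmin, int zmax, int zstep) {
  long total = 0;
  // Tightly nested: no statements between the three for headers.
  #pragma acc parallel loop collapse(3) reduction(+:total)
  for (int i = xmin; i < xmax; i += xstep) {
    for (int j = ymin; j < ymax; j += ystep) {
      for (int k = zmin; k < zmax; k += zstep) {
        // Assignments moved from the outer loops into the innermost body.
        int ii = i;
        int jj = j;
        int kk = k;
        total += ii + jj + kk;  // stand-in for the real computation
      }
    }
  }
  return total;
}
```

A compiler without OpenACC support ignores the pragma and runs the loops serially, which gives the same result.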

-Brent

Hi Brent

Thanks for confirming that the standard mandates tightly nested loops. There was an obscure issue in 17.10 with the tightly nested loops in the original version of the loop nest, and the only workaround I found was to add those “silly” statements between the for loops.
Hopefully, the original problem has been resolved in 18.10 and I can go back to the original, straightforward version that I started from.
I will have to investigate …
If that doesn’t work, I could probably remove the collapse statement and sacrifice some performance?

Thanks,
LS

If that doesn’t work, I could probably remove the collapse statement and sacrifice some performance?

You might try scheduling the inner loops as worker and vector, though the performance will largely depend on the trip count of each loop. Collapsing loops generally helps when the trip counts are small, since combining the loops increases the parallelism. If the trip counts are large, collapsing has the effect of increasing the number of gangs (CUDA blocks) and having the compiler create an inner strip-mine loop for the vector (CUDA threads). By scheduling with worker/vector, you effectively lower the number of gangs but have more work per gang.

Note that independent is implied for loops in a “parallel” region, so using the “independent” clause is not needed here.

#pragma acc parallel loop gang
    for(int i = xmin; i < xmax; i += xstep) {
      int ii = i;
#pragma acc loop worker
      for(int j = ymin; j < ymax; j += ystep) {
        int jj = j;
#pragma acc loop vector
        for(int k = zmin; k < zmax; k += zstep) {
          int kk = k;
          /* some computation */
        }
      }
    }
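
To make the remarks on collapsing and trip counts more concrete, here is a rough sketch of what collapsing amounts to: the three loops become one loop over the product of the trip counts, and the original indices are recovered from the flat index. This only illustrates the idea and is not the code the compiler actually generates. The helper names trip_count and collapsed_sum are hypothetical, and positive step values are assumed.

```cpp
// Iterations of: for (v = lo; v < hi; v += step), assuming step > 0.
long trip_count(int lo, int hi, int step) {
  return hi > lo ? (static_cast<long>(hi) - lo + step - 1) / step : 0;
}

// Hypothetical hand-collapsed version of the threefold loop nest.
long collapsed_sum(int xmin, int xmax, int xstep,
                   int ymin, int ymax, int ystep,
                   int zmin, int zmax, int zstep) {
  const long nx = trip_count(xmin, xmax, xstep);
  const long ny = trip_count(ymin, ymax, ystep);
  const long nz = trip_count(zmin, zmax, zstep);
  long total = 0;
  // One flat loop of nx * ny * nz iterations instead of three nested loops.
  #pragma acc parallel loop reduction(+:total)
  for (long n = 0; n < nx * ny * nz; ++n) {
    // Recover the original loop indices from the flat index n.
    const int i = xmin + static_cast<int>(n / (ny * nz)) * xstep;
    const int j = ymin + static_cast<int>((n / nz) % ny) * ystep;
    const int k = zmin + static_cast<int>(n % nz) * zstep;
    total += i + j + k;  // stand-in for the real computation
  }
  return total;
}
```

With small trip counts, say 4 in each dimension, the flat loop exposes 64 independent iterations, whereas the outermost loop alone would expose only 4.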

-Mat