PGI and OpenACC - problem with collapse clause

Hi all,
I’m working on a small C program to test GPU / manycore accelerators.

On an NVIDIA Kepler K20, compiling with PGI 14.4, I’ve found this strange problem:

With the following code (an excerpt), performance is quite low, but it produces correct results:

#pragma acc kernels present(grid,next_grid,sum,A,B)
  {
    #pragma acc loop gang independent
    for (i=rmin_int; i<=rmax_int; i++) {  // rows
      #pragma acc loop independent
      for (j=cmin_int; j<=cmax_int; j++) {  // columns
        #pragma acc loop vector reduction(+:sum)
        for (k=0; k < ncomp; k++)  sum += A[k] + B[k]; // COMP

        // LIFE
        neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1] + grid[i][j+1] + grid[i][j-1] + grid[i-1][j+1]+grid[i-1][j]+grid[i-1][j-1];
        if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
          next_grid[i][j] = 0.0;
        else if ( neighbors == 3.0 )
          next_grid[i][j] = 1.0;
        else
          next_grid[i][j] =  grid[i][j];
      }
    }
  }

If I add the “collapse” clause, the kernel is launched, but I verified through nvvp that it does not actually compute anything (indeed, it runs much faster):

#pragma acc kernels present(grid,next_grid,sum,A,B)
  {
    #pragma acc loop gang collapse(3) independent reduction(+:sum)
    for (i=rmin_int; i<=rmax_int; i++) {  // rows
      for (j=cmin_int; j<=cmax_int; j++) {  // columns
        for (k=0; k < ncomp; k++)  sum += A[k] + B[k]; // COMP

        // LIFE
        neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1] + grid[i][j+1] + grid[i][j-1] + grid[i-1][j+1]+grid[i-1][j]+grid[i-1][j-1];
        if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
          next_grid[i][j] = 0.0;
        else if ( neighbors == 3.0 )
          next_grid[i][j] = 1.0;
        else
          next_grid[i][j] =  grid[i][j];
      }
    }
  }

Compilation command: pgcc -Mmpi=mpich life.c -o life_acckep -acc -ta=tesla:kepler -Minfo=accel


Thanks in advance.
~p

Hi Paolo,

“Collapse” can only be used on tightly nested loops, but your inner loop isn’t tightly nested. What does the “-Minfo=accel” output say? Is the compiler generating a kernel?
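For reference, “tightly nested” means each outer loop body contains nothing but the next loop, so the LIFE section sitting next to the k loop is what blocks collapse(3) in your code. A minimal sketch of a nest where collapse is legal (the arrays in/out and bounds N/M are made-up names for this example):

    #pragma acc kernels present(in,out)
    {
      // collapse(2) is fine here: the i loop body holds only the j loop,
      // and all of the work lives in the innermost body
      #pragma acc loop collapse(2) independent
      for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
          out[i][j] = in[i][j];
        }
      }
    }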

Try this schedule:

    #pragma acc loop gang worker collapse(2) independent reduction(+:sum)
    for (i=rmin_int; i<=rmax_int; i++) {  // rows
      for (j=cmin_int; j<=cmax_int; j++) {  // columns
        #pragma acc loop vector
        for (k=0; k < ncomp; k++)  sum += A[k] + B[k]; // COMP
        // ... LIFE section and closing braces as before
- Mat

Hi Mat, thank you for your reply.

This is the output of -Minfo=accel in both cases:

with collapse:

compute_Internals:
    555, Generating present(grid[:][:])
         Generating present(next_grid[:][:])
         Generating present(sum)
         Generating present(A[:])
         Generating present(B[:])
         Generating Tesla code
    562, Loop is parallelizable
    564, Loop is parallelizable
         Accelerator kernel generated
        562, #pragma acc loop vector(128) collapse(3) /* threadIdx.x */
        564,   /* threadIdx.x collapsed */
        568,   /* threadIdx.x collapsed */
             Sum reduction generated for sum
         Loop is parallelizable

without collapse:

compute_Internals:
    555, Generating present(grid[:][:])
         Generating present(next_grid[:][:])
         Generating present(sum)
         Generating present(A[:])
         Generating present(B[:])
         Generating Tesla code
    562, Loop is parallelizable
    564, Loop is parallelizable
         Accelerator kernel generated
        562, #pragma acc loop gang(300) /* blockIdx.x */
        564, #pragma acc loop worker(16) /* threadIdx.y */
        568, #pragma acc loop vector(16) /* threadIdx.x */
             Sum reduction generated for sum
         Loop is parallelizable

I’ve applied your suggestion, but the code still seems slow.
This is the output for the modification you suggested:

compute_Internals:
    555, Generating present(grid[:][:])
         Generating present(next_grid[:][:])
         Generating present(sum)
         Generating present(A[:])
         Generating present(B[:])
         Generating Tesla code
    562, Accelerator restriction: scalar variable live-out from loop: sum
    564, Loop is parallelizable
         Accelerator kernel generated
        562, #pragma acc loop  collapse(2)
        564,   collapsed */
             Sum reduction generated for sum
        568, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for sum
         Loop is parallelizable

~p

Ok, let’s try splitting the loops. The inner summation loop is independent of the second section of code, but because the second section isn’t tightly nested, it’s inhibiting some parallelization.

#pragma acc kernels present(grid,next_grid,sum,A,B) 
  { 
    #pragma acc loop gang collapse(3) independent reduction(+:sum) 
     for (i=rmin_int; i<=rmax_int; i++) {  // rows
       for (j=cmin_int; j<=cmax_int; j++) {  // columns
         for (k=0; k < ncomp; k++)  sum += A[k] + B[k]; // COMP 
     }}
    #pragma acc loop independent 
     for (i=rmin_int; i<=rmax_int; i++) {  // rows
     #pragma acc loop independent
       for (j=cmin_int; j<=cmax_int; j++) {  // columns

         // LIFE 
         neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1] + grid[i][j+1] + grid[i][j-1] + grid[i-1][j+1]+grid[i-1][j]+grid[i-1][j-1]; 
         if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) ) 
           next_grid[i][j] = 0.0; 
         else if ( neighbors == 3.0 ) 
           next_grid[i][j] = 1.0; 
         else 
           next_grid[i][j] =  grid[i][j]; 
       } 
     }
  }

For the first set of loops, you may try using “#pragma acc loop independent reduction(+:sum)” on each loop level instead of “collapse”, since this will allow the compiler to use multi-dimensional blocks.
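For example, the first nest could carry the directives like this (just a sketch of the loop schedule; the compiler still picks the actual gang/worker/vector mapping):

    #pragma acc loop independent reduction(+:sum)
     for (i=rmin_int; i<=rmax_int; i++) {  // rows
       #pragma acc loop independent reduction(+:sum)
       for (j=cmin_int; j<=cmax_int; j++) {  // columns
         #pragma acc loop independent reduction(+:sum)
         for (k=0; k < ncomp; k++)  sum += A[k] + B[k]; // COMP
     }}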

For the second set of loops, try using “collapse(2)”.
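That would make the second nest look roughly like this sketch (body unchanged from above):

    #pragma acc loop collapse(2) independent
     for (i=rmin_int; i<=rmax_int; i++) {  // rows
       for (j=cmin_int; j<=cmax_int; j++) {  // columns
         // LIFE
         neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1]
                   + grid[i][j+1]   + grid[i][j-1]
                   + grid[i-1][j+1] + grid[i-1][j] + grid[i-1][j-1];
         if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
           next_grid[i][j] = 0.0;
         else if ( neighbors == 3.0 )
           next_grid[i][j] = 1.0;
         else
           next_grid[i][j] = grid[i][j];
       }
     }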

Your second set of suggestions (“#pragma acc loop independent reduction(+:sum)” for the first set of loops and “collapse(2)” for the second set) seems to have solved my problem: I now get good performance and correct results.

Thank you Mat!

~p