PGI and OpenACC - problem with collapse clause

Hi all,
I’m working on a small C program to test GPU / manycore accelerators.

On an NVIDIA Kepler K20, compiling with PGI 14.4, I’ve found this strange problem:

With the following code (an excerpt), performance is quite low, but it produces correct results:

#pragma acc kernels present(grid,next_grid,sum,A,B)
  {
    #pragma acc loop gang independent
    for (i=rmin_int; i<=rmax_int; i++) {  // rows
      #pragma acc loop independent
      for (j=cmin_int; j<=cmax_int; j++) {  // columns
        #pragma acc loop vector reduction(+:sum)
        for (k=0; k < ncomp; k++)  sum += A[k] + B[k]; // COMP

        // LIFE
        neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1] + grid[i][j+1] + grid[i][j-1] + grid[i-1][j+1]+grid[i-1][j]+grid[i-1][j-1];
        if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
          next_grid[i][j] = 0.0;
        else if ( neighbors == 3.0 )
          next_grid[i][j] = 1.0;
        else
          next_grid[i][j] =  grid[i][j];
      }
    }
  }

If I add the “collapse” clause, the kernel is launched, but I verified through nvvp that it does not actually compute anything (indeed, it runs much faster):

#pragma acc kernels present(grid,next_grid,sum,A,B)
  {
    #pragma acc loop gang collapse(3) independent reduction(+:sum)
    for (i=rmin_int; i<=rmax_int; i++) {  // rows
      for (j=cmin_int; j<=cmax_int; j++) {  // columns
        for (k=0; k < ncomp; k++)  sum += A[k] + B[k]; // COMP

        // LIFE
        neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1] + grid[i][j+1] + grid[i][j-1] + grid[i-1][j+1]+grid[i-1][j]+grid[i-1][j-1];
        if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
          next_grid[i][j] = 0.0;
        else if ( neighbors == 3.0 )
          next_grid[i][j] = 1.0;
        else
          next_grid[i][j] =  grid[i][j];
      }
    }
  }

Compilation command: pgcc -Mmpi=mpich life.c -o life_acckep -acc -ta=tesla:kepler -Minfo=accel


Thanks in advance.
~p

Hi Paolo,

“Collapse” can only be used on tightly nested loops, but your inner loop isn’t tightly nested. What does the “-Minfo=accel” output say? Is the compiler generating a kernel?
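For reference, “tightly nested” means each outer loop body contains nothing but the next loop, so the LIFE section sitting next to the k loop is what blocks collapse(3) in your code. A minimal sketch of a nest where collapse is legal (the arrays in/out and bounds N/M are made-up names for this example):

    #pragma acc kernels present(in,out)
    {
      // collapse(2) is fine here: the i loop body holds only the j loop,
      // and all of the work lives in the innermost body
      #pragma acc loop collapse(2) independent
      for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
          out[i][j] = in[i][j];
        }
      }
    }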

Try this schedule:

    #pragma acc loop gang worker collapse(2) independent reduction(+:sum)
    for (i=rmin_int; i<=rmax_int; i++) {  // rows
      for (j=cmin_int; j<=cmax_int; j++) {  // columns
        #pragma acc loop vector
        for (k=0; k < ncomp; k++)  sum += A[k] + B[k]; // COMP
        // ... LIFE section and closing braces as before
- Mat

Hi Mat, thank you for your reply.

This is the output of -Minfo=accel in both cases:

with collapse:

compute_Internals:
    555, Generating present(grid[:][:])
         Generating present(next_grid[:][:])
         Generating present(sum)
         Generating present(A[:])
         Generating present(B[:])
         Generating Tesla code
    562, Loop is parallelizable
    564, Loop is parallelizable
         Accelerator kernel generated
        562, #pragma acc loop vector(128) collapse(3) /* threadIdx.x */
        564,   /* threadIdx.x collapsed */
        568,   /* threadIdx.x collapsed */
             Sum reduction generated for sum
         Loop is parallelizable

without collapse:

compute_Internals:
    555, Generating present(grid[:][:])
         Generating present(next_grid[:][:])
         Generating present(sum)
         Generating present(A[:])
         Generating present(B[:])
         Generating Tesla code
    562, Loop is parallelizable
    564, Loop is parallelizable
         Accelerator kernel generated
        562, #pragma acc loop gang(300) /* blockIdx.x */
        564, #pragma acc loop worker(16) /* threadIdx.y */
        568, #pragma acc loop vector(16) /* threadIdx.x */
             Sum reduction generated for sum
         Loop is parallelizable

I’ve applied your suggestion, but the code still seems slow.
This is the output for the modification you suggested:

compute_Internals:
    555, Generating present(grid[:][:])
         Generating present(next_grid[:][:])
         Generating present(sum)
         Generating present(A[:])
         Generating present(B[:])
         Generating Tesla code
    562, Accelerator restriction: scalar variable live-out from loop: sum
    564, Loop is parallelizable
         Accelerator kernel generated
        562, #pragma acc loop  collapse(2)
        564,   collapsed */
             Sum reduction generated for sum
        568, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for sum
         Loop is parallelizable

~p

Ok, let’s try splitting the loops. The inner summation loop is independent of the second section of code, but because the second section isn’t tightly nested, it’s inhibiting some parallelization.

#pragma acc kernels present(grid,next_grid,sum,A,B) 
  { 
    #pragma acc loop gang collapse(3) independent reduction(+:sum) 
     for (i=rmin_int; i<=rmax_int; i++) {  // rows
       for (j=cmin_int; j<=cmax_int; j++) {  // columns
         for (k=0; k < ncomp; k++)  sum += A[k] + B[k]; // COMP 
     }}
    #pragma acc loop independent 
     for (i=rmin_int; i<=rmax_int; i++) {  // rows
     #pragma acc loop independent
       for (j=cmin_int; j<=cmax_int; j++) {  // columns

         // LIFE 
         neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1] + grid[i][j+1] + grid[i][j-1] + grid[i-1][j+1]+grid[i-1][j]+grid[i-1][j-1]; 
         if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) ) 
           next_grid[i][j] = 0.0; 
         else if ( neighbors == 3.0 ) 
           next_grid[i][j] = 1.0; 
         else 
           next_grid[i][j] =  grid[i][j]; 
       } 
     }
  }

For the first set of loops, you may try using “#pragma acc loop independent reduction(+:sum)” on each loop level instead of “collapse”, since this will allow the compiler to use multi-dimensional blocks.
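For example, the first nest could carry the directives like this (just a sketch of the loop schedule; the compiler still picks the actual gang/worker/vector mapping):

    #pragma acc loop independent reduction(+:sum)
     for (i=rmin_int; i<=rmax_int; i++) {  // rows
       #pragma acc loop independent reduction(+:sum)
       for (j=cmin_int; j<=cmax_int; j++) {  // columns
         #pragma acc loop independent reduction(+:sum)
         for (k=0; k < ncomp; k++)  sum += A[k] + B[k]; // COMP
     }}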

For the second set of loops, try using “collapse(2)”.
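That would make the second nest look roughly like this sketch (body unchanged from above):

    #pragma acc loop collapse(2) independent
     for (i=rmin_int; i<=rmax_int; i++) {  // rows
       for (j=cmin_int; j<=cmax_int; j++) {  // columns
         // LIFE
         neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1]
                   + grid[i][j+1]   + grid[i][j-1]
                   + grid[i-1][j+1] + grid[i-1][j] + grid[i-1][j-1];
         if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
           next_grid[i][j] = 0.0;
         else if ( neighbors == 3.0 )
           next_grid[i][j] = 1.0;
         else
           next_grid[i][j] = grid[i][j];
       }
     }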

Your second set of suggestions (“#pragma acc loop independent reduction(+:sum)” for the first set of loops and “collapse(2)” for the second set) seems to have solved my problem: I now get good performance and correct results.

Thank you Mat!

~p