 # PGI and OpenACC - problem with collapse clause

Hi all,
I’m working on a small C program to test GPU / manycore accelerators.

On an NVIDIA Kepler K20, compiling with PGI 14.4, I’ve run into this strange problem:

With the following code (an excerpt), performance is quite low, but it produces correct results:

``````
#pragma acc kernels present(grid,next_grid,sum,A,B)
{
    #pragma acc loop gang independent
    for (i = rmin_int; i <= rmax_int; i++) {          // rows
        #pragma acc loop independent
        for (j = cmin_int; j <= cmax_int; j++) {      // columns
            #pragma acc loop vector reduction(+:sum)
            for (k = 0; k < ncomp; k++)
                sum += A[k] + B[k];                   // COMP

            // LIFE
            neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1]
                      + grid[i][j+1]   + grid[i][j-1]
                      + grid[i-1][j+1] + grid[i-1][j] + grid[i-1][j-1];
            if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
                next_grid[i][j] = 0.0;
            else if ( neighbors == 3.0 )
                next_grid[i][j] = 1.0;
            else
                next_grid[i][j] = grid[i][j];
        }
    }
}
``````

If I add the “collapse” clause, the kernel is launched, but I verified through nvvp that it doesn’t actually compute anything (indeed, it runs much faster):

``````
#pragma acc kernels present(grid,next_grid,sum,A,B)
{
    #pragma acc loop gang collapse(3) independent reduction(+:sum)
    for (i = rmin_int; i <= rmax_int; i++) {          // rows
        for (j = cmin_int; j <= cmax_int; j++) {      // columns
            for (k = 0; k < ncomp; k++)
                sum += A[k] + B[k];                   // COMP

            // LIFE
            neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1]
                      + grid[i][j+1]   + grid[i][j-1]
                      + grid[i-1][j+1] + grid[i-1][j] + grid[i-1][j-1];
            if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
                next_grid[i][j] = 0.0;
            else if ( neighbors == 3.0 )
                next_grid[i][j] = 1.0;
            else
                next_grid[i][j] = grid[i][j];
        }
    }
}
``````

Compilation command: `pgcc -Mmpi=mpich life.c -o life_acckep -acc -ta=tesla:kepler -Minfo=accel`

~p

Hi Paolo,

“Collapse” can only be used on tightly nested loops, but your inner loop isn’t tightly nested. What does the “-Minfo=accel” message say? Is the compiler generating a kernel?

Try this schedule:

``````
#pragma acc loop gang worker collapse(2) independent reduction(+:sum)
for (i = rmin_int; i <= rmax_int; i++) {          // rows
    for (j = cmin_int; j <= cmax_int; j++) {      // columns
        #pragma acc loop vector
        for (k = 0; k < ncomp; k++)
            sum += A[k] + B[k];                   // COMP
``````
• Mat

This is the output of -Minfo=accel in both cases:

with collapse:

``````
compute_Internals:
555, Generating present(grid[:][:])
Generating present(next_grid[:][:])
Generating present(sum)
Generating present(A[:])
Generating present(B[:])
Generating Tesla code
562, Loop is parallelizable
564, Loop is parallelizable
Accelerator kernel generated
562, #pragma acc loop vector(128) collapse(3) /* threadIdx.x */
Sum reduction generated for sum
Loop is parallelizable
``````

without collapse:

``````
compute_Internals:
555, Generating present(grid[:][:])
Generating present(next_grid[:][:])
Generating present(sum)
Generating present(A[:])
Generating present(B[:])
Generating Tesla code
562, Loop is parallelizable
564, Loop is parallelizable
Accelerator kernel generated
562, #pragma acc loop gang(300) /* blockIdx.x */
564, #pragma acc loop worker(16) /* threadIdx.y */
568, #pragma acc loop vector(16) /* threadIdx.x */
Sum reduction generated for sum
Loop is parallelizable
``````

And this is the output for the modification you suggested:

``````
compute_Internals:
555, Generating present(grid[:][:])
Generating present(next_grid[:][:])
Generating present(sum)
Generating present(A[:])
Generating present(B[:])
Generating Tesla code
562, Accelerator restriction: scalar variable live-out from loop: sum
564, Loop is parallelizable
Accelerator kernel generated
562, #pragma acc loop  collapse(2)
564,   collapsed */
Sum reduction generated for sum
568, #pragma acc loop vector(128) /* threadIdx.x */
Sum reduction generated for sum
Loop is parallelizable
``````

~p

Ok, let’s try splitting the loops. The inner summation loop is independent of the second section of code, but because that second section isn’t tightly nested inside the loop nest, it’s inhibiting some parallelization.

``````
#pragma acc kernels present(grid,next_grid,sum,A,B)
{
    #pragma acc loop gang collapse(3) independent reduction(+:sum)
    for (i = rmin_int; i <= rmax_int; i++) {          // rows
        for (j = cmin_int; j <= cmax_int; j++) {      // columns
            for (k = 0; k < ncomp; k++)
                sum += A[k] + B[k];                   // COMP
        }
    }

    #pragma acc loop independent
    for (i = rmin_int; i <= rmax_int; i++) {          // rows
        #pragma acc loop independent
        for (j = cmin_int; j <= cmax_int; j++) {      // columns
            // LIFE
            neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1]
                      + grid[i][j+1]   + grid[i][j-1]
                      + grid[i-1][j+1] + grid[i-1][j] + grid[i-1][j-1];
            if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
                next_grid[i][j] = 0.0;
            else if ( neighbors == 3.0 )
                next_grid[i][j] = 1.0;
            else
                next_grid[i][j] = grid[i][j];
        }
    }
}
``````

For the first loop nest, you may try using “#pragma acc loop independent reduction(+:sum)” on each loop level instead of using “collapse”, since this will allow the compiler to use multi-dimensional blocks.

For the second set of loops, try using “collapse(2)”.

Your second suggestion (#pragma acc loop independent reduction(+:sum) on each level of the first loop nest and collapse(2) for the second set) seems to have solved my problem: I now get good performance and correct output results.

Thank you Mat!

~p