Hi all,
I’m working on a small C program to test GPU / manycore accelerators.
On an NVIDIA Kepler K20, compiling with PGI 14.4, I’ve run into a strange problem:
with the following code (it’s an excerpt), performance is quite low, but it produces correct results:
#pragma acc kernels present(grid,next_grid,sum,A,B)
{
#pragma acc loop gang independent
for (i=rmin_int; i<=rmax_int; i++) { // rows
#pragma acc loop independent
for (j=cmin_int; j<=cmax_int; j++) { // columns
#pragma acc loop vector reduction(+:sum)
for (k=0; k < ncomp; k++) sum += A[k] + B[k]; // COMP
// LIFE
neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1] + grid[i][j+1] + grid[i][j-1] + grid[i-1][j+1]+grid[i-1][j]+grid[i-1][j-1];
if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
next_grid[i][j] = 0.0;
else if ( neighbors == 3.0 )
next_grid[i][j] = 1.0;
else
next_grid[i][j] = grid[i][j];
}
}
}
If I add the “collapse” clause, the kernel is launched, but I verified through nvvp that it does not actually compute anything (indeed it runs much faster):
#pragma acc kernels present(grid,next_grid,sum,A,B)
{
#pragma acc loop gang collapse(3) independent reduction(+:sum)
for (i=rmin_int; i<=rmax_int; i++) { // rows
for (j=cmin_int; j<=cmax_int; j++) { // columns
for (k=0; k < ncomp; k++) sum += A[k] + B[k]; // COMP
// LIFE
neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1] + grid[i][j+1] + grid[i][j-1] + grid[i-1][j+1]+grid[i-1][j]+grid[i-1][j-1];
if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
next_grid[i][j] = 0.0;
else if ( neighbors == 3.0 )
next_grid[i][j] = 1.0;
else
next_grid[i][j] = grid[i][j];
}
}
}
Compilation command: pgcc -Mmpi=mpich life.c -o life_acckep -acc -ta=tesla:kepler -Minfo=accel
Thanks in advance.
~p
Hi Paolo,
“Collapse” can only be used on tightly nested loops, but your inner loop isn’t tightly nested. What does the “-Minfo=accel” message say? Is the compiler generating a kernel?
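For reference, “tightly nested” means nothing but the next for statement sits between the loop headers. A minimal illustration with hypothetical arrays (not your code):

// Tightly nested: collapse(2) is legal, the j loop is the only statement inside the i loop
#pragma acc loop collapse(2) independent
for (i = 0; i < n; i++) {
for (j = 0; j < m; j++) {
out[i][j] = in[i][j];
}
}

// NOT tightly nested: the extra statement at the j level breaks the nesting,
// so collapse(3) over i, j, and k is not allowed
for (i = 0; i < n; i++) {
for (j = 0; j < m; j++) {
for (k = 0; k < ncomp; k++) s += A[k] + B[k];
out[i][j] = in[i][j];
}
}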
Try this schedule:
#pragma acc loop gang worker collapse(2) independent reduction(+:sum)
for (i=rmin_int; i<=rmax_int; i++) { // rows
for (j=cmin_int; j<=cmax_int; j++) { // columns
#pragma acc loop vector
for (k=0; k < ncomp; k++) sum += A[k] + B[k]; // COMP
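In context, keeping the rest of your loop body unchanged, the full region would look roughly like this (just a sketch of how the fragment above fits in, untested):

#pragma acc kernels present(grid,next_grid,sum,A,B)
{
#pragma acc loop gang worker collapse(2) independent reduction(+:sum)
for (i=rmin_int; i<=rmax_int; i++) { // rows
for (j=cmin_int; j<=cmax_int; j++) { // columns
#pragma acc loop vector
for (k=0; k < ncomp; k++) sum += A[k] + B[k]; // COMP
// LIFE
neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1] + grid[i][j+1] + grid[i][j-1] + grid[i-1][j+1] + grid[i-1][j] + grid[i-1][j-1];
if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
next_grid[i][j] = 0.0;
else if ( neighbors == 3.0 )
next_grid[i][j] = 1.0;
else
next_grid[i][j] = grid[i][j];
}
}
}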
Hi Mat, thank you for your reply.
This is the output of “-Minfo=accel” in both cases:
with collapse:
compute_Internals:
555, Generating present(grid[:][:])
Generating present(next_grid[:][:])
Generating present(sum)
Generating present(A[:])
Generating present(B[:])
Generating Tesla code
562, Loop is parallelizable
564, Loop is parallelizable
Accelerator kernel generated
562, #pragma acc loop vector(128) collapse(3) /* threadIdx.x */
564, /* threadIdx.x collapsed */
568, /* threadIdx.x collapsed */
Sum reduction generated for sum
Loop is parallelizable
without collapse:
compute_Internals:
555, Generating present(grid[:][:])
Generating present(next_grid[:][:])
Generating present(sum)
Generating present(A[:])
Generating present(B[:])
Generating Tesla code
562, Loop is parallelizable
564, Loop is parallelizable
Accelerator kernel generated
562, #pragma acc loop gang(300) /* blockIdx.x */
564, #pragma acc loop worker(16) /* threadIdx.y */
568, #pragma acc loop vector(16) /* threadIdx.x */
Sum reduction generated for sum
Loop is parallelizable
I’ve applied your suggestion, but the code still seems slow.
This is the output for the modification you suggested:
compute_Internals:
555, Generating present(grid[:][:])
Generating present(next_grid[:][:])
Generating present(sum)
Generating present(A[:])
Generating present(B[:])
Generating Tesla code
562, Accelerator restriction: scalar variable live-out from loop: sum
564, Loop is parallelizable
Accelerator kernel generated
562, #pragma acc loop collapse(2)
564, collapsed */
Sum reduction generated for sum
568, #pragma acc loop vector(128) /* threadIdx.x */
Sum reduction generated for sum
Loop is parallelizable
~p
OK, let’s try splitting the loops. The inner summation loop is independent of the second section of code, but because the second section isn’t tightly nested, it’s inhibiting some parallelization.
#pragma acc kernels present(grid,next_grid,sum,A,B)
{
#pragma acc loop gang collapse(3) independent reduction(+:sum)
for (i=rmin_int; i<=rmax_int; i++) { // rows
for (j=cmin_int; j<=cmax_int; j++) { // columns
for (k=0; k < ncomp; k++) sum += A[k] + B[k]; // COMP
}}
#pragma acc loop independent
for (i=rmin_int; i<=rmax_int; i++) { // rows
#pragma acc loop independent
for (j=cmin_int; j<=cmax_int; j++) { // columns
// LIFE
neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1] + grid[i][j+1] + grid[i][j-1] + grid[i-1][j+1]+grid[i-1][j]+grid[i-1][j-1];
if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
next_grid[i][j] = 0.0;
else if ( neighbors == 3.0 )
next_grid[i][j] = 1.0;
else
next_grid[i][j] = grid[i][j];
}
}
}
For the first set of loops, you may try using “#pragma acc loop independent reduction(+:sum)” on each loop level instead of using “collapse”, since this will allow the compiler to use multi-dimensional blocks.
For the second set of loops, try using “collapse(2)”.
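Putting both suggestions together with the split loops above, the region would look roughly like this (a sketch, untested):

#pragma acc kernels present(grid,next_grid,sum,A,B)
{
#pragma acc loop independent reduction(+:sum)
for (i=rmin_int; i<=rmax_int; i++) { // rows
#pragma acc loop independent reduction(+:sum)
for (j=cmin_int; j<=cmax_int; j++) { // columns
#pragma acc loop independent reduction(+:sum)
for (k=0; k < ncomp; k++) sum += A[k] + B[k]; // COMP
}
}
#pragma acc loop independent collapse(2)
for (i=rmin_int; i<=rmax_int; i++) { // rows
for (j=cmin_int; j<=cmax_int; j++) { // columns
// LIFE
neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1] + grid[i][j+1] + grid[i][j-1] + grid[i-1][j+1] + grid[i-1][j] + grid[i-1][j-1];
if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
next_grid[i][j] = 0.0;
else if ( neighbors == 3.0 )
next_grid[i][j] = 1.0;
else
next_grid[i][j] = grid[i][j];
}
}
}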
Your second suggestion (“#pragma acc loop independent reduction(+:sum)” on each level of the first set of loops and “collapse(2)” for the second set) seems to have solved my problem; I now get good performance and correct output results.
Thank you Mat!
~p