Hi all,

I’m working on a small C program for testing GPU / manycore accelerators.

On an Nvidia Kepler K20, compiling with PGI 14.4, I’ve run into a strange problem.

With the following code (an excerpt), performance is quite low, but the results are correct:

```
#pragma acc kernels present(grid,next_grid,sum,A,B)
{
    #pragma acc loop gang independent
    for (i=rmin_int; i<=rmax_int; i++) { // rows
        #pragma acc loop independent
        for (j=cmin_int; j<=cmax_int; j++) { // columns
            #pragma acc loop vector reduction(+:sum)
            for (k=0; k < ncomp; k++) sum += A[k] + B[k]; // COMP
            // LIFE
            neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1]
                      + grid[i][j+1]               + grid[i][j-1]
                      + grid[i-1][j+1] + grid[i-1][j] + grid[i-1][j-1];
            if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
                next_grid[i][j] = 0.0;
            else if ( neighbors == 3.0 )
                next_grid[i][j] = 1.0;
            else
                next_grid[i][j] = grid[i][j];
        }
    }
}
```

If I add a “collapse” clause, the kernel is launched, but I verified with nvvp that it does not compute anything (indeed, it runs much faster):

```
#pragma acc kernels present(grid,next_grid,sum,A,B)
{
    #pragma acc loop gang collapse(3) independent reduction(+:sum)
    for (i=rmin_int; i<=rmax_int; i++) { // rows
        for (j=cmin_int; j<=cmax_int; j++) { // columns
            for (k=0; k < ncomp; k++) sum += A[k] + B[k]; // COMP
            // LIFE
            neighbors = grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1]
                      + grid[i][j+1]               + grid[i][j-1]
                      + grid[i-1][j+1] + grid[i-1][j] + grid[i-1][j-1];
            if ( ( neighbors > 3.0 ) || ( neighbors < 2.0 ) )
                next_grid[i][j] = 0.0;
            else if ( neighbors == 3.0 )
                next_grid[i][j] = 1.0;
            else
                next_grid[i][j] = grid[i][j];
        }
    }
}
```

Compilation command:

```
pgcc -Mmpi=mpich life.c -o life_acckep -acc -ta=tesla:kepler -Minfo=accel
```
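One thing I am unsure about: as far as I understand the OpenACC specification, “collapse(n)” applies to tightly nested loops, and in the second version the k loop is followed by the LIFE statements inside the j body, so collapse(3) may simply not be valid there. Below is a minimal, self-contained sketch of a collapse(2) variant that keeps the k loop sequential inside each (i,j) iteration. The names `N`, `life_step` and `local`, the fixed grid size, and the `copyin`/`copy` data clauses are placeholders for illustration and are not taken from life.c. I have not verified how PGI 14.4 schedules this.

```c
#include <assert.h>
#include <stddef.h>

#define N 6 /* hypothetical grid edge, including a one-cell border */

/* One LIFE step over the interior cells, plus the COMP reduction.
 * Only the i and j loops are collapsed: they are tightly nested,
 * whereas the k loop shares the j body with other statements.
 * Without an OpenACC compiler the pragmas are ignored and the
 * function runs serially with the same result. */
double life_step(double grid[N][N], double next_grid[N][N],
                 const double *A, const double *B, int ncomp)
{
    double sum = 0.0;

    assert(A != NULL && B != NULL && ncomp >= 0);

    /* Assumption: data is copied here; the original code uses present(). */
    #pragma acc kernels copyin(grid[0:N][0:N], A[0:ncomp], B[0:ncomp]) copy(next_grid[0:N][0:N])
    {
        #pragma acc loop independent collapse(2) reduction(+:sum)
        for (int i = 1; i <= N - 2; i++) {         /* rows */
            for (int j = 1; j <= N - 2; j++) {     /* columns */
                /* COMP: sequential per-cell accumulation */
                double local = 0.0;
                for (int k = 0; k < ncomp; k++)
                    local += A[k] + B[k];
                sum += local;

                /* LIFE: same rule as in the excerpt */
                double neighbors =
                      grid[i+1][j+1] + grid[i+1][j] + grid[i+1][j-1]
                    + grid[i][j+1]                  + grid[i][j-1]
                    + grid[i-1][j+1] + grid[i-1][j] + grid[i-1][j-1];
                if (neighbors > 3.0 || neighbors < 2.0)
                    next_grid[i][j] = 0.0;
                else if (neighbors == 3.0)
                    next_grid[i][j] = 1.0;
                else
                    next_grid[i][j] = grid[i][j];
            }
        }
    }
    return sum;
}
```

Calling `life_step` on a 6x6 grid holding a horizontal three-cell blinker should turn it into a vertical one, which gives a quick way to check whether the kernel really computes.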

Thanks in advance.

~p