Hello all. I’m using a tile() clause in one of my doubly nested loops. It looks like this:
#pragma acc kernels loop tile(256,256)
DO(i,zc){I t=1&(rv[i/8]>>(i%8));
DO(j,jc-1){B x=(j+1)*zc+i;
t=(1&(rv[x/8]>>(x%8)))+t;}
zv[i]=t;}
Without the tile(…) clause things work fine. With the tile clause the results are badly wrong; they look almost as if no computation were being done at all. Here's the compiler output for this loop:
621, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
621, #pragma acc loop gang, vector tile(256,256) /* blockIdx.x threadIdx.x */
622, /* blockIdx.x threadIdx.x tiled */
623, Generating implicit reduction(+:t)
622, Loop is parallelizable
I tried using just tile(256) to see if that had any effect, but in that case no tiling appears to be done at all: the results are correct and the compiler output doesn't mention tiling. Here's the output when I use no tile clause:
621, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
621, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
622, #pragma acc loop seq
622, Loop is parallelizable
I can try to get a working example put together, but it will take a bit of time. I just wanted to see if there was something that I was obviously doing wrong here that could be causing this problem.
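Until there is a full reproducer, a serial reference version of the loop can at least be used to check the accelerated results. The sketch below assumes that DO(i,n) expands to a plain for loop from 0 to n-1, that I and B are integer types, and that rv is a byte array of packed bits. None of those definitions are shown above, so the types and the names get_bit and column_bit_counts are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read bit k of a packed bit vector (least significant bit first in each byte).
 * Assumption: rv holds unsigned bytes; the original element type is not shown. */
static int get_bit(const uint8_t *rv, size_t k) {
    return 1 & (rv[k / 8] >> (k % 8));
}

/* Serial reference for the loop in question.
 * Treats rv as a jc-by-zc bit matrix stored row by row and writes the number
 * of set bits in column i to zv[i].
 * The directive under test would go directly above the outer loop, e.g.
 *   #pragma acc kernels loop tile(256,256)
 * It is left out here so this version always runs sequentially on the host. */
void column_bit_counts(const uint8_t *rv, size_t zc, size_t jc, int64_t *zv) {
    for (size_t i = 0; i < zc; i++) {
        int64_t t = get_bit(rv, i);               /* row 0 */
        for (size_t j = 0; j + 1 < jc; j++) {     /* rows 1 .. jc-1 */
            size_t x = (j + 1) * zc + i;
            t += get_bit(rv, x);
        }
        zv[i] = t;
    }
}
```

Comparing zv from this host version against the device result for each tile setting should show which elements differ and by how much.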