Reason for Tile(...) producing incorrect code?

Hello all. I’m using a tile() clause in one of my doubly nested loops. It looks like this:

#pragma acc kernels loop tile(256,256)
 DO(i,zc){I t=1&(rv[i/8]>>(i%8));
  DO(j,jc-1){B x=(j+1)*zc+i;
   t=(1&(rv[x/8]>>(x%8)))+t;}
  zv[i]=t;}

Without the tile(…) clause things work fine. With the tile clause things go horribly wrong, and the answers look as if almost no computation is being done at all. Here’s the compiler output for this line of code:

    621, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        621, #pragma acc loop gang, vector tile(256,256) /* blockIdx.x threadIdx.x */
        622,   /* blockIdx.x threadIdx.x tiled */
        623, Generating implicit reduction(+:t)
    622, Loop is parallelizable

I tried using just tile(256) to see if that had any effect, but then no tiling appears to be done at all and things work fine, with no tile mentioned in the compiler output. Here’s the output for the case with no tile clause:

    621, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        621, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        622, #pragma acc loop seq
    622, Loop is parallelizable

I can try to put together a working example, but it will take a bit of time. I just wanted to check first whether there’s something obviously wrong here that could be causing this problem.

Hi Aaron,

On an NVIDIA device the tile clause maps each tile to a CUDA thread block, which is limited to 1024 threads. A 256x256 tile would require 65,536 threads, so it’s far too large. Can you try a smaller size, such as “tile(32,32)”? At 32x32 = 1024 threads, that’s the largest square tile that fits in a thread block.

Granted, we have had challenges with the tile clause in the past, so I’m not discounting that this could be a compiler error. If the problem persists with the smaller tile size, please send a reproducing example to PGI Customer Service (trs@pgroup.com) so we can investigate.

Thanks,
Mat