Strange CUDA grid/block choices in kernels region

Hi,

I have the following code:

!$acc kernels present(vp)
      vp%r=0.
      vp%t=0.
      vp%p=0.
!$acc end kernels

where I am using deepcopy for “vp”.

The code works, but when I look at the compiler output, I see:

28047, Loop is parallelizable
         Accelerator scalar kernel generated
         Accelerator kernel generated
         Generating Tesla code
      28047, !$acc loop gang, vector(4) ! blockidx%z threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             !$acc loop gang ! blockidx%y

(Line 28047 is one of the three assignments in the above kernels region.)

To me, the chosen arrangement of CUDA blocks and threads seems strange. vp%r is a 3D array which looks like is being mapped to [BLK Y][BLK X THREAD X][BLK Z, THREAD Y].

I have never seen a dimension being assigned to a block-thread combo with different directions before.

What is actually happening here?

What does the “Accelerator scalar kernel generated” mean as compared to the “Accelerator kernel generated”?
Does this mean a non-parallel kernel is being created as well?

Thanks!

Hi sumseq,

What is actually happening here?

Since you’re using “kernels”, the compiler is free to use the best available schedule. It’s more like they are being blocked as “[BLK X, THREAD X][BLK Z THREAD Y][BLK Y]”. So it’s vectorizing over the first and second dimension of the array in 32x4 blocks and spreading those blocks across all three dimensions of the grid. So, yes the order shown in the compiler feedback message is a bit misleading, but the compiler is doing the correct thing and striding over the contiguous dimension in groups of 32 (i.e. a warp). The blocks don’t matter much in this case since there’s no shared memory and using a 3-D block just helps with indexing of the arrays.

What does the “Accelerator scalar kernel generated” mean as compared to the “Accelerator kernel generated”?
Does this mean a non-parallel kernel is being created as well?

Because you’re arrays are allocatable, the compiler needs to update their array descriptors on the device. The “scalar kernel” is the generated kernel to update these descriptors.

-Mat