Warps in OpenACC tile for Fortran

When I use an OpenACC loop vector tile in Fortran, how are the nested loops linearized into warps?

For example, if I have:

!$acc parallel num_gangs(zmax)
!$acc loop gang
do k = 1, zmax
  !$acc loop vector tile(xmax,ymax)
  do j = 1, ymax
    do i = 1, xmax
      ! ... loop body ...
    end do
  end do
end do
!$acc end parallel

Will threads with consecutive i values be placed in the same warp? Or will threads with consecutive j values be placed in the same warp?

I’m trying to fix some non-coalesced memory accesses in my app, and it seems like understanding this might help.

I also have the same question about multidimensional blocks in CUDA Fortran.

Hi Ron,

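In this case, “xmax” maps to the CUDA blockdim%x (the x-dimension block size) and “ymax” maps to blockdim%y. So yes, iterations of the “i” loop are grouped into a warp across threadidx%x. Threads within a block are numbered with threadidx%x varying fastest, and each run of 32 consecutive threads forms a warp, so the same ordering applies to multidimensional blocks in CUDA Fortran.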
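To make the ordering concrete, here is a minimal sketch in plain Fortran (no GPU needed) that computes which warp and lane a given (threadidx%x, threadidx%y) pair lands in. The 32x4 block shape, the program name, and the variable names are assumptions for illustration only; the arithmetic is the standard x-fastest linearization of a thread block.

```fortran
program warp_layout
  implicit none
  integer, parameter :: warp_size = 32
  ! Assumed block shape, standing in for blockdim%x and blockdim%y.
  integer, parameter :: bx = 32, by = 4
  integer :: tx, ty, linear_id, warp_id, lane

  ! threadidx%x and threadidx%y are 1-based in CUDA Fortran,
  ! so subtract 1 before linearizing.
  do ty = 1, by
    do tx = 1, bx, 8          ! sample every 8th thread to keep output short
      linear_id = (tx - 1) + (ty - 1) * bx   ! x varies fastest
      warp_id   = linear_id / warp_size      ! integer division
      lane      = mod(linear_id, warp_size)
      print '(4(a,i4))', ' tx=', tx, ' ty=', ty, &
                         ' warp=', warp_id, ' lane=', lane
    end do
  end do
end program warp_layout
```

With a block that is exactly 32 wide, every value of ty falls into its own warp, and consecutive tx values are consecutive lanes of that warp. If bx is not a multiple of 32, a warp wraps around into the next ty row.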

You might try a tile size of 32x4 (or other multiples of 32, up to a maximum product of 1024 threads). In that case you’d have 4 warps, each processing multiple iterations of the “i” loop, rather than ymax groups each processing a single iteration of the “i” loop. It may not matter, but if xmax and ymax are odd and/or not divisible by 32, you may be wasting threads and lowering your occupancy.
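As a rough sketch of what that could look like, here is your loop nest with a fixed 32x4 tile. The array name “a”, the problem sizes, and the loop body are hypothetical and only there to make the example complete. Since Fortran arrays are column-major, indexing with “i” in the first dimension means that consecutive threads in a warp touch consecutive memory locations.

```fortran
program tile_example
  implicit none
  ! Hypothetical problem sizes, chosen only for illustration.
  integer, parameter :: xmax = 100, ymax = 50, zmax = 8
  real, allocatable :: a(:,:,:)
  integer :: i, j, k

  allocate(a(xmax, ymax, zmax))

  !$acc parallel num_gangs(zmax) copyout(a)
  !$acc loop gang
  do k = 1, zmax
    ! Tile of 32 (i) by 4 (j): 128 threads, i.e. 4 warps per gang.
    !$acc loop vector tile(32,4)
    do j = 1, ymax
      do i = 1, xmax
        ! "i" is the first (contiguous) index, so accesses coalesce.
        a(i, j, k) = real(i + j + k)
      end do
    end do
  end do
  !$acc end parallel

  print *, a(1, 1, 1), a(xmax, ymax, zmax)
end program tile_example
```

Without OpenACC enabled, the directives are treated as comments and the program runs serially, which is handy for checking results. With the NVIDIA compiler, building with -acc -Minfo=accel should report the schedule that was actually chosen, so you can confirm the block shape.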

-Mat