When I use OpenACC loop vector tile in Fortran, how are the nested loops linearized into warps?
For example, if I have:
!$acc parallel num_gangs(zmax)
!$acc loop gang
do k = 1, zmax
  !$acc loop vector tile(xmax,ymax)
  do j = 1, ymax
    do i = 1, xmax
will threads with consecutive i values be placed in the same warp? Or will threads with consecutive j values be placed in the same warp?
I’m trying to fix some non-coalesced memory accesses in my app, and it seems like understanding this might help.
I also have the same question about multidimensional blocks in CUDA Fortran.
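In this case, “xmax” maps to the CUDA blockdim%x (the x-dimension block size) and “ymax” maps to blockdim%y. So yes, iterations of the “i” loop are grouped into a warp along threadidx%x. The same rule covers multidimensional blocks in CUDA Fortran: threads are numbered with threadidx%x varying fastest, and each consecutive run of 32 threads forms a warp.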
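To make the x-first numbering concrete, here is a minimal sketch in plain Fortran (no GPU needed). The module name warp_map and the function warp_of are hypothetical names made up for illustration. It assumes a warp size of 32 and the standard CUDA rule that a thread's linear id in a 2-D block is x + y * blockdim%x when counted from zero.

```fortran
module warp_map
  implicit none
  ! Assumption: 32 threads per warp, as on current NVIDIA GPUs.
  integer, parameter :: warp_size = 32
contains
  ! Zero-based warp number of the thread at 1-based position (tx, ty)
  ! in a 2-D block whose x extent is bdx (the role of blockdim%x).
  pure integer function warp_of(tx, ty, bdx) result(w)
    integer, intent(in) :: tx, ty, bdx
    integer :: linear
    ! Zero-based linear thread id: x varies fastest, then y.
    linear = (tx - 1) + (ty - 1) * bdx
    ! Each consecutive run of warp_size linear ids is one warp.
    w = linear / warp_size
  end function warp_of
end module warp_map
```

With bdx = 32, warp_of(1, 1, 32) and warp_of(32, 1, 32) are both 0, while warp_of(1, 2, 32) is 1, so each row of i values is exactly one warp. With bdx = 20, warp_of(1, 2, 20) is still 0, meaning a single warp straddles two j rows, which is one way accesses can end up split across non-contiguous memory.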
You might try using a tile size of 32x4 (or other multiples of 32, up to a maximum product of 1024). In this case you’d have 4 warps, each processing multiple iterations of the “i” loop, rather than ymax groups each processing a single iteration of the “i” loop. Granted, it may not matter, but if xmax and ymax are odd numbers and/or not divisible by 32, you may be wasting threads and lowering your occupancy.
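A rough sketch of the 32x4 suggestion applied to the loop nest from the question follows. The module name tile_sketch, the subroutine fill_tiled, its arguments, and the loop body are all hypothetical and only there to make the example compile. As in the original post, the first tile size pairs with the inner “i” loop. Without OpenACC enabled the directives are treated as comments and the code runs serially.

```fortran
module tile_sketch
  implicit none
contains
  ! Fill a(nx, ny, nz) using the gang / vector-tile structure from the
  ! question, but with a fixed 32 x 4 tile instead of tile(xmax,ymax).
  subroutine fill_tiled(a, nx, ny, nz)
    integer, intent(in) :: nx, ny, nz
    real, intent(out) :: a(nx, ny, nz)
    integer :: i, j, k

    !$acc parallel num_gangs(nz) copyout(a)
    !$acc loop gang
    do k = 1, nz
      ! 32 iterations of i by 4 iterations of j per tile.
      !$acc loop vector tile(32,4)
      do j = 1, ny
        do i = 1, nx
          ! i is the leftmost (fastest-varying) index, so consecutive
          ! i values touch contiguous memory in column-major Fortran.
          a(i, j, k) = real(i + j + k)
        end do
      end do
    end do
    !$acc end parallel
  end subroutine fill_tiled
end module tile_sketch
```

The loop bounds do not need to be multiples of the tile size. Whether this is faster than tile(xmax,ymax) depends on the actual extents and the kernel, so it is worth profiling both.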