I have some doubts about the performance of multidimensional blocks (i.e. blocksDim(n,m)) when trying to get the benefits of memory coalescing.
I assume that memory coalescing means that when a thread within a warp reads a memory address, the cache also brings in the next 31 consecutive memory positions, so that the other threads in the warp hit on their accesses. With 1D blocks whose threads access consecutive addresses (i, i+1, i+2, …, i+31), this works fine.
// Option a
// Option b
// Important to take into account that dimY=blocksDim.x*gridDim.x=1024
Which is the better option, a) or b)? I mean, when accessing a[idx+idy*dimX], does the warp contain the next 31 threads along the x dimension or along the y dimension?
Section 4.1 page 61 says: “The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Section 2.2 describes how thread IDs relate to thread indices in the block.”
And Section 2.2 page 8 says: “The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).”
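To make the two quoted passages concrete, here is a minimal host-side sketch of how I read them: the thread ID formula from Section 2.2 combined with the "consecutive, increasing thread IDs" rule from Section 4.1. The names `threadId2D`, `warpOf`, `kWarpSize` and `checkExamples` are made up for illustration, and the warp size of 32 and the 16x16 block are assumptions of mine, not something taken from the guide text quoted above.

```cpp
#include <cassert>

// Assumption: 32 threads per warp.
constexpr int kWarpSize = 32;

// Thread ID of thread (x, y) in a 2D block of size (Dx, Dy),
// following the quoted formula: x + y * Dx.
constexpr int threadId2D(int x, int y, int Dx) { return x + y * Dx; }

// Warps hold consecutive thread IDs, so the warp index is the
// thread ID divided by the warp size (integer division).
constexpr int warpOf(int tid) { return tid / kWarpSize; }

// Hypothetical 16x16 block: x varies fastest, so each warp
// covers two full rows of the block (y and y + 1).
inline void checkExamples() {
    assert(warpOf(threadId2D(0, 0, 16)) == 0);   // first thread of row 0
    assert(warpOf(threadId2D(15, 1, 16)) == 0);  // last thread of row 1, same warp
    assert(warpOf(threadId2D(0, 2, 16)) == 1);   // row 2 starts the next warp
}
```

If this reading is right, the threads of a warp are neighbours along x first, and only move on to the next y once a row of the block is exhausted. So with an index like a[idx+idy*dimX], the consecutive addresses would come from idx being built from threadIdx.x.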