Warp Size & Block Dim performance

Hi everybody,

I have some doubts about the performance of multi-dimensional blocks (i.e. blockDim(n, m)) when trying to benefit from the memory coalescing technique.

I assume that memory coalescing means that when a thread within a warp reads a memory address, the hardware also fetches the following 31 consecutive positions, so that the other threads in the warp hit that access. With 1D blocks, where the threads' accesses are consecutive (i, i+1, i+2, …, i+31), this works fine.
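A minimal sketch of the 1D coalesced pattern being described (the kernel name, the output array `out`, and the length `n` are mine, just for illustration):

```cuda
// Each thread of a warp reads a consecutive float: thread i reads a[i],
// thread i+1 reads a[i+1], ... so the 32 loads of a warp fall in the
// same aligned segment and coalesce into a single memory transaction.
__global__ void read1D(const float *a, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] * 2.0f;   // coalesced load and coalesced store
}
```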



But when working with matrices, what about 2D blocks?







// Option a


// Option b


// Important to take into account that dimX = blockDim.x * gridDim.x = 1024
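(The code for the two options did not survive in the post; the following is a reconstruction based on the indexing a[idx + idy*dimX] used in the question, so the exact variable names and the kernel signature are assumptions.)

```cuda
__global__ void kernel(float *a, int dimX)
{
    // Option a: idx runs along the x dimension of the block/grid
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    // Option b: the roles of x and y are swapped
    // int idx = blockIdx.y * blockDim.y + threadIdx.y;
    // int idy = blockIdx.x * blockDim.x + threadIdx.x;

    a[idx + idy * dimX] = 0.0f;
}
```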

Which is the best option, a) or b)? I mean, when accessing a[idx + idy*dimX], does the warp contain the next 31 threads along the x dimension or along the y dimension?


I think you can check the generated PTX code and see whether they do the same thing.

Threads are grouped into warps using indices in the x dimension first, then y, then z. So your option a should lead to coalesced memory accesses, whereas your option b shouldn't.
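A small host-side sketch of that partitioning rule, using a hypothetical 16x16 block (the linearization is the one quoted from the programming guide further down the thread):

```cuda
#include <cstdio>

int main()
{
    // For a block of size (Dx, Dy), thread (x, y) has thread ID x + y*Dx,
    // and warp k contains thread IDs 32*k .. 32*k + 31.
    const int Dx = 16, Dy = 16;
    for (int y = 0; y < Dy; ++y)
        for (int x = 0; x < Dx; ++x) {
            int tid  = x + y * Dx;
            int warp = tid / 32;   // warp size is 32
            if (x == 0)            // print one line per row of the block
                printf("thread (0,%2d) -> id %3d, warp %d\n", y, tid, warp);
        }
    // With Dx = 16, each warp spans two consecutive rows of the block,
    // so consecutive threadIdx.x values always stay adjacent in a warp:
    // accesses indexed by x coalesce, accesses indexed by y do not.
    return 0;
}
```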

Perfect! That was exactly what I wanted to know!

Have you got any reference to support it?

@YMC, I was trying to find out what @Giles_C answered. I wasn't sure how warps are formed (whether x, y or z comes first…).

Thank you Giles_C and YMC!


CUDA C Programming Guide, Version 4.2

Section 4.1 page 61 says: “The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Section 2.2 describes how thread IDs relate to thread indices in the block.”

And Section 2.2 page 8 says: “The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).”

Hence my comment!