Warp formation of small multidimensional blocks

Hi all,

I have a small question. If I create, say, a two-dimensional block of size <16, 2>, do the 16 threads for y=0 and the 16 threads for y=1 form one warp or two independent warps? I am aware that block sizes that are a multiple of the warp size make better use of the SMs, but in my application I have decided that it might be better not to follow this approach for some problem sizes. Additionally, for even smaller blocks, let's say <8, 2>, do the coalescing rules for global and shared memory accesses apply to the 16 threads as a half-warp, or to each group of 8 threads independently?


That is one warp. Block dimensions are just a language-level device: threads within a block are numbered sequentially with x varying fastest (thread ID = x + y*Dx + z*Dx*Dy for a block of size <Dx, Dy, Dz>), and warps are formed by taking consecutive groups of 32 threads from that sequence.
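To make the ordering concrete, here is a minimal sketch that launches a single <16, 2> block and has every thread record the warp and lane it falls into. The names linearThreadId, recordWarpLayout, warpOf and laneOf are made up for this illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Linear thread ID within a block: x varies fastest, then y, then z.
// (Hypothetical helper, not part of the CUDA API.)
__host__ __device__ inline unsigned linearThreadId(unsigned x, unsigned y, unsigned z,
                                                   unsigned dimX, unsigned dimY)
{
    return x + y * dimX + z * dimX * dimY;
}

// Each thread records which warp and which lane of that warp it occupies.
__global__ void recordWarpLayout(int *warpOf, int *laneOf)
{
    unsigned tid = linearThreadId(threadIdx.x, threadIdx.y, threadIdx.z,
                                  blockDim.x, blockDim.y);
    warpOf[tid] = tid / warpSize;
    laneOf[tid] = tid % warpSize;
}

int main()
{
    const dim3 block(16, 2);            // the <16, 2> block from the question
    const int n = block.x * block.y;    // 32 threads in total

    int *warpOf = nullptr, *laneOf = nullptr;
    cudaMallocManaged(&warpOf, n * sizeof(int));
    cudaMallocManaged(&laneOf, n * sizeof(int));

    recordWarpLayout<<<1, block>>>(warpOf, laneOf);
    cudaDeviceSynchronize();

    // Print the mapping row by row; no error checking in this sketch.
    for (unsigned y = 0; y < block.y; ++y)
        for (unsigned x = 0; x < block.x; ++x) {
            unsigned tid = linearThreadId(x, y, 0, block.x, block.y);
            printf("(x=%2u, y=%u) -> warp %d, lane %2d\n", x, y, warpOf[tid], laneOf[tid]);
        }

    cudaFree(warpOf);
    cudaFree(laneOf);
    return 0;
}
```

With a warp size of 32, the y=0 row should land in lanes 0 to 15 and the y=1 row in lanes 16 to 31 of the same warp.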

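The warp and half-warp "rules" of the execution model are invariant; they do not depend on the shape of the block. In your <8, 2> example the 16 threads have consecutive thread IDs 0 to 15, so they are treated as one half-warp, not as two independent groups of 8. If a block has fewer than 32 threads, the hardware just pads it with dummy threads which are masked out and still runs a single 32-thread warp. So if you choose fewer than 32 threads per block, all you are doing is wasting cycles.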
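As a rough illustration of the padding cost, the sketch below counts how many warps a block occupies and how many lanes in them are left idle. It hard-codes a warp size of 32 (an assumption that matches the discussion above), and warpsPerBlock and idleLanes are hypothetical helper names, not CUDA API calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed warp size; device code would normally read the built-in warpSize.
constexpr unsigned kWarpSize = 32;

// Number of warps needed to hold `threads` threads (rounded up).
__host__ __device__ constexpr unsigned warpsPerBlock(unsigned threads)
{
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Lanes that are padded with masked-out dummy threads.
__host__ __device__ constexpr unsigned idleLanes(unsigned threads)
{
    return warpsPerBlock(threads) * kWarpSize - threads;
}

int main()
{
    // Block shapes from the question, plus a single row of 8 for comparison.
    const dim3 shapes[] = { dim3(16, 2), dim3(8, 2), dim3(8, 1) };

    for (const dim3 &b : shapes) {
        unsigned threads = b.x * b.y * b.z;
        printf("<%u, %u>: %u threads, %u warp(s), %u idle lane(s)\n",
               b.x, b.y, threads, warpsPerBlock(threads), idleLanes(threads));
    }
    return 0;
}
```

For the <8, 2> block this gives one warp with 16 idle lanes, so half of the warp's execution slots do no useful work.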