My question is a pretty easy one, but I can’t find the answer in the official documentation.
We can have 3-dimensional thread blocks (x, y, z), and warps are processed 16 threads (a half-warp) at a time.
How are these 16 threads selected from the x, y, z indices? My kernel outputs 4 bytes at a time,
and I want perfect coalescing.
In my case I prefer dividing the block into (4,4,6) to ease indexing, but I want to know the order in which
threads are packed into 16-thread half-warps.
Threads are ordered sequentially within a warp in column-major order (threadIdx.x varies fastest, then y, then z), and by extension the threads in any given half-warp are sequential, which is what matters for memory coalescing. The order in which warps execute within a given block, however, is entirely up to the scheduler.
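To make the ordering concrete, here is a minimal host-side sketch in plain C++ of how a 3D thread index maps to a linear thread ID and from there to a warp and half-warp. The function and constant names (linear_thread_id, warp_of, half_warp_of, print_block_mapping, kWarpSize, kHalfWarpSize) are made up for illustration and are not part of the CUDA API; inside a real kernel you would use threadIdx and blockDim instead of the parameters.

```cpp
#include <cstdio>

// Hypothetical constants for illustration (not CUDA API names).
const unsigned kWarpSize = 32;      // threads per warp
const unsigned kHalfWarpSize = 16;  // threads per half-warp

// Linear thread ID inside a block: x varies fastest, then y, then z.
// In a kernel the equivalent expression would be
//   threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y
unsigned linear_thread_id(unsigned x, unsigned y, unsigned z,
                          unsigned dim_x, unsigned dim_y) {
    return x + y * dim_x + z * dim_x * dim_y;
}

// Threads are grouped into warps (and half-warps) by consecutive linear ID.
unsigned warp_of(unsigned tid) { return tid / kWarpSize; }
unsigned half_warp_of(unsigned tid) { return tid / kHalfWarpSize; }

// Print the linear ID, warp and half-warp of every thread in a block.
void print_block_mapping(unsigned dim_x, unsigned dim_y, unsigned dim_z) {
    for (unsigned z = 0; z < dim_z; ++z)
        for (unsigned y = 0; y < dim_y; ++y)
            for (unsigned x = 0; x < dim_x; ++x) {
                unsigned tid = linear_thread_id(x, y, z, dim_x, dim_y);
                std::printf("(%u,%u,%u) -> tid %u, warp %u, half-warp %u\n",
                            x, y, z, tid, warp_of(tid), half_warp_of(tid));
            }
}
```

Calling print_block_mapping(4, 4, 6) lists all 96 threads of the (4,4,6) block from the question. Since 4 * 4 = 16, each z-plane of that block is exactly one half-warp: thread (3,3,0) has linear ID 15 and thread (0,0,1) has linear ID 16. So if each thread writes its 4-byte result at an offset derived from this linear ID, the threads of a half-warp touch consecutive 4-byte words.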
I’m trying to understand how this relates to memory layout. According to page 20 of the 2.3 programming guide, matrices are stored in row-major order, and according to my measurements the threads of a 3D block also appear to be executed in row-major order.