My question is a pretty easy one, but I can't find the answer in the official documentation.
We can have 3-dimensional thread blocks (x, y, z), and memory accesses are issued per half-warp, 16 threads at a time.
How are these 16 threads selected from the x, y, z indices? My kernel outputs 4 bytes at a time,
and I want perfectly coalesced writes.
In my case I would prefer to divide the block as (4,4,6) to ease indexing, but I want to know the order in which
threads are packed into the 16-thread half-warps.
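
To make the question concrete, here is a minimal host-side sketch of the mapping, assuming the linearization the CUDA Programming Guide gives for thread IDs (x varies fastest, then y, then z) and assuming half-warps are simply consecutive groups of 16 linear IDs. The names Dim3, linear_thread_id, half_warp_of and lane_in_half_warp are hypothetical helpers, not part of the CUDA API; in a real kernel the inputs would come from threadIdx and blockDim.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 / threadIdx so this compiles on the host.
struct Dim3 { unsigned x, y, z; };

// Linear thread ID within a block of size dim:
// x varies fastest, then y, then z (assumed from the Programming Guide).
inline unsigned linear_thread_id(Dim3 idx, Dim3 dim) {
    assert(idx.x < dim.x && idx.y < dim.y && idx.z < dim.z);
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}

// Index of the half-warp (16 consecutive linear IDs) the thread belongs to.
inline unsigned half_warp_of(Dim3 idx, Dim3 dim) {
    return linear_thread_id(idx, dim) / 16;
}

// Position of the thread inside its half-warp, in the range 0..15.
inline unsigned lane_in_half_warp(Dim3 idx, Dim3 dim) {
    return linear_thread_id(idx, dim) % 16;
}
```

Under these assumptions, a (4,4,6) block has 4 * 4 = 16 threads per z-slice, so each value of z would correspond to exactly one half-warp, and the lane within it would be x + 4 * y. Writes indexed by that lane (4 bytes each) would then hit consecutive addresses within a half-warp.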