Multi-dimensional blocks and warps: attempts to achieve perfect coalescence.

My question is a pretty easy one, but one for which I can’t find an answer in the official documentation.

We can have 3-dimensional thread blocks (x, y, z), and memory accesses are handled 16 threads (a half-warp) at a time.
How are these 16 threads selected from the x, y, z parameters? My kernel is outputting 4 bytes per thread at a time,
and I want to have perfect coalescence.

In my case I prefer dividing the block into (4,4,6) to ease indexing, but I want to know the order in which
threads are packed into 16-thread half-warps.

Thanks.

Threads are packed into warps sequentially by linear thread ID, in column-major order over (x, y, z), i.e. with x varying fastest. By extension, the threads in any given half-warp are consecutive, which is what is important for memory coalescing. The order in which warps are executed within a given block, however, is completely at the whim of the scheduler.
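To make the packing concrete, here is a minimal host-side sketch in plain C (not device code). It assumes the linear thread ID formula from the Programming Guide, x + y*Dx + z*Dx*Dy, and a half-warp of 16 threads; the function names are hypothetical and chosen only for illustration.

```c
#include <stdio.h>

/* Linear thread ID inside a block of extent (dx, dy, dz):
   x varies fastest, then y, then z. */
int linear_thread_id(int x, int y, int z, int dx, int dy)
{
    return x + y * dx + z * dx * dy;
}

/* Half-warp index, assuming 16 threads per half-warp (32 per warp). */
int half_warp_of(int tid)
{
    return tid / 16;
}

/* Hypothetical helper: list the half-warp each thread of a block lands in. */
void print_half_warps(int dx, int dy, int dz)
{
    for (int z = 0; z < dz; ++z)
        for (int y = 0; y < dy; ++y)
            for (int x = 0; x < dx; ++x) {
                int tid = linear_thread_id(x, y, z, dx, dy);
                printf("(%d,%d,%d) -> tid %2d, half-warp %d\n",
                       x, y, z, tid, half_warp_of(tid));
            }
}
```

Calling print_half_warps(4, 4, 6) for the block shape from the question shows that each z slice of 4 x 4 = 16 threads fills exactly one half-warp, so a 4-byte write at element index tid should be consecutive within every half-warp.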

Thanks a lot!

I have a weird feeling that the CUDA 2.3 profiler on Linux is not showing the real coalescence numbers.

I’m trying to understand how it relates to memory layout. According to page 20 of the 2.3 Programming Guide, matrices are stored in row-major order, and it seems (according to my measurements) that the threads of a 3D block are also executed in row-major order.

Hm…

All C storage is in row-major order, and CUDA is no different. Within a thread block the x dimension is the fastest varying, then y, then z.

(Programming Guide v2.2, p.8)
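The two statements line up if the output is viewed as a C array indexed [z][y][x]. The following host-side C sketch assumes the (4,4,6) block from the question and a 4-byte float element; the constants and the function name are illustrative only. It computes the flat element offset of out[z][y][x], which matches the linear thread ID x + y*Dx + z*Dx*Dy.

```c
#include <stddef.h>

/* Block extent taken from the question: (x, y, z) = (4, 4, 6). */
enum { DX = 4, DY = 4, DZ = 6 };

/* Flat element offset of out[z][y][x] in a C array declared as
   float out[DZ][DY][DX]. C row-major storage makes the last
   index (x) contiguous in memory. */
size_t element_offset(int x, int y, int z)
{
    static float out[DZ][DY][DX];
    const char *base = (const char *)out;
    const char *elem = (const char *)&out[z][y][x];
    return (size_t)(elem - base) / sizeof(float);
}
```

Under these assumptions, a thread that writes out[threadIdx.z][threadIdx.y][threadIdx.x] touches the element whose offset equals its linear thread ID, so the 16 threads of a half-warp write 16 consecutive 4-byte words.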