Thread to warp assignement How block's threads get mapped to warps?

I have a question that I couldn’t answer myself from reading the manual.

How do the threads inside a thread block get assigned to half warps ?

The simple situation (a block = 16 threads) is clear : each block goes as-is to a halfwarp, several blocks are grouped together to fill the up to 768 threads that a processor can handle, as long as the shared memory suffice.

Now what happens with a 2D block that contains 16x4 threads ?
Do each row get assigned as a half-warp ? 4 grouped by packs of 4 row to fill up the maximum parallel thread capacity ?
Or are the threads assigned by groups of 4 column (4 x 4 = 16) ?
Or some other algorithm for keeping neighbouring threads together ?

Or did I completely fail to understand how threads are organised on the CPU ?

Thank you a lot !

You’ll never fill up all 768 threads with half-warps. The device executes whole warps as the smallest unit of execution. Half-warps only come into play with shared memory banks.

To answer your other question, the manual says:
“The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Section 2.2.1 describes how thread IDs relate to thread indices in the block.” (section 3.2)

“Each thread is identified by its thread ID, which is the thread number within the block. To help with complex addressing based on the thread ID, an application can also specify a block as a two- or three-dimensional array of arbitrary size and identify each thread using a 2- or 3-component index instead. For a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx) and for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).” (section 2.2.1)

So, threads with consecutive thread ids will be merged into warps.

However, while “the manual says so” is often used as the final word, some tests seem to suggest that this thread-warp merging is not the case:

Specifically, look at the timings for the 64x4 and the 4x64 blocks sizes. If consecutive threads are indeed merged into warps, then these two sizes should have the same timings. The results there seem to indicate that threads are not merged into warps across multiple rows of the block.

Though, because this test uses a texture read, the performance differences in 4x64 vs 64x4 may be due to the 2D texture cache… but I don’t find it likely.

That table does make it look like you should always set blockDim.x to a multiple of the warp size, whereas blockDim.y doesn’t matter.

Have you tried it out with a kernel that doesn’t use textures, and always writes non-coalesced?

If you follow the “row-major” order for 2D and 3D arrays, you can figure out the 1D thread IDs for cases when threadblocks are 2 or 3D. Then, starting with 0, consecutive 32 threads are treated as a warp. A warp is made up of two half-warps - the “lower” and “upper” 16 threads.