Warp layout in a 2D thread block?

On a Fermi GPU, if each thread block has 16x16 threads, can anyone tell me how the 32 threads of a warp will be distributed? Will each warp cover two adjacent rows of 16 threads? That would seem logical, but I haven’t been able to find a definitive answer anywhere.

Further to this, obviously it will depend on the application somewhat, but in general on a Fermi GPU should blocks that are 32 threads wide give the best performance?

Thanks in advance.

16x16 = 256 threads per block, so 100% occupancy per multiprocessor is achievable (256 divides the 1536-thread limit evenly).

My guess is that 16x16 is simply converted to 256x1, but this does indeed seem to be undocumented, perhaps because they might change how the hardware works in the future.

As long as a minimum of 1536/256 = 6 blocks are resident per multiprocessor, it should give maximum occupancy.

(At least, 1536 is the maximum number of threads per multiprocessor for my GT 520 GPU; what does yours say?)
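In case anyone wants to check their own card, here is a minimal sketch using the runtime API (the maxThreadsPerMultiProcessor field of cudaDeviceProp is the number in question):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("%s: at most %d threads per multiprocessor\n",
           prop.name, prop.maxThreadsPerMultiProcessor);
    return 0;
}
```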

According to the programming guide, it goes by the x index first, then the y index, then the z index. For the purposes of warp grouping, threads don’t have 3-dimensional indices; they are collapsed into a single linear index, given by threadId = threadIdx.x+blockDim.x*(threadIdx.y+blockDim.y*threadIdx.z). Every 32 consecutive values of this index form a new warp.
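To make that concrete, here is a minimal kernel sketch (the kernel name is mine, not from the guide) that prints the linear index and warp number at the start of each row; device-side printf needs compute capability 2.0+, i.e. Fermi:

```
#include <cstdio>

// Each thread computes its flat in-block index; every 32 consecutive
// indices form one warp. Only the first thread of each row prints.
__global__ void showWarp()
{
    int threadId = threadIdx.x
                 + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    if (threadIdx.x == 0)
        printf("row %2d -> linear id %3d -> warp %d\n",
               threadIdx.y, threadId, threadId / warpSize);  // warpSize == 32
}

int main()
{
    showWarp<<<1, dim3(16, 16)>>>();  // one 16x16 block
    cudaDeviceSynchronize();          // flush device printf
    return 0;
}
```

With a 16x16 block, rows 0 and 1 get linear IDs 0–31 and therefore warp 0, so each warp does indeed cover two adjacent rows of 16, as guessed in the original post.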

These formulas are not in the “CUDA C Programming Guide Version 4.0”; if you believe otherwise, please state which section! :)

I have seen one little formula in the guide though, for just 2 dimensions.

Even your formula is still missing the grid.

Nonetheless, thanks for the formula… it seems the shortest one so far.

I still have to test it and make sure it’s valid, but it seems valid to me.

The main reason I can see for needing this is to ensure contiguous memory access; Section 5.3.2.1.2 of the programming guide covers this.
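The practical consequence: since consecutive threadIdx.x values sit in the same warp, making x the fastest-moving dimension of your global memory index keeps each warp’s accesses contiguous. A minimal sketch (kernel name and row-major layout are my own assumptions):

```
// Coalesced pattern: the 32 threads of a warp have consecutive x,
// so they touch 32 consecutive floats of a row-major array.
__global__ void scale(float *data, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        data[y * width + x] *= s;  // x varies fastest within a warp
}
```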

Found this in the guide, which describes it precisely: for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx);

and further down: for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).

This is not the same formula as spadflyer12’s.

His formula moves the dimensions outside of the parentheses: x + Dx*(y + Dy*z) rather than x + y*Dx + z*Dx*Dy, which saves a multiplication.

So his appears to be more efficient.
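For anyone who wants to verify that the two forms agree, here is a quick host-side check (the 16x16x4 block dimensions are arbitrary):

```
#include <assert.h>
#include <stdio.h>

int main(void)
{
    const int Dx = 16, Dy = 16, Dz = 4;  // arbitrary block dimensions
    for (int z = 0; z < Dz; ++z)
        for (int y = 0; y < Dy; ++y)
            for (int x = 0; x < Dx; ++x) {
                int expanded = x + y * Dx + z * Dx * Dy;  // guide's form
                int factored = x + Dx * (y + Dy * z);     // spadflyer12's form
                assert(expanded == factored);
            }
    printf("Both formulas agree for all %d thread indices.\n", Dx * Dy * Dz);
    return 0;
}
```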