Warp layout in a 2D thread block?

On a Fermi GPU, if each thread block has 16x16 threads, can anyone tell me how the 32 threads of a warp will be distributed? Will each warp cover two adjacent rows of 16 threads? That would seem logical, but I haven’t been able to find a definitive answer anywhere.

Further to this, obviously it will depend on the application somewhat, but in general on a Fermi GPU should blocks that are 32 threads wide give the best performance?

Thanks in advance.

16x16 = 256 threads per block, so 100% occupancy per multiprocessor is achievable (256 divides the 1536-thread limit evenly).

My guess is that 16x16 is simply converted to 256x1, but this does indeed seem to be undocumented, perhaps because they might change how the hardware works in the future.

As long as a minimum of 1536/256 = 6 blocks are resident per multiprocessor, it should give maximum occupancy.

(At least, 1536 is the maximum number of threads per multiprocessor for my GT 520 GPU; what does yours say?)
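In case anyone wants to check their own card, here is a minimal sketch using the runtime API (the maxThreadsPerMultiProcessor field of cudaDeviceProp is the number in question):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("%s: at most %d threads per multiprocessor\n",
           prop.name, prop.maxThreadsPerMultiProcessor);
    return 0;
}
```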

According to the programming guide, it goes by the x index first, then the y index, then the z index. For the purposes of warp grouping, threads don’t have 3-dimensional indices; they are collapsed into a single linear index, given by threadId = threadIdx.x+blockDim.x*(threadIdx.y+blockDim.y*threadIdx.z). Every 32 consecutive values of this index form a new warp.
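To make that concrete, here is a minimal kernel sketch (the kernel name is mine, not from the guide) that prints the linear index and warp number at the start of each row; device-side printf needs compute capability 2.0+, i.e. Fermi:

```
#include <cstdio>

// Each thread computes its flat in-block index; every 32 consecutive
// indices form one warp. Only the first thread of each row prints.
__global__ void showWarp()
{
    int threadId = threadIdx.x
                 + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    if (threadIdx.x == 0)
        printf("row %2d -> linear id %3d -> warp %d\n",
               threadIdx.y, threadId, threadId / warpSize);  // warpSize == 32
}

int main()
{
    showWarp<<<1, dim3(16, 16)>>>();  // one 16x16 block
    cudaDeviceSynchronize();          // flush device printf
    return 0;
}
```

With a 16x16 block, rows 0 and 1 get linear IDs 0–31 and therefore warp 0, so each warp does indeed cover two adjacent rows of 16, as guessed in the original post.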

These formulas are not in the “CUDA C Programming Guide Version 4.0”; if you believe otherwise, please state which section! :)

I have seen one little formula in the guide though, for just 2 dimensions.

Even your formula is still missing the grid.

Nonetheless, thanks for the formula… it seems the shortest one so far.

I still have to test it and make sure it’s valid, but it seems valid to me.

The main reason I can see for needing this is to ensure contiguous memory access; Section 5.3.2.1.2 of the programming guide covers this.
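The practical consequence: since consecutive threadIdx.x values sit in the same warp, making x the fastest-moving dimension of your global memory index keeps each warp’s accesses contiguous. A minimal sketch (kernel name and row-major layout are my own assumptions):

```
// Coalesced pattern: the 32 threads of a warp have consecutive x,
// so they touch 32 consecutive floats of a row-major array.
__global__ void scale(float *data, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        data[y * width + x] *= s;  // x varies fastest within a warp
}
```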

Found this in the guide, which describes it precisely: for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx);

and further down: for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).

This is not the same formula as spadflyer12’s.

His formula moves the dimensions outside of the parentheses: x + Dx*(y + Dy*z) rather than x + y*Dx + z*Dx*Dy, which saves a multiplication.

So his appears to be more efficient.
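For anyone who wants to verify that the two forms agree, here is a quick host-side check (the 16x16x4 block dimensions are arbitrary):

```
#include <assert.h>
#include <stdio.h>

int main(void)
{
    const int Dx = 16, Dy = 16, Dz = 4;  // arbitrary block dimensions
    for (int z = 0; z < Dz; ++z)
        for (int y = 0; y < Dy; ++y)
            for (int x = 0; x < Dx; ++x) {
                int expanded = x + y * Dx + z * Dx * Dy;  // guide's form
                int factored = x + Dx * (y + Dy * z);     // spadflyer12's form
                assert(expanded == factored);
            }
    printf("Both formulas agree for all %d thread indices.\n", Dx * Dy * Dz);
    return 0;
}
```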