How are threads in 2D work unit mapped to warps? Arrangement/order of thread execution in 2D work un


How do threads in a single 2D work unit map to warps and/or half-warps?

To illustrate, imagine that work unit is 8x4 and warp size is 4 (so there’s 8 warps per work unit).

Is it like this (numbers denote the warp number):


or like this:



If and how can I influence that (short of using 1D work units and doing the mapping myself)?


Warp size depends on the GPU you have; if you run device query you can see the warp size listed. A warp is the minimum number of threads (32 on current NVIDIA GPUs) that a multiprocessor runs together, so the number of threads (work-items) in a work-group should be a multiple of the warp size. See 1.1BestPracticesGuide.pdf.

I understand that, but my question is different. For a given work-group (say, 16x16 work-items, i.e. 256 threads, a multiple of the warp size of 32), there are many ways in which warps could be arranged within the work-group. To name a few: 8 warps of 16x2 threads, 8 warps of 8x4 threads, 8 warps of 4x8 threads, etc. Is it possible to know or influence that arrangement?

I’m asking because in my kernel, threads in a 4x8 or 8x4 block are more likely to take the same branch, while warps arranged as 16x2 should have more divergent threads. However, I can’t just reduce the work-group size, because smaller work-groups perform worse (probably there’s just not enough occupancy).

I tried running a 1D grid and mapping threads to pixels myself, but it involves some overhead (multiplications/divisions in the general case), so it’s not a perfect solution.
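For what it's worth, the multiplications and divisions in that manual mapping can be replaced by shifts and masks when the tile and work-group widths are powers of two. The sketch below is plain C that only illustrates the index arithmetic; the same expressions could be applied to a 1D local ID inside a kernel. It assumes that a warp consists of consecutive linear thread IDs (as NVIDIA's documentation describes for its hardware) and that all dimensions are powers of two. The name `tiled_coord` and the `Coord` type are made up for this illustration.

```c
#include <assert.h>

typedef struct { unsigned x, y; } Coord;

/* Hypothetical helper: remap a linear local ID so that every run of
 * (tile width * tile height) consecutive IDs covers one rectangular tile.
 * All sizes are passed as log2 values, so they must be powers of two
 * (an assumption of this sketch, not a hardware requirement). */
Coord tiled_coord(unsigned lid,
                  unsigned tile_w_log2,
                  unsigned tile_h_log2,
                  unsigned group_w_log2)
{
    assert(tile_w_log2 <= group_w_log2);

    unsigned tile_bits = tile_w_log2 + tile_h_log2;

    /* Position inside the tile and index of the tile itself. */
    unsigned in_tile = lid & ((1u << tile_bits) - 1u);
    unsigned tile    = lid >> tile_bits;

    /* Tiles are laid out row by row across the work-group. */
    unsigned tiles_per_row_log2 = group_w_log2 - tile_w_log2;
    unsigned tile_x = tile & ((1u << tiles_per_row_log2) - 1u);
    unsigned tile_y = tile >> tiles_per_row_log2;

    Coord c;
    c.x = (tile_x << tile_w_log2) | (in_tile & ((1u << tile_w_log2) - 1u));
    c.y = (tile_y << tile_h_log2) | (in_tile >> tile_w_log2);
    return c;
}
```

With a 16x16 group and 8x4 tiles (`tiled_coord(lid, 3, 2, 4)`), IDs 0 to 31 cover the pixels from (0,0) to (7,3), and ID 32 starts the next tile at (8,0). Whether this is actually cheaper than the divergence it avoids would need to be measured on the target device.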

I honestly don't know if it can be done, since OpenCL gives you the logical abstraction of work-groups and work-items to allocate resources, while warps are closer to the hardware. Today the warp size on NVIDIA GPUs is 32, but tomorrow it could be something else, so a cross-GPU solution would probably hide that.

But hold on, I think there is an example in the SDK that makes use of the warp size.

Update: could you look at the oclHistogram example? There is a PDF, oclHistogram.pdf, which discusses warp-size considerations for a 256-bin histogram. There is even a diagram (#3) illustrating the warp. Perhaps you can check it out and enlighten us?

As far as I understand, with 32-thread warps, the 32 threads with consecutive IDs are in the same warp, and the first warp starts at ID 0.
In one-dimensional kernels it is quite easy to understand.
In “OpenCL Programming Guide for the CUDA Architecture” version 2.3, section 2.1.1 says:
“A thread is also given a unique thread ID within its block. The local ID of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).”
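So if I understand correctly, the i-th warp (counting from 1) contains the threads (lid0, lid1) for which `lid1*localSize0 + lid0` is in `[32*(i-1), 32*i - 1]`, where lid0 and lid1 are the local IDs and localSize0 is the work-group size in dimension 0.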
I organized my programs that way and it seemed to work.
Does that answer your question?
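
To make the formula from the quoted guide concrete, here is a small sketch in plain C that computes which warp a 2D local index would fall into. It assumes what is described above for NVIDIA hardware (warps are consecutive runs of linear thread IDs starting at ID 0); OpenCL itself does not guarantee this. The function names are made up for the illustration, and the warp index here is zero-based.

```c
#include <assert.h>

/* Linear thread ID of local index (x, y) in a work-group of width dx,
 * following the quoted formula x + y*Dx. */
unsigned linear_id(unsigned x, unsigned y, unsigned dx)
{
    return x + y * dx;
}

/* Zero-based warp index, ASSUMING warps are formed from consecutive
 * linear thread IDs (warp 0 holds IDs 0 .. warp_size-1, and so on). */
unsigned warp_of(unsigned x, unsigned y, unsigned dx, unsigned warp_size)
{
    assert(warp_size > 0 && x < dx);
    return linear_id(x, y, dx) / warp_size;
}
```

Under that assumption, a 16x16 work-group with 32-thread warps gives 16x2 warps (for example `warp_of(15, 1, 16, 32)` is 0 and `warp_of(0, 2, 16, 32)` is 1), and the 8x4 example from the question with a warp size of 4 gives 4x1 strips, two per row.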