SIMD question: Is the number of actual execution units relevant to a warp?

I appreciate that this may be a naive question, but I have so far not been able to find the answer in the manual or online.

I have an Nvidia GTX285 graphics card.

I am trying to assess the efficiency of my code and optimise it, but one problem I have is conditional branching. In fact, worse than that: conditional looping, where some threads will iterate more times than others. However, the following occurred to me:

  1. A warp is 32 threads running in SIMD (or SIMT), so conditional branching is slow because all threads take all paths.
  2. On a GTX285, I believe each SM has 8 single-precision units and one double-precision unit.
  3. My code is mostly double precision.

So this means that in single precision only 8 threads of the warp can actually be running at once, and in double precision only one! So what if one group of 8 threads all takes the same branch, even if other threads in the warp take a different branch? Does the SIMD execution only apply to those 8, or will they still have to execute both branches?

Furthermore, as I'm mostly working in double precision, if the former is the case then surely there is no SIMD at all, and the branching should not matter much.

So the basic question is whether the lock-step instructions apply over the full warp regardless of the actual hardware, or only to the threads actually running on execution units at that moment? I'm guessing it's the latter, since then the logic is independent of the hardware, but it seems a waste to be running lock-step over 32 threads with one execution unit.
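To make the looping pattern concrete, here is a rough sketch of the kind of kernel I mean (the name and loop body are made up):

```
// Sketch of the divergent-loop pattern described above. n_iters is a
// hypothetical per-thread iteration count, so threads in the same warp
// may loop different numbers of times.
__global__ void divergent_loop(const int *n_iters, double *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double acc = 0.0;
    for (int i = 0; i < n_iters[tid]; ++i) {
        acc += 1.0 / (1.0 + i + tid);   // stand-in for the real work
    }
    out[tid] = acc;
}
```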

Secondly, I am dealing with quite a few variables for each thread, so I imagine I don't want too many warps (or blocks) assigned to an SM at one time or they will spill over. I calculate that at the maximum of 32 warps per SM, that only gives me 16 bytes per thread: 2 doubles! Will it automatically assign as many warps as possible? Is there any way I can control this?
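In case it clarifies what I mean by spilling over, here is a sketch of the budget I have in mind. I am assuming the per-thread scratch lives in the 16 KB of shared memory per SM, so 32 resident warps (1024 threads) leave 16 bytes, i.e. two doubles, per thread (kernel name is made up):

```
// Hypothetical sizing: two doubles of shared-memory scratch per thread.
// 16 KB per SM / (32 warps * 32 threads) = 16 bytes per thread.
__global__ void scratch_kernel(double *out)
{
    extern __shared__ double scratch[];        // sized at launch time
    double *mine = &scratch[2 * threadIdx.x];  // this thread's two doubles
    mine[0] = (double)threadIdx.x;             // stand-in for real work
    mine[1] = 2.0 * mine[0];
    out[blockIdx.x * blockDim.x + threadIdx.x] = mine[0] + mine[1];
}

// Launch: the third <<<>>> argument is dynamic shared memory per block.
// scratch_kernel<<<numBlocks, 256, 256 * 2 * sizeof(double)>>>(d_out);
```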

Many thanks for any answers.

Correct, although it is easy to overestimate the significance of this. It is important to benchmark before doing too much branch optimization.
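If it helps, a minimal timing sketch using CUDA events (my_kernel, the sizes, and the launch configuration are placeholders):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(double *x) { /* placeholder body */ }

int main()
{
    double *d_x;
    cudaMalloc((void **)&d_x, 1024 * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    my_kernel<<<4, 256>>>(d_x);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);      // block until the kernel finishes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```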

Correct.

This is not how the hardware works. The SPs are pipelined, so when the instruction scheduler selects a warp to run, all 32 threads of that warp are sent to the pipelines of the 8 SPs. You should not think of the SPs as being assigned to a thread for some duration of time. Instead, they see a constant flow of instructions from different threads which take many shader clock ticks to complete, with one warp finishing every 4 clock ticks.

The scheduler has to issue instructions for an entire warp at a time, regardless of hardware configuration. (There is a slight exception to this on your device: memory reads and writes are issued in half-warp units, but this was an aberration that does not persist in newer cards.) The main reason for this is to save on transistors so that more chip area can be devoted to floating point units.
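One practical upshot: since instructions are issued per warp, divergence only costs anything when threads within the same 32-thread warp disagree. If you can arrange for a branch condition to be uniform across each warp, both paths never have to be executed. A sketch (hypothetical kernel):

```
// Sketch: a branch that is uniform within each warp causes no divergence,
// because all 32 threads of a warp take the same path.
__global__ void warp_uniform_branch(double *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;     // warp index within the block

    if (warp % 2 == 0) {             // same answer for every thread in a warp
        out[tid] = 1.0;
    } else {
        out[tid] = 2.0;
    }
}
```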

When you launch a kernel, you select the number of blocks and the number of threads per block. The number of warps per block is the number of threads per block divided by 32, rounded up to the nearest integer. You cannot overflow an SM with warps, because the kernel will refuse to launch if your block size exceeds the capability of the device. If you make your blocks small enough, the hardware may decide to run multiple blocks on one SM at the same time, but you have no control over this.
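To spell out the arithmetic, an illustrative launch (my_kernel and d_x are placeholders):

```
// Illustrative launch configuration; warps per block = ceil(threads / 32).
const int N = 10000;
const int threadsPerBlock = 200;    // deliberately not a multiple of 32
const int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // 50
const int warpsPerBlock = (threadsPerBlock + 31) / 32;  // 7; the last warp is partly idle

my_kernel<<<numBlocks, threadsPerBlock>>>(d_x);
```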

OK, thanks for this comprehensive answer, I feel much clearer now.