SIMD question: Is the number of actual execution units relevant to a warp?

I appreciate that this may be a naive question, but I have so far not been able to find the answer in the manual or online.

I have an Nvidia GTX285 graphics card.

I am trying to assess the efficiency of my code and optimise it, but one problem I have is conditional branching. In fact, worse than that: conditional looping, where some threads will iterate more times than others. However, the following occurred to me:

  1. A warp is 32 threads running in SIMD (or SIMT), so conditional branching is slow because all threads take all paths.
  2. On a GTX285, I believe each SM has 8 single-precision units and one double-precision unit.
  3. My code is mostly double precision.

So this means that in single precision only 8 threads of the warp can actually be running at once, and in double precision only one! So what if one group of 8 threads all takes the same branch, even if other threads in the warp take a different branch? Does the SIMD execution only apply to those 8, or will they still have to execute both branches?

Furthermore, as I'm mostly working in double precision, if the former is the case then surely there is no SIMD at all, and the branching should not matter much.

So the basic question is whether the lock-step instructions apply over the full warp regardless of the actual hardware, or only to the threads actually running on execution units at that moment? I'm guessing it's the latter, since then the logic is independent of the hardware, but it seems a waste to be running lock-step over 32 threads with one execution unit.
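To make the looping pattern concrete, here is a rough sketch of the kind of kernel I mean (the name and loop body are made up):

```
// Sketch of the divergent-loop pattern described above. n_iters is a
// hypothetical per-thread iteration count, so threads in the same warp
// may loop different numbers of times.
__global__ void divergent_loop(const int *n_iters, double *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double acc = 0.0;
    for (int i = 0; i < n_iters[tid]; ++i) {
        acc += 1.0 / (1.0 + i + tid);   // stand-in for the real work
    }
    out[tid] = acc;
}
```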

Secondly, I am dealing with quite a few variables for each thread, so I imagine I don't want too many warps (or blocks) assigned to an SM at one time or they will spill over. I calculate that at the maximum of 32 warps per SM, that only gives me 16 bytes per thread: 2 doubles! Will it automatically assign as many warps as possible? Is there any way I can control this?
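In case it clarifies what I mean by spilling over, here is a sketch of the budget I have in mind. I am assuming the per-thread scratch lives in the 16 KB of shared memory per SM, so 32 resident warps (1024 threads) leave 16 bytes, i.e. two doubles, per thread (kernel name is made up):

```
// Hypothetical sizing: two doubles of shared-memory scratch per thread.
// 16 KB per SM / (32 warps * 32 threads) = 16 bytes per thread.
__global__ void scratch_kernel(double *out)
{
    extern __shared__ double scratch[];        // sized at launch time
    double *mine = &scratch[2 * threadIdx.x];  // this thread's two doubles
    mine[0] = (double)threadIdx.x;             // stand-in for real work
    mine[1] = 2.0 * mine[0];
    out[blockIdx.x * blockDim.x + threadIdx.x] = mine[0] + mine[1];
}

// Launch: the third <<<>>> argument is dynamic shared memory per block.
// scratch_kernel<<<numBlocks, 256, 256 * 2 * sizeof(double)>>>(d_out);
```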

Many thanks for any answers.

Correct, although it is easy to overestimate the significance of this. It is important to benchmark before doing too much branch optimization.
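If it helps, a minimal timing sketch using CUDA events (my_kernel, the sizes, and the launch configuration are placeholders):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(double *x) { /* placeholder body */ }

int main()
{
    double *d_x;
    cudaMalloc((void **)&d_x, 1024 * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    my_kernel<<<4, 256>>>(d_x);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);      // block until the kernel finishes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```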

Correct.

This is not how the hardware works. The SPs are pipelined, so when the instruction scheduler selects a warp to run, all 32 threads of that warp are sent to the pipelines of the 8 SPs. You should not think of the SPs as being assigned to a thread for some duration of time. Instead, they see a constant flow of instructions from different threads which take many shader clock ticks to complete, with one warp finishing every 4 clock ticks.

The scheduler has to issue instructions for an entire warp at a time, regardless of hardware configuration. (There is a slight exception to this on your device: memory reads and writes are issued in half-warp units, but this was an aberration that does not persist in newer cards.) The main reason for this is to save on transistors so that more chip area can be devoted to floating point units.
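One practical upshot: since instructions are issued per warp, divergence only costs anything when threads within the same 32-thread warp disagree. If you can arrange for a branch condition to be uniform across each warp, both paths never have to be executed. A sketch (hypothetical kernel):

```
// Sketch: a branch that is uniform within each warp causes no divergence,
// because all 32 threads of a warp take the same path.
__global__ void warp_uniform_branch(double *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;     // warp index within the block

    if (warp % 2 == 0) {             // same answer for every thread in a warp
        out[tid] = 1.0;
    } else {
        out[tid] = 2.0;
    }
}
```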

When you launch a kernel, you select the number of blocks and the number of threads per block. The number of warps per block is the number of threads per block divided by 32, rounded up to the nearest integer. You cannot overflow an SM with warps, because the kernel will refuse to launch if your block size exceeds the capability of the device. If you make your blocks small enough, the hardware may decide to run multiple blocks on one SM at the same time, but you have no control over this.
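To spell out the arithmetic, an illustrative launch (my_kernel and d_x are placeholders):

```
// Illustrative launch configuration; warps per block = ceil(threads / 32).
const int N = 10000;
const int threadsPerBlock = 200;    // deliberately not a multiple of 32
const int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // 50
const int warpsPerBlock = (threadsPerBlock + 31) / 32;  // 7; the last warp is partly idle

my_kernel<<<numBlocks, threadsPerBlock>>>(d_x);
```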

OK, thanks for this comprehensive answer, I feel much clearer now.