I appreciate that this may be a naive question, but I have so far not been able to find the answer in the
manual or on-line.
I have an Nvidia GTX285 graphics card.
I am trying to assess the efficiency of my code and optimise it, but one problem I have is
conditional branching. In fact, worse than that: conditional looping, where some threads will run more
iterations than others. However, the following occurred to me:
- A warp is 32 threads running in SIMD (or SIMT), and therefore conditional branching is slow as all
threads take all paths.
- On a GTX285, I believe each SM has 8 single precision units and one double precision unit.
- My code is mostly double precision.
So this means that in single precision only 8 threads of the warp can actually be running at once, and in
double precision only one! So what if one group of 8 threads all take the same branch, even if other
threads in the warp take a different branch? Does the SIMD execution only apply to those 8, or will they
still have to execute both branches?
Furthermore, as I’m mostly working in double precision, if the former is the case then surely there is no
SIMD at all, and then the branching should not matter much.
So the basic question is whether the lock-step instructions apply over the full warp regardless of the
actual hardware, or whether they apply only to the threads actually running at once. I’m guessing it’s the
former, as then the logic is independent of the hardware - but it seems a waste to be running lock-step
over 32 threads with one execution unit.
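To make the two possibilities concrete, here is a minimal host-side C++ sketch (not GPU code) that counts how many passes through the execution units a divergent loop would need under each reading. The function name `issue_passes`, the per-lane trip counts, and the cost model itself are all hypothetical illustrations. It does not claim to describe what the hardware actually does; it only shows what is at stake in the question.

```cpp
#include <array>

constexpr int kWarpSize = 32;

// Hypothetical cost model: threads run in lock-step in groups of `group`
// lanes, and each group keeps looping until its slowest lane has finished.
// One loop iteration of a group takes ceil(group / units) passes through
// `units` execution units.
//   group = 32          -> lock-step over the full warp
//   group = units       -> lock-step only over the lanes running at once
// Assumes `group` divides kWarpSize evenly.
inline long issue_passes(const std::array<int, kWarpSize>& trips,
                         int group, int units) {
    const long passes_per_iteration = (group + units - 1) / units;
    long total = 0;
    for (int start = 0; start < kWarpSize; start += group) {
        int slowest = 0;
        for (int lane = start; lane < start + group; ++lane) {
            if (trips[lane] > slowest) slowest = trips[lane];
        }
        total += slowest * passes_per_iteration;
    }
    return total;
}
```

For example, if lanes 0 to 7 need 10 iterations and the other 24 lanes need 2, then with 8 units the full-warp reading gives `issue_passes(trips, 32, 8) == 40` while the group-of-8 reading gives `issue_passes(trips, 8, 8) == 16`. With a single double precision unit the same comparison is `issue_passes(trips, 32, 1) == 320` against `issue_passes(trips, 1, 1) == 128`, which is why the answer matters for divergent double precision code.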
Secondly, I am dealing with quite a few variables for each thread, so I imagine that I don't want to have too many warps (or blocks) assigned to an SM at one time or it will be spilling over. I calculate that at the maximum of 32 warps per SM, that only gives me 16 bytes per thread - 2 doubles! Will it automatically assign as many warps as possible? Is there any way I can control this?
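The arithmetic behind that figure can be sketched as below, in host-side C++. The 16 KB of per-SM storage is an assumption: it is simply the value that reproduces the 16 bytes per thread quoted above, and the real per-SM register or shared memory capacity should be checked against the hardware documentation. The function names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kThreadsPerWarp = 32;
constexpr std::size_t kBytesPerDouble = 8;

// Assumption: 16 KB of per-SM storage, chosen only because it reproduces
// the "16 bytes per thread at 32 warps" figure. Not a verified hardware spec.
constexpr std::size_t kAssumedBytesPerSM = 16 * 1024;

// Bytes available to each thread if the SM's storage is split evenly
// across all resident threads.
constexpr std::size_t bytes_per_thread(std::size_t resident_warps) {
    return kAssumedBytesPerSM / (resident_warps * kThreadsPerWarp);
}

// The same budget expressed as a whole number of doubles.
constexpr std::size_t doubles_per_thread(std::size_t resident_warps) {
    return bytes_per_thread(resident_warps) / kBytesPerDouble;
}
```

Under that assumption, 32 resident warps give `bytes_per_thread(32) == 16`, that is 2 doubles, while dropping to 8 resident warps gives `bytes_per_thread(8) == 64`, that is 8 doubles. This is the trade-off behind wanting to control how many warps are assigned.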
Many thanks for any answers.