Blocks and Warps

Hey guys

My question is about blocks and warps. I managed to understand that each SM, under G8 for example, has room for up to 8 blocks.
When a block is executed, it is basically split into warps, each containing at most 32 threads.

According to all the CUDA documentation, all threads within a given warp perform the same instruction.
My question is: how exactly is a warp built, and how can we be sure that each thread will execute the same instruction before we actually get into “execution” mode?

if ( C )

Edit: my question is also related to branch divergence.

Thanks, Igal.

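Warps are created from threads in “thread ID” order, which is explained in Section 2.2 of the CUDA Programming Guide: threads with consecutive IDs are grouped 32 at a time, starting with thread 0. You do not need to worry about correct handling of branch instructions because the compiler and the hardware take care of that for you. Divergence within a warp affects performance, not correctness.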
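To make the “thread ID” ordering concrete, here is a minimal host-side sketch in plain C++ (not device code). It assumes the linear thread ID formula from the Programming Guide (x varies fastest, then y, then z) and a warp size of 32. `Dim3`, `linearThreadId`, `warpIndex`, `laneIndex`, and `warpsPerBlock` are made-up names for illustration, not part of the CUDA API.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3/uint3; illustration only.
struct Dim3 { unsigned x, y, z; };

constexpr unsigned kWarpSize = 32;  // assumed warp size

// Linear thread ID inside a block: x varies fastest, then y, then z.
unsigned linearThreadId(Dim3 idx, Dim3 dim) {
    assert(idx.x < dim.x && idx.y < dim.y && idx.z < dim.z);
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}

// Consecutive thread IDs are grouped 32 at a time into warps.
unsigned warpIndex(unsigned tid) { return tid / kWarpSize; }

// Position of the thread inside its warp (0..31).
unsigned laneIndex(unsigned tid) { return tid % kWarpSize; }

// Number of warps needed for a block; the last warp may be partly empty.
unsigned warpsPerBlock(Dim3 dim) {
    unsigned total = dim.x * dim.y * dim.z;
    return (total + kWarpSize - 1) / kWarpSize;
}
```

For a 16x16 block, for example, rows y = 0 and y = 1 together form warp 0, and the thread at (0, 2, 0) has linear ID 32, so it is lane 0 of warp 1. A block of 100 threads needs 4 warps, with the last one only partly filled.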

The first thing you need to understand is how warps and branch divergence work at the level of the GPU. You should read the following:

Coon, B. W. and J. E. Lindholm (2008). United States Patent #7,353,369: System and Method for Managing Divergent Threads in a SIMD Architecture (Assignee NVIDIA Corp.), April 2008, U.S.P.T.O.

Fung, W. W. L., I. Sham, et al. (2007). Dynamic warp formation and scheduling for efficient GPU control flow, IEEE Computer Society.

In particular, the explanation of the example associated with Figures 6A, 6B, and 6C in the patent is very helpful in understanding the basic automaton in an SM.
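As a much-simplified picture of why divergence matters, here is a host-side C++ sketch. It does not reproduce the mechanism in the patent; it only assumes that a warp tracks which lanes are active, and that when lanes disagree on a branch condition, the two sides of an if/else are run one after the other with the non-participating lanes masked off. `predicateMask` and `serializedPasses` are hypothetical names, not CUDA API calls.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kWarpSize = 32;  // assumed warp size

// Build a 32-bit mask for one warp: bit i is set when the thread in
// lane i (thread ID = warp * 32 + lane) satisfies the branch condition.
template <typename Pred>
std::uint32_t predicateMask(unsigned warp, Pred cond) {
    std::uint32_t mask = 0;
    for (unsigned lane = 0; lane < kWarpSize; ++lane) {
        unsigned tid = warp * kWarpSize + lane;
        if (cond(tid)) mask |= (std::uint32_t{1} << lane);
    }
    return mask;
}

// Simplified cost model for an if/else: each side that has at least one
// active lane costs one pass over the warp. Returns 1 when all active
// lanes agree, 2 when the warp diverges.
unsigned serializedPasses(std::uint32_t active, std::uint32_t taken) {
    assert(active != 0);  // at least one lane must be active
    std::uint32_t thenMask = active & taken;
    std::uint32_t elseMask = active & ~taken;
    return (thenMask != 0 ? 1u : 0u) + (elseMask != 0 ? 1u : 0u);
}
```

Under this model, a condition like `tid % 2 == 0` splits every warp in half (two passes), while a condition like `(tid / 32) % 2 == 0` is uniform inside each warp (one pass), because all 32 threads of a warp share the same value of `tid / 32`.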

Once you understand that, you can then start trying to understand how things work at a higher level, like the CUDA APIs. None of the documents on CUDA from NVIDIA (i.e., CUDA Programming Guide, NVIDIA Compute, PTX: Parallel Thread Execution, Programming Massively Parallel Processors, CUDA by Example) seems to describe how things work at this level; they only left me asking more questions.