Any need to revise the principle "Threads in a half-warp are SIMT synchronous"?

Vectorizer · July 30, 2013, 3:26pm

This is what prompted me to ask this question:
For CC 2.1, each MP has 48 cores, and a warp scheduler can issue an instruction to only half of the CUDA cores. To execute an instruction for all threads of a warp, a warp scheduler must therefore issue the instruction over two clock cycles…
Given this, are the low 24 and high 8 threads SIMT synchronous, or is it still 16 and 16?

seibert · July 30, 2013, 3:50pm

I understood the behavior of CC 2.1 differently. On compute capability 2.x, a warp of 32 threads is issued to a group of 16 CUDA cores, which has a 10-20 stage pipeline (I was never quite sure how long it was) that finishes a warp every 2 shader clock cycles.

The SM has two warp dispatchers, so on compute capability 2.0, you get full utilization of all 32 CUDA cores if there are always two available warps. On compute capability 2.1, full utilization requires two available warps and two independent instructions in one of those warps. Then 3 warp instructions can be issued at once, using all 48 CUDA cores. That can’t always happen, so sometimes the extra 16 CUDA cores sit idle.
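To make the arithmetic above concrete, here is a toy model in Python. It is only a sketch of the description in this post, not documented hardware behavior: the names `busy_cores` and `cycles_per_warp_instruction` are hypothetical, and it assumes 16-core groups, two warps issuing at once, and at most one extra independent instruction on CC 2.1.

```python
# Toy model of the issue arithmetic described in this post.
# Assumptions (not NVIDIA-documented): cores come in groups of 16, at most two
# warps issue at once, and CC 2.1 can add one extra independent instruction.
WARP_SIZE = 32
CORES_PER_GROUP = 16

# Number of 16-core groups per SM, from the core counts quoted in the thread.
CORE_GROUPS = {"2.0": 2, "2.1": 3}  # 32 and 48 CUDA cores


def cycles_per_warp_instruction():
    """A 32-thread warp fed to 16 cores takes two passes (shader clocks)."""
    return WARP_SIZE // CORES_PER_GROUP


def busy_cores(cc, ready_warps, has_independent_pair):
    """Return how many CUDA cores receive work in one issue slot.

    has_independent_pair: True if some ready warp has two instructions
    that do not depend on each other.
    """
    # Two warp dispatchers: at most two warps can issue at once.
    issued = min(ready_warps, 2)
    # Assumption: on CC 2.1 a warp that is issuing may also issue a second,
    # independent instruction, which occupies one more 16-core group.
    if cc == "2.1" and issued >= 1 and has_independent_pair:
        issued += 1
    return min(issued, CORE_GROUPS[cc]) * CORES_PER_GROUP
```

In this model, `busy_cores("2.1", 2, True)` gives 48, while `busy_cores("2.1", 2, False)` gives 32, which is the case where the extra 16 CUDA cores sit idle.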

However, I might have those details slightly wrong, because they are not formally documented by NVIDIA and are instead filtered through gamer hardware review sites that garble the information a bit.