How many threads run concurrently in a single clock cycle?

I’m new to CUDA and a little confused about SMs, SPs, warps, and threads.

According to the CUDA lectures and guide, an SM has 8 SPs, and each SP corresponds to a single thread. So, intuitively, a single instruction should be executed by 8 SPs; that is, an SM should run 8 threads in a single clock cycle. But the lecture also says that a warp has 32 threads.

My guess is that an instruction needs 4 clocks to finish and an SM has 8 SPs, so a warp has 4 × 8 = 32 threads for a single instruction. In any single clock, 8 threads are running and the other 24 are buffered; on each clock, the 8 SPs execute one quarter of the warp's instruction.

This is how I picture the relationship between the SPs, the threads, and the warp:

| 8 threads corresponding to 8 SPs | | a clock tick |

| 8 threads corresponding to 8 SPs | | a clock tick |

| 8 threads corresponding to 8 SPs | | a clock tick |

| 8 threads corresponding to 8 SPs | | a clock tick |
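The arithmetic behind this picture can be sketched as below. It is only a toy model of the description above (8 SPs per SM, 32 threads per warp), not a statement about how the hardware actually schedules work. The function name and the assumption that consecutive thread indices share a clock tick are made up for illustration.

```python
# Toy model: one warp instruction issued across 8 SPs, 8 threads per clock tick.
# SPS_PER_SM and WARP_SIZE are the lecture figures quoted in this post.
SPS_PER_SM = 8
WARP_SIZE = 32

def threads_per_tick(warp_size=WARP_SIZE, sps=SPS_PER_SM):
    """Return, for each clock tick, the thread indices occupying the SPs.

    Assumption: threads are grouped onto the SPs in index order.
    """
    return [list(range(start, min(start + sps, warp_size)))
            for start in range(0, warp_size, sps)]

schedule = threads_per_tick()
for tick, threads in enumerate(schedule):
    print(f"tick {tick}: threads {threads[0]}-{threads[-1]}")
```

Under these assumptions the warp takes 32 / 8 = 4 ticks, which matches the four rows above.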

Is my understanding correct?

Going further, the lecture said that an SM can execute up to 768 threads. Why is that?

Is there a buffer or some other limited resource that caps it at 768 threads, or is it something else?




Probably a limitation of the scheduling hardware. Note that compute capability 1.3 devices can actually handle 1024 threads per SM; 768 is the limit for older hardware.
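Expressed in warps, those limits work out as follows. This is a small sketch that assumes 32 threads per warp, as stated earlier in the thread; the dictionary and its labels are only for illustration.

```python
# Convert the per-SM thread limits mentioned above into warp counts.
WARP_SIZE = 32  # threads per warp, as given in the question

max_threads_per_sm = {
    "older hardware": 768,
    "compute 1.3": 1024,
}

max_warps_per_sm = {
    name: threads // WARP_SIZE
    for name, threads in max_threads_per_sm.items()
}

for name, warps in max_warps_per_sm.items():
    print(f"{name}: {max_threads_per_sm[name]} threads = {warps} warps")
```

So the limit is about how many warps an SM can keep resident at once, which is far more than the single warp it issues at any given moment.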

Thanks. Can you point me to some references that explain this clearly?

Sure. This is copied and pasted from the FAQ.

Where can I find more information on NVIDIA GPU architecture?

  • J. Nickolls et al., “Scalable Parallel Programming with CUDA,” ACM Queue, vol. 6, no. 2, Mar./Apr. 2008, pp. 40-53

  • E. Lindholm et al., “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro, vol. 28, no. 2, Mar./Apr. 2008, pp. 39-55