I'm new to CUDA and a little confused about how SMs, SPs, warps, and threads relate to each other.
According to the CUDA lectures and guide, an SM has 8 SPs, and each SP corresponds to a single thread. Intuitively, then, a single instruction should be executed by 8 SPs, so an SM should run 8 threads per clock. But the lecture also says that a warp has 32 threads.
My guess is that an instruction takes 4 clocks to issue for a whole warp, and an SM has 8 SPs, so a warp has 4 * 8 = 32 threads for a single instruction. On any one clock, 8 threads are running and the other 24 are waiting. On each clock the 8 SPs execute the instruction for one quarter of the warp.
That would be the relationship between the SPs, the threads, and the warp:
| 8 threads, one on each of the 8 SPs | clock tick 1 |
| 8 threads, one on each of the 8 SPs | clock tick 2 |
| 8 threads, one on each of the 8 SPs | clock tick 3 |
| 8 threads, one on each of the 8 SPs | clock tick 4 |
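To make this mental model concrete, here is a minimal sketch in plain host-side C++ (not CUDA device code). The constants are just the numbers from the lecture, and the mapping of thread indices to clock ticks is my own assumption, not something the guide states. The names `clocksPerWarpInstruction` and `threadsIssuedAtClock` are made up for illustration.

```cpp
#include <vector>

// Numbers quoted from the lecture; they are not queried from real hardware.
constexpr int kSpsPerSm = 8;   // scalar processors per SM
constexpr int kWarpSize = 32;  // threads per warp

// Clock ticks needed to push one instruction through the SPs for a whole warp.
constexpr int clocksPerWarpInstruction() {
    return kWarpSize / kSpsPerSm;
}

// ASSUMPTION: on clock tick c (0-based) the SPs handle the c-th group of
// 8 consecutive threads of the warp. The real hardware mapping may differ.
std::vector<int> threadsIssuedAtClock(int clock) {
    std::vector<int> threads;
    for (int sp = 0; sp < kSpsPerSm; ++sp) {
        threads.push_back(clock * kSpsPerSm + sp);
    }
    return threads;
}
```

Under this assumption, `threadsIssuedAtClock(0)` holds threads 0 to 7 and `threadsIssuedAtClock(3)` holds threads 24 to 31, which matches the four rows of the diagram.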
Is my understanding correct?
Going further, the lecture says that an SM can execute up to 768 threads. Why is that?
Is there some buffer or other limited resource that caps an SM at 768 threads, or is it something else?
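For what it's worth, here is the arithmetic I did on that number, again as a plain C++ sketch using only the figures from the lecture. It only shows that 768 is a whole number of warps and that most of those threads cannot be on an SP at any given clock. It does not explain where the limit comes from, which is what I am asking. The function names are hypothetical.

```cpp
// Figures quoted from the lecture; they are not queried from real hardware.
constexpr int kMaxThreadsPerSm = 768;  // stated upper limit of threads per SM
constexpr int kThreadsPerWarp  = 32;   // threads per warp
constexpr int kSpCount         = 8;    // scalar processors per SM

// How many full warps the 768-thread limit corresponds to.
constexpr int maxWarpsPerSm() {
    return kMaxThreadsPerSm / kThreadsPerWarp;
}

// Threads that cannot be on an SP during a given clock tick,
// assuming one thread per SP per clock.
constexpr int threadsNotOnAnSpPerClock() {
    return kMaxThreadsPerSm - kSpCount;
}

// 768 divides evenly by the warp size, so the limit is a whole number of warps.
static_assert(kMaxThreadsPerSm % kThreadsPerWarp == 0,
              "limit is a multiple of the warp size");
```

So the limit works out to 24 warps, with 760 of the 768 threads waiting on any single clock tick.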
Thanks.
Halbert.XIE