768 threads vs warp


I am a little confused about the following points:

1- A Streaming Multiprocessor (SM) has 8 scalar processors (SPs). So how can a warp of 32 threads run simultaneously? Doesn't that mean 4 threads run simultaneously on a single scalar processor?

2- What are the 768 threads, then? It is mentioned in several places that each SM can accommodate up to 768 threads. But then what is a warp?

3- If I ask you for the maximum number of threads that can run simultaneously on a single SM, what is your answer: 32 (one warp) or 768?

I think the number 768 is the scheduling queue length of a single multiprocessor (compute capability 1.1), which can accommodate up to 3 thread blocks of 256 threads simultaneously. The scheduling is done in terms of warps, so there are 24 scheduling “slots” available in the hardware (24*32 = 768). I believe compute 1.2 and later devices can handle more than 3 thread blocks simultaneously. I have no idea if they also extended the length of the scheduling queue.

The 32 threads of a warp execute in a quasi-parallel fashion on 8 “cuda processors” (ALUs) which takes 4 clock cycles or more (depending on the type of instruction that is being executed).
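
The arithmetic behind those two figures can be sketched in a few lines of host-side C++. This is only an illustration: the constants are the numbers quoted in this thread, hard-coded rather than queried from a device, and the function names are made up for the example. It also assumes each ALU handles one thread per clock cycle, which gives the 4-cycle minimum.

```cpp
#include <cassert>

// Figures quoted in this thread for a compute capability 1.1 multiprocessor.
// They are hard-coded for illustration, not read from real hardware.
const int kWarpSize = 32;             // threads per warp
const int kScalarProcessors = 8;      // ALUs ("cuda processors") per SM
const int kMaxResidentThreads = 768;  // threads one SM can hold at once

// Number of warp-sized scheduling slots on one SM (hypothetical helper).
int warp_slots(int resident_threads, int warp_size) {
    assert(warp_size > 0 && resident_threads % warp_size == 0);
    return resident_threads / warp_size;
}

// Minimum clock cycles to push one warp instruction through the ALUs,
// ASSUMING each ALU processes one thread per cycle (hypothetical helper).
int cycles_per_warp_instruction(int warp_size, int scalar_processors) {
    assert(warp_size > 0 && scalar_processors > 0);
    // Round up so a partially filled last pass still costs a full cycle.
    return (warp_size + scalar_processors - 1) / scalar_processors;
}
```

With the figures above, `warp_slots(kMaxResidentThreads, kWarpSize)` gives 24 and `cycles_per_warp_instruction(kWarpSize, kScalarProcessors)` gives 4.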

Feel free to correct me if I’ve misstated some fact.

  1. Basically, it takes 4 cycles for the SM to issue an instruction to its 8 SPs. So, to keep the hardware saturated, each SP applies this instruction to 4 threads in sequence. They don’t physically run simultaneously on the SPs but in a sort of pipeline instead.

  2. 768 threads is 24 warps. Up to 24 (32 in newer hardware) warps can be assigned to an SM at any given moment (given enough resources). Those warps can be from different blocks.

  3. That depends on your definition of “simultaneous”, as weird as that may sound.
    An SM is free to switch to any of the queued warps at any time and does that to hide latencies. From this perspective, there are 768 threads per SM “in-flight”, with allocated registers, ready to be processed. You could say they are being worked on simultaneously.
    During any given 4 cycles, the SM issues an instruction to a single warp. From this perspective, 32 threads are being worked on simultaneously by the 8 SPs in a pipeline.
    Each cycle, 8 results of a single instruction “pop out” of the 8 SPs. From this perspective, 8 threads run simultaneously.

The answer I’d give is 32, but all of the 768 threads are indeed used. Effective throughput is higher with all 24 warps scheduled than with only the one that’s executing, because the SM can switch to another warp whenever the current one has to wait.
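
To put a rough number on that throughput point, here is a toy model in C++. It is a sketch under loud assumptions, not a description of the real scheduler: each warp issues one instruction (4 cycles, as above) and then waits for a fixed, made-up number of stall cycles before it can issue again, and the SM switches between warps at no cost. The function name and the stall value are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Toy model: fraction of cycles in which the SPs have a warp to work on.
// resident_warps: warps held on the SM (1 to 24 in this thread).
// issue_cycles:   cycles to issue one warp instruction (4 in this thread).
// stall_cycles:   ASSUMED wait before the same warp can issue again.
double sp_utilization(int resident_warps, int issue_cycles, int stall_cycles) {
    assert(resident_warps > 0 && issue_cycles > 0 && stall_cycles >= 0);
    // Cycles of useful issue the SM can collect during one warp's period.
    double busy = static_cast<double>(resident_warps) * issue_cycles;
    // One warp's period: its own issue time plus its stall.
    double period = static_cast<double>(issue_cycles + stall_cycles);
    // The SPs cannot be more than fully busy.
    return std::min(1.0, busy / period);
}
```

With a made-up stall of 92 cycles, one resident warp keeps the SPs busy only 4/96 of the time (about 4%), while 24 resident warps reach 100% in this model. The real stall length depends on the instruction and on memory traffic, so treat the numbers as an illustration only.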