Stupid (?) questions about Warp vs. Half Warp vs. SM width

I’m currently researching the G80/GT200 architecture in detail. What seems a bit mysterious to me is the warp size. An SM in a G80/GT200 GPU contains 8 SPs, so naturally one would expect the smallest scheduling unit to have a size of 8. I’ve yet to see a good explanation of why the warp size is 32. And then there’s the strange half warp (which seems to be relevant only for load/store coalescing, but who knows). I’ve seen a few attempts at an explanation, but none were convincing, and they sometimes contradicted each other. I’ve read that

  • Instructions are issued at only half the SPs’ clock speed, and thus 2 instructions are fed to the SPs at the same time to get full throughput, leading to a half-warp size of 16. And “to keep the pipelines full” (which is not a very good explanation, to be honest) the same instruction is fed to the SPs twice, leading to a warp size of 32. (Rough arithmetic below.)
    Does the dual-issue capability of the G80/GT200 (via the SFU) play a role in this regard?
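
If I follow that reasoning, the arithmetic would be (my own back-of-the-envelope reading of the claim, not something from the docs):

    2 instructions x 8 SPs per issue              = 16 threads  (the half warp?)
    16 threads x 2 issues of the same instruction = 32 threads  (the warp?)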

Alternatively,

  • The 4x8 instructions of a warp are interleaved (how?) and executed in 4 cycles (why? again, pipelining?); see my arithmetic for this below.
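
Which, if true, would come down to:

    32 threads / 8 SPs = 4 groups of 8 = 4 cycles per warp instruction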

Is anyone able to clear all this up a bit?

I don’t believe anyone can give you a solid, quantitative answer as to why 32 was selected, but there are certainly good conjectures (that you allude to) for why NVIDIA would make the warp size 32 and not 8:

With a warp larger than the number of CUDA cores, the instruction decoder logic can run at the lower core clock rate rather than the higher shader clock rate. The recent Anandtech review of the GTX 580 mentions that NVIDIA used two types of transistors on their chips before GF110: a slower, low-leakage design (good for power consumption), and a faster, high-leakage design (good for speed). A wide warp would allow them to use the slow transistor design for instruction decoding and save power.

Additionally, as a highly pipelined architecture, a CUDA multiprocessor trades latency for throughput. An instruction can take many clocks from start to finish (I’ve seen estimates of > 20 clocks for simple instructions), but pipelining ensures that on compute capability 1.x chips, one simple instruction for an entire warp finishes every 4 clocks. The main implementation tradeoff with pipelining is inter-instruction dependency. Pipelining is a form of instruction-level parallelism, and to be fully effective it requires instructions that do not depend on their immediate predecessors. On a single-threaded architecture, this can be very complicated to ensure, especially in the face of branches. Branch prediction and instruction reordering are such hot items in CPU design because they help ensure that you don’t have to flush the pipeline and lose all that throughput benefit. These features take up valuable chip area that could be spent on more arithmetic units. And the longer the pipeline, the harder you have to work.
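
A rough worked example of what that means for occupancy (taking ~24 clocks as a representative latency, which is in the ballpark the programming guide gives for register read-after-write latency on these parts):

    24 clocks of latency / 4 clocks per warp instruction = 6 warps in flight
    6 warps x 32 threads = 192 threads per multiprocessor to cover arithmetic latency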

Because CUDA is basically SIMD programming wearing a different hat, you have a second source of parallelism to exploit in your pipeline. By construction, threads in a warp can’t depend on each other except for a handful of synchronization instructions. By putting 4 threads into the pipeline for each CUDA core, you buy 4 clocks of guaranteed independence, making your pipeline (from a management perspective) 1/4 as long.
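
To make that concrete, here’s a minimal sketch (a toy SAXPY kernel of my own, not anything from NVIDIA’s docs): each thread touches only its own element, so no thread in a warp depends on any other, and the hardware can push the quarter-warps down the pipeline back to back without any dependency checking:

    // Each thread works on its own element, so threads within a warp are
    // independent by construction; the scheduler never has to check for
    // inter-thread hazards when it feeds 8 threads per clock into the SPs.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];   // reads/writes only thread i's data
    }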

Clearly, NVIDIA decided to get more aggressive about the instruction-level parallelism in Fermi, because now there are 32 CUDA cores per MP, and two warps finish every two clocks. It is interesting to note that they chose to group the CUDA cores into two sets of 16, each running a separate warp instruction. That means they still get some benefit from guaranteed thread independence in the pipeline, which might explain in part why they chose not to run one warp every clock across all 32 CUDA cores at once. (Also note that this organization gives them two instruction decoders that can each run slower, rather than one very fast instruction decoder.)
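
The way I read the issue rates (my arithmetic, not an official figure):

    G80/GT200: 32 threads / 8 SPs    = 4 clocks per warp instruction, 1 scheduler
    Fermi:     32 threads / 16 cores = 2 clocks per warp instruction, 2 schedulers
               => two warp instructions retired every 2 clocks, but still 2 clocks
                  of guaranteed thread independence in each pipeline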

(Interleaving 4 instructions from the same warp would be tremendously complicated and have little benefit, so I doubt that is the scheduling strategy.)
