I am teaching a course about CUDA and I have sent the following message to the “Teaching and Curriculum Support” forum here on devtalk. Unfortunately I haven’t received any answer yet, as that forum seems to be less active than others. Please forgive me for cross-posting, but I would really appreciate an explanation that I can include in my slides.
Thank you!
I am updating my slides on the NVIDIA architectures and I am adding information about Pascal. I noticed that in Pascal every SM is divided into two processing blocks. Each processing block has 32 SPs, one warp scheduler, and two dispatch units.
Now, I understand the two dispatch units per warp scheduler in Kepler and Maxwell, but not in Pascal. Each processing block has 32 SPs, which matches the warp size. Why do we need two dispatch units? Where is the second instruction executed?
I couldn’t find anything about this in the Pascal whitepaper or anywhere else online. Everyone mentions that there are two dispatch units, but not why.
The architecture you’re describing is the P100 (SM 60) architecture. It’s slightly different for SM 61 architectures (e.g. GTX 1080).
A dispatch unit dispatches an instruction to the resources that are appropriate for processing that instruction.
The 32 SP units are 32-bit floating point ALUs which can perform a 32-bit floating point add, multiply, or multiply-add. There are many other instructions which can be issued (e.g. load/store instructions) which are not handled by these SP units but by other functional units inside the SM.
So an SM 60 SM could issue a single precision floating point instruction and a memory load instruction in the same cycle (from 2 different dispatch units). These instructions would be issued to different functional units in the SM, and would not conflict (from an issue standpoint).
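To make that concrete, here is a minimal CUDA sketch (function and variable names are my own, purely illustrative). The FMA that produces acc and the load of b are independent of each other, so the hardware is free to dispatch them to different functional units in the same cycle; whether it actually does depends on the SASS the compiler emits and on scheduler behavior that is not publicly documented:

```cuda
__global__ void overlap_demo(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 1 < n) {
        float a   = in[i];            // load      -> LD/ST units
        float acc = a * 2.0f + 1.0f;  // FP32 math -> SP units
        float b   = in[i + 1];        // independent load; its issue can
                                      // overlap with the math above
        out[i] = acc + b;
    }
}
```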
There are a variety of functional units in an SM, and not all are covered in documentation (e.g. whitepapers) or functional block diagrams.
Note that conceptually, my discussion here is not limited to Pascal architecture. All CUDA GPUs have separate functional units for different types of operations (e.g. LD/ST vs. compute) and some CUDA GPUs since the Fermi generation have been capable of dual-issue.
“So an SM 60 SM could issue a single precision floating point instruction and a memory load instruction in the same cycle (from 2 different dispatch units). These instructions would be issued to different functional units in the SM, and would not conflict (from an issue standpoint).”
Are these 2 instructions from the same warp or from 2 separate warps?
I mentioned “from 2 different dispatch units” because I wanted to cover the general case of the instructions being dispatched from separate instruction streams. If the instructions are to be dispatched from the same instruction stream, that would be a case of ILP/dual-issue, which has, I believe, a more restrictive set of conditions under which it can happen; one of these (I think) is that both instructions must issue from the same (dual-issue-capable) warp scheduler.
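For what same-stream ILP looks like at the source level, here is another sketch (again with made-up names, and no guarantee the hardware actually pairs the instructions):

```cuda
__global__ void ilp_demo(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i] * 1.5f + 2.0f;  // independent of y
        float y = b[i] * 0.5f + 3.0f;  // no data dependency on x:
                                       // a candidate pair for dual-issue
        out[i] = x + y;                // this one depends on both
    }
}
```

Whether the two FFMAs actually pair up is decided by the compiler’s instruction scheduling and by pairing rules that are largely undocumented.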
I’m at the edge of my knowledge here, so I will refer you to this writeup by Greg:
This shows two warp schedulers, each with its own registers and queue of pending warps. It also shows each warp scheduler has dual dispatch ability, mostly to allow a simultaneous ALU computation and a LD/ST operation.
Other Pascal SMs and all Maxwell SMs are very similar, just twice as wide, with four schedulers instead of the two in GP100. Kepler is different and (surprisingly) much more complex.
Should this be surprising? My assumption (I was not involved in the design) is that the reduced complexity of the Maxwell and Pascal architectures is part of the reason these architectures are more power efficient and/or provide higher perf/mm² of die area than Kepler. Control structures like schedulers can be surprisingly power hungry.
njuffa, SM 2.1/3.0 had 1.5 ALUs per warp, i.e. only one of two warps can issue 2 arithmetic operations in a given cycle. Unfortunately, many programs lack enough instruction-level parallelism (keep in mind that the next dependent instruction can issue only about 10 cycles later!), and moreover at least Kepler has too few register ports to serve 2 ALU operations (AFAIR it has 4 register ports per warp, while SGEMM, for example, requires 3 different registers for each ALU operation; see the sketch below).
So essentially 1/3 of Kepler’s ALU resources were rarely used, helping only to beat AMD in theoretical peak performance (this was in pre-GCN times, when AMD outperformed NVIDIA in the peak performance specs).
So NVIDIA just cut out that 1/3 of the ALUs, and reported that the new SM had 90% of the performance of the old one.
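To put the register-port argument in concrete terms, here is a trivial sketch (my own naming, nothing official). A single-precision FMA compiles to one FFMA instruction that reads three source registers and writes one:

```cuda
// One FMA compiles to a single SASS instruction, roughly:
//   FFMA Rd, Ra, Rb, Rc
// i.e. three register reads per operation. Issuing two of these per cycle
// for one warp would need 6 reads -- more than the 4 ports/warp above.
__device__ float fma3(float a, float b, float c)
{
    return a * b + c;
}
```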
Kepler FMA can also take one operand directly from the constant cache. Together with the 4 register ports you mentioned, this indeed allows two FMA operations to be issued per cycle for the same warp. So peak arithmetic throughput is achievable, but I guess Nvidia had overestimated the practical use of multiplying by a constant in the pre-Maxwell era.
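For illustration, here is a sketch of what that looks like at the source level (names are made up, and the constant-bank notation in the comment is approximate):

```cuda
__constant__ float scale;  // set from the host via cudaMemcpyToSymbol

__global__ void const_fma(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // Can compile to something like FFMA Rd, Ra, c[0x3][0x0], Rb:
        // one operand comes straight from the constant cache, so only two
        // register reads are needed instead of three.
        out[i] = in[i] * scale + 1.0f;
}
```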
According to this diagram, the LD/ST instructions do not go through the L1 cache. Is that correct? My understanding was that LD/ST instructions go through the L1 cache, which is responsible for coalescing the accesses (from all threads in the warp) and then fetches the data from global memory.
As compute capability 1.x devices have demonstrated, you don’t need a cache to coalesce memory accesses. This level of detail is not (publicly) documented by Nvidia. It doesn’t make a significant difference to programming, so I am inclined not to care.
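What matters for the programmer is the access pattern, not which unit performs the coalescing. A minimal sketch (illustrative naming) of a fully coalesced access:

```cuda
__global__ void coalesced_copy(const float *in, float *out, int n)
{
    // Adjacent threads of a warp compute adjacent indices, so the warp's
    // 32 loads/stores fall on consecutive addresses and merge into the
    // minimum number of memory transactions -- with or without an L1 cache.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```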