NVIDIA Exascale architecture

NVIDIA’s Bill Dally gave a talk last week at SC10 about how to build supercomputers at the exascale level… what would it take to build an exaflop system? That’s almost two orders of magnitude more powerful than the best supercomputers today. His talk gets into very interesting details, both at the system level and at the low-level GPU chip level. A lot of the facts are well known to everyone (power use dominates almost all design decisions), but there’s lots of interesting thinking about different ways to minimize it. For example, memory caches can be used not with the goal of speeding anything up, but to provide more local memory reads, which use 50 pJ of energy as opposed to an off-chip DRAM read’s 5 nJ, two orders of magnitude more.
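To put those energy numbers in perspective, here is a quick back-of-the-envelope sketch. The 50 pJ and 5 nJ figures are the ones quoted above; the hit rates are made-up illustrations, not from the talk:

```python
# Energy per read, per the figures quoted above.
ON_CHIP_READ_PJ = 50.0    # local cache/SRAM read
DRAM_READ_PJ = 5000.0     # off-chip DRAM read (5 nJ)

def avg_read_energy_pj(hit_rate: float) -> float:
    """Average energy per read for a given (hypothetical) cache hit rate."""
    return hit_rate * ON_CHIP_READ_PJ + (1.0 - hit_rate) * DRAM_READ_PJ

ratio = DRAM_READ_PJ / ON_CHIP_READ_PJ
print(f"DRAM / on-chip energy ratio: {ratio:.0f}x")      # 100x
for hr in (0.5, 0.9, 0.99):
    print(f"hit rate {hr:.2f}: {avg_read_energy_pj(hr):7.1f} pJ/read")
```

Even at a 90% hit rate the average energy (545 pJ/read) is still dominated by the off-chip misses, which is why locality matters so much in these designs.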

There’s also a rough GPU design for an approach to getting a multiple-order-of-magnitude speedup over today’s designs… it’s not specifically about Kepler or Maxwell, but obviously there’s a lot of similarity to the way even Fermi works now. The theoretical GPU design has a goal of 128 SMs with 8 SPs each (closer to GT200’s 8 SPs/SM than GF100’s 32 or 48 SPs/SM). What is interesting in the new (hypothetical!) design is that each SM has four DP FMA engines, not one like current GPUs. (This actually makes sense given the facts from earlier in the presentation showing FP compute is not the bottleneck; it uses a lot less power than data transfer does.) I’m not sure how that would be implemented… new PTX instructions which allow choosing all 16 (!) arguments for the FMAs, some super-clever 4-deep ILP, or the simplest but least versatile vector-SIMD-like approach (which AMD still uses in its GPUs now, and NVIDIA left behind with G80).
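The simplest of those three options, a 4-wide vector FMA, can be sketched as a toy model in plain Python. This is only an illustration of the datapath shape and operand count, not real PTX or an actual NVIDIA design:

```python
# Toy model of a 4-wide SIMD double-precision FMA unit: one issued
# instruction, four lanes, each computing d = a*b + c on its own operands.
def vfma4(a, b, c):
    assert len(a) == len(b) == len(c) == 4
    return [ai * bi + ci for ai, bi, ci in zip(a, b, c)]

# 4 FMAs x 3 source operands = 12 inputs per issued instruction
# (16 registers once the four destinations are counted too).
print(vfma4([1.0, 2.0, 3.0, 4.0],
            [10.0, 10.0, 10.0, 10.0],
            [0.5, 0.5, 0.5, 0.5]))   # [10.5, 20.5, 30.5, 40.5]
```

The ILP and per-FMA-addressing variants would compute the same results but need either dependency tracking or a much wider instruction encoding, which is presumably the trade-off being weighed.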

The presentation slides are here… the second half is the most interesting part.

Thanks for the link. This architecture seems more or less in-line with the Stream Processors that Bill Dally’s team have been designing in the past.

AMD’s instruction set is a 5-way VLIW that can execute 5 completely different instructions in parallel (at least for now; a move to 4-way VLIW is rumored). Actually, slide 39 looks very much like AMD’s architecture, with its hierarchical register file. On AMD, the “main registers” part is split into 4 banks (X, Y, Z, W), and each bank supports 4 register reads, for a total of 16 (32-bit) input operands.

In the AMD case, the “operand registers” part only contains the results of the previous instruction bundle (of the same thread), and is more like an explicit bypass network. Here we probably have a small but real register file, which would explain the relatively low bandwidth of the main register file.

The most interesting part is probably the “L0 I$” block. It suggests that different lanes would be able to execute different instructions…

That and “The Real Challenge is Software” ;)

I think that it is significantly different from a stream processor. The memory organization seems similar, but these are caches, not local memories with DMA engines.

Did Bill mention what he meant by latency processors?

Edit: Also, just so that people have some context for this project: DARPA gave four teams a chance to build a prototype system that could scale up to an exaflop under a specific power budget. There will be prototype designs from the following teams:

  1. Intel

  2. Nvidia

  3. Massachusetts Institute of Technology CSAIL

  4. Sandia National Laboratory

Usually DARPA projects involve multiple companies building prototypes. Eventually one or two will be selected to do a complete design. This presentation describes NVIDIA’s prototype. The last project like this was PCA (I think). It produced RAW (Tilera), TRIPS, Monarch (Raytheon), and IRAM.

Processors optimized for low latency, such as CPUs, rather than for high throughput, such as GPUs.

Does anyone know if this presentation (or the same topic in a different venue) was recorded and placed online? I’m really curious to hear the explanation that goes with these slides…

“This architecture seems more or less in-line with the Stream Processors that Bill Dally’s team have been designing in the past.”

Yes, it seems so, but the main difference is that it isn’t a strict stream model like Cell or Imagine, where the operands have to be explicitly loaded into local memory. Bill himself admitted that that model is too restrictive and will limit programming productivity.

The other big change I heard from Bill’s presentation is that in the future there will only be tens of threads executing in each SM instead of hundreds like now. That seems to be an attempt to improve power efficiency by shrinking the very large register file we have now. But the consequence seems to be that the memory hierarchy now needs to be deeper (more levels) to compensate for the reduced parallelism, which could make performance much harder to predict. For example, today, if I have a kernel with a memory load phase followed by a computation phase, then as long as the computation phase takes longer than the load phase, it’s quite certain the overlap will be good and the ALUs will be kept busy. But with the new architecture, there seems to be a worst case where there aren’t enough threads to keep the ALUs busy when a cache miss happens. An explicit DMA transfer approach would solve this problem, but then we are back to the problem of limited programming productivity.
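The latency-hiding worry can be made concrete with Little’s law: the number of in-flight threads must cover the memory latency times the issue rate. The cycle counts below are invented for illustration, not from the presentation:

```python
# Little's law: concurrency needed = latency x throughput.
def threads_needed(mem_latency_cycles: int, ops_per_cycle: int) -> int:
    """Threads (each with one outstanding load) needed to hide latency."""
    return mem_latency_cycles * ops_per_cycle

# Hypothetical numbers, issuing 1 op/cycle per lane:
print(threads_needed(20, 1))    # cache hit latency: tens of threads suffice
print(threads_needed(400, 1))   # DRAM miss latency: hundreds are needed
```

With only tens of threads per SM, a burst of cache misses leaves the ALUs idle; hundreds of threads (today’s model) or explicit DMA double-buffering would cover it, which is exactly the trade-off described above.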

“It suggests that different lanes would be able to execute different instructions…”
Yeah, that would seem pretty useful for databases & ray tracing, where pure SIMD/data parallelism doesn’t fit very well. I think the key is to exploit a limited form of task parallelism as described here by Andrew Glew.

One thing I discovered is that NVIDIA Maxwell’s projected performance/watt is similar to Imagine’s. According to the Storm-1 brochure, it’s capable of 224 G int8 multiply-adds/s at 11.2 W. Assuming a 64-bit floating-point multiply can be substituted with (52/8)^2 = 42.25 8-bit multiply-adds (I assume adds are free), that’s 0.47 Gflops/watt double precision.

I believe all the Storm processors were built on 130 nm technology. That’s 5 die shrinks away from Maxwell’s 22 nm, so the power saving should be >= 32x (since the capacitance of each transistor is reduced to 1/32), assuming no threshold voltage scaling. Imagine’s scaled efficiency would then be ~15.1 Gflops/watt double precision, very close to NVIDIA’s chart.
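The arithmetic in the two paragraphs above, spelled out (all figures are the ones quoted or assumed there):

```python
# Imagine/Storm-1 figures as quoted above.
int8_macs_per_s = 224e9    # 224 G int8 multiply-adds per second
power_w = 11.2

# Assumed cost of one FP64 multiply in int8 multiply-adds: (52/8)^2,
# i.e. the 52-bit significand split into 8-bit chunks, adds free.
macs_per_fp64 = (52 / 8) ** 2                     # 42.25
gflops_per_w = int8_macs_per_s / macs_per_fp64 / power_w / 1e9
print(f"{gflops_per_w:.2f} Gflops/W FP64")        # ~0.47

# Scaling 130 nm -> 22 nm: 5 die shrinks, assumed 2x energy saving each.
scaled = gflops_per_w * 2 ** 5
print(f"{scaled:.1f} Gflops/W FP64 scaled")       # ~15.1
```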

It seems VLIW and out-of-order execution (on ARM) are making a comeback despite their bad reputations in the past. Did designers find better ways to mitigate their drawbacks?