NVIDIA’s Bill Dally gave a talk last week at SC10 about how to build supercomputers at the Exascale level… what would it take to build an ExaFlop system? That’s more than two orders of magnitude more powerful than the best supercomputers today. His talk gets down into very interesting details both at the system level and the low-level GPU chip level. A lot of the facts are well known to everyone (power use dominates almost all design decisions) but there’s lots of interesting thinking about different ways to minimize it. For example, memory caches can be used not with the goal of speeding anything up, but to provide more local memory reads, which use 50 pJ of energy as opposed to an off-chip DRAM read’s 5 nJ, two orders of magnitude more.
There’s also a rough GPU design sketching how to get multiple orders of magnitude speedup over today’s designs… it’s not specifically about Kepler or Maxwell, but obviously there’s a lot of similarity to the way even Fermi works now. The theoretical GPU design has a goal of 128 SMs with 8 SPs each (closer to GT200’s 8 SP/SM than GF100’s 32 or 48 SP/SM). What is interesting in the new (hypothetical!) design is that each SM has four DP FMA engines, not one like current GPUs. (This actually makes sense given the facts from earlier in the presentation showing that FP compute is not the bottleneck; it uses far less power than data transfer does.) I’m not sure how that would be implemented… new PTX instructions which allow choosing all 16 (!) arguments for the four FMAs, some super-clever 4-deep ILP, or the simplest but least versatile approach, vector SIMD (which AMD still uses in its GPUs now, and which NVIDIA left behind back with G80).
The presentation slides are here… the second half is the most interesting part.