Here are some notes (with my interpretation, since details were a little thin) from the GTC keynote live stream today:
The Tesla rumors seem to be correct. The first Kepler release for the Tesla product line (the Tesla K10) looks basically like an enterprise-y GTX 690. The only detail given was that single-precision performance is 3x that of Fermi, which sounds about right for two GK104 chips.
The next Kepler chip will be released in the Tesla K20 and have good (though not quantified) double precision performance and two interesting new features: Hyper-Q and dynamic parallelism. Edit: The Tesla K20 will be out in Q4 2012.
Hyper-Q: The GPU supports multiple “work queues”, which appears to be what is required to support multiple concurrent CUDA contexts on one device. Kepler (not GK104, I assume) will support 32 concurrent work queues, which I assume will translate to 32 concurrent processes. I think this will finally put the watchdog to bed. :) (Edit: This might not be the right interpretation. See my post below.)
“Dynamic Parallelism”: This is their term for the ability for kernels to launch other kernels. I will be curious to see how this is exposed to the software developer, but this is definitely a frequently requested feature.
There were some other interesting, but less CUDA related things:
Full GPU virtualization to allow virtual machines to share a single GPU.
Hardware support for streaming the GPU framebuffer to a remote device.
I missed the first 30 minutes, so those of you who were there should chime in with other details I missed.
Edit: Of course the real question is whether the GPU in the K20 will show up in a GeForce. No information was given on this (obviously), but I suspect the time frame for the K20 means we aren’t going to see any GeForce improvements until close to the end of the year. Hoard your GTX 580s! :)
If you mean the new GPU in the future Tesla K20, then I don’t think it will be compute capability 3.0.
Hint: If you have a registered developer account (on the old or new site), you should log in and check out the CUDA 5.0 toolkit that has been posted. There is some new information in the CUDA Programming Guide about this future architecture.
There was a talk at 4pm about “CUDA 5 and beyond” that presented the new features coming with Kepler GK104 and GK110. Unfortunately, I missed the first 10 minutes of it (how can one be so stupid as to miss something like this?). Nonetheless, what I saw was thrilling:
Dynamic parallelism: that’s the big thing! You’ll be able to launch kernels from kernels, and have them behave the way you would expect them to. For mesh refinement, that’s a killer feature!
GPUDirect to the next level: it becomes what you always expected it to be (and what you might have thought it was already), i.e. a proper P2P GPU memory transfer through RDMA, even when working with clusters.
Hyper-Q: well, maybe it was covered during the part I missed… But that’s just thrilling as well, especially for moderately parallel algorithms, where sharing the GPU between many processes is possible. I have many workloads of that sort ready to exploit this feature.
As little as the gaming and video rendering parts of the keynote thrilled me, I find the HPC part and the CUDA roadmap exciting.
I can’t wait to see the sessions on “inside Kepler” and the “new features in the CUDA programming model”.
Reading through the Kepler Tuning Guide in the CUDA 5.0 documentation, I think I might have misinterpreted the presentation describing Hyper-Q. It sounds like what Hyper-Q fixes is a more subtle problem with multiple CUDA streams in a single process blocking each other in Fermi. The fundamental problem with Fermi (apparently) is that the driver has to serialize work from many queues into a single hardware queue on the device. This limits the power of multiple streams because you are locked into your queue ordering too soon, leading to suboptimal utilization for various combinations of queued work. Hyper-Q exposes multiple hardware queues to the driver, so that CUDA streams in software can map to hardware work queues and defer the scheduling decision as late as possible.
However, nowhere in this documentation does it say anything about multiple processes using the same GPU at the same time. That might still be possible, but it isn’t being advertised.
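To make the queue-ordering problem concrete, here is a toy model in Python. It is only a sketch of my reading of the Tuning Guide, not how the hardware actually schedules work: the `makespan` function, the two execution slots, and the kernel durations are all made up for illustration. Each hardware queue issues strictly in order, and an op can only start once the previous op from the same stream has finished.

```python
from collections import deque

def makespan(hw_queues, slots=2):
    """Toy model: return the time at which all queued work has finished.

    hw_queues: list of hardware queues, each a list of (stream, duration)
    tuples that must be issued in order. An op starts only when it is at the
    head of its queue, the previous op of the same stream has finished, and
    one of `slots` execution slots is free (slots is a made-up number).
    """
    queues = [deque(q) for q in hw_queues]
    stream_free_at = {}  # stream name -> finish time of its last issued op
    running = []         # finish times of ops currently executing
    now = 0
    end = 0
    while any(queues) or running:
        # Retire everything that has finished by the current time.
        running = [t for t in running if t > now]
        # Issue every queue head that is allowed to start right now.
        issued = True
        while issued:
            issued = False
            for q in queues:
                if not q or len(running) >= slots:
                    continue
                stream, duration = q[0]
                if stream_free_at.get(stream, 0) <= now:
                    q.popleft()
                    finish = now + duration
                    stream_free_at[stream] = finish
                    running.append(finish)
                    end = max(end, finish)
                    issued = True
        # Jump to the next completion.
        if running:
            now = min(running)
    return end

# Two streams, two kernels each, 4 time units per kernel (arbitrary numbers).
stream_a = [("A", 4), ("A", 4)]
stream_b = [("B", 4), ("B", 4)]

# Fermi-like: the driver flattens both streams into one hardware queue.
single_queue = makespan([stream_a + stream_b])
# Hyper-Q-like: each stream gets its own hardware queue.
per_stream_queues = makespan([stream_a, stream_b])
```

In the single-queue case the second kernel of stream A sits at the head of the queue waiting for the first one, and stream B cannot get past it even though a slot is idle. With one queue per stream, both streams start immediately.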
In thinking about applications for dynamic parallelism, I’m wondering how efficient the launch mechanism is. I could imagine a tree-traversal method where you launch a single block kernel from the CPU, and that block in turn launches several more single block kernels, and so on recursively.
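To get a feel for why launch efficiency matters here, this small Python sketch counts how many kernel launches such a recursive scheme would generate. The `count_launches` helper and the branching factor and depth are hypothetical numbers picked for illustration. It says nothing about what a device-side launch actually costs, since that has not been published.

```python
def count_launches(branching, depth):
    """Total kernel launches when every kernel launches `branching` children,
    recursing `depth` levels below the root kernel launched from the CPU."""
    if depth == 0:
        return 1  # a leaf kernel launches nothing further
    return 1 + branching * count_launches(branching, depth - 1)

# Hypothetical tree: each block launches 4 child kernels, 5 levels deep.
total = count_launches(4, 5)
from_device = total - 1  # everything except the initial CPU launch
```

With these numbers, only one launch comes from the CPU and the other 1364 come from the device, so the per-launch overhead on the GPU side would dominate the cost of the traversal.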
To expand on Tim’s answer slightly, there is a technology called Proxy in CUDA 5. As I understand it, it is designed for use in MPI programs and it creates one CUDA context for all MPI ranks that share the same GPU.
Sarah Tariq briefly talked about this in her talk “S0351 - Strong Scaling for Molecular Dynamics Applications” (for those of you who want to pull up the video when it’s posted). She presented some benchmarks showing that using Proxy improved NAMD performance significantly.
Is the K20’s DP performance expected to be 1/3 of its SP performance, given that the DP core count is a third of the SP core count? I think a single-GPU K20 should perform similarly to the K10 in the SP department. Then, at one third of the SP performance, you would have about 1.5 TFLOPS of DP performance.
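Spelling out that back-of-the-envelope estimate in Python: the 4.5 TFLOPS SP figure below is not an announced number, it is just what the 1.5 TFLOPS guess implies when multiplied back by three, and the 3:1 core ratio is the assumption from the question. The `estimate_dp_tflops` helper is hypothetical.

```python
def estimate_dp_tflops(sp_tflops, sp_cores_per_dp_core=3):
    """Scale SP throughput by the ratio of DP cores to SP cores.

    Assumes DP throughput is limited only by core count, which is a guess.
    """
    return sp_tflops / sp_cores_per_dp_core

# Assumed K10-like SP throughput (implied by the estimate, not announced).
assumed_sp_tflops = 4.5
dp_tflops = estimate_dp_tflops(assumed_sp_tflops)
```

This only checks that the arithmetic is self-consistent. Whether the K20 really matches the K10 in SP, or really has a 3:1 core ratio, is still speculation.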