GTC news and keynote

Keynote just finished, lots of fun news. From memory:

Tegra 5 (Logan) will be full CUDA with a Kepler-class GPU, due in 2014. The FLOPS figures seemed huge compared to current SoCs.

After Maxwell is Volta with stacked 3D memory. Bandwidth will be approximately 1 TB/sec, with lower power and latency. Volta will have a Denver ARM CPU (no details).

Nvidia Grid is sexy for virtualizing workstations. There is a new GPU for the 16-GPU 4U boxes… Specs not mentioned, but they use DUAL-GPU cards. K10s, or a new chip?

Maxwell has “virtual addressing.” That was not defined.

Thanks for the update! Hope to be there on Wednesday.

In Mark Harris’s session later that day, he explained further what “Unified Virtual Memory” is. I was confused when I saw UVM mentioned in the keynote, because my first thought was: “Wait, we already have UVA. What’s this??”

The ultimate goal of UVM is basically to use page faults in the virtual memory system to detect when a piece of memory is being accessed on the GPU and move those pages to the device, then move them back when the CPU accesses them. The vision is that things like cudaMemcpy become optimizations rather than requirements for data movement between the CPU and GPU. The full implementation of UVM will require the hardware changes in Maxwell, but the plan is to release a “UVM-lite” in the future that works on Kepler.

Mark showed a very nice demo where he took a simple CUDA program and rewrote it using “UVM-lite.” This model requires you to use a special “managed” version of cudaMalloc that tells the CUDA runtime you would like to opt into the UVM system. The memory allocated by the managed version of cudaMalloc is then directly usable on both the host and the device, so you no longer need to keep separate host and device pointers around. The copies to and from the device are handled automatically for these managed pointers. (Edit: Note that this is different from UVA, where memory reads are issued over the PCI-Express bus but the data is not copied to or from global memory on your behalf.)
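To make the before/after concrete, here is a rough sketch of the kind of rewrite Mark demoed. The session didn’t spell out the exact API name, so the managed allocator is written here as a hypothetical cudaMallocManaged-style call; everything else is standard CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1024;

    // Today: separate host and device pointers, explicit copies.
    float *h = (float *)malloc(n * sizeof(float));
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    // "UVM-lite" style: one managed pointer, usable on both sides.
    // (Hypothetical allocator name; the runtime migrates the data for you.)
    float *m;
    cudaMallocManaged(&m, n * sizeof(float));
    for (int i = 0; i < n; i++) m[i] = 1.0f;      // touched on the CPU
    scale<<<(n + 255) / 256, 256>>>(m, 2.0f, n);  // same pointer on the GPU
    cudaDeviceSynchronize();                      // sync before the CPU reads it

    printf("%f %f\n", h[0], m[0]);
    cudaFree(d);
    cudaFree(m);
    free(h);
    return 0;
}
```

Contrast this with the UVA zero-copy path (cudaHostAlloc with cudaHostAllocMapped), where every device access of the mapped host memory travels over PCI-Express rather than being migrated into global memory.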

Also, I think you might have garbled two different announcements. The Tegra processor after Logan will have Project Denver 64-bit ARM cores and a Maxwell GPU. Volta is a separate thing with the stacked DRAM, and I don’t think there was any mention of ARM with that.

Oh, and the other fun announcement was a new ARM+CUDA mini-ITX development board called Kayla. Kayla will come in two forms:

  • A Tegra 3 connected to an on-board, low power Kepler GPU. This GPU will support dynamic parallelism (cc 3.5?), so it has to be something not released yet.

  • A Tegra 3 connected to a PCI-Express slot that you can plug your own GPU into.

Both of these systems are intended to get developers up to speed with running CUDA on ARM systems in preparation for Logan. The demos running Ubuntu were quite impressive.
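Since dynamic parallelism (the cc 3.5 feature mentioned above for the Kayla GPU) may be new to some readers: it lets a kernel launch child kernels itself, without a round trip to the CPU. A minimal sketch, built with something like `nvcc -arch=sm_35 -rdc=true -lcudadevrt`:

```cuda
#include <cstdio>

__global__ void child(int parent_block) {
    printf("child of block %d, thread %d\n", parent_block, threadIdx.x);
}

__global__ void parent() {
    // With dynamic parallelism, device code can launch more work directly.
    if (threadIdx.x == 0)
        child<<<1, 4>>>(blockIdx.x);
}

int main() {
    parent<<<2, 32>>>();
    cudaDeviceSynchronize();  // waits for parents and their children
    return 0;
}
```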

Indeed, I did garble “Maxwell and Denver” together. Maxwell is the new 2014 architecture, of course. Denver is the integration of ARM as the CPU host. “Parker” is a Tegra part in the 2015 timeframe which combines a Maxwell GPU and a Denver ARM CPU. See:

In the ARM+CUDA session today (right after Mark Harris’s talk), Don Becker talked about Kayla quite a bit, and gave a quite convincing promise that CUDA ARM support will be fully first-class with x86, including all NVidia libraries and support tools like debuggers/memcheck, etc. It’s obviously not fully there now (you even have to cross-compile on an x86 host!), but still impressively functional (as the demos show).

His other focus was the power savings and the measurements needed to quantify them. ARM is going to reduce host wattage down to literally just a few watts. Obviously crucial for mobile battery life, but ALSO for HPC power efficiency.

I’ll have to go find the slides for the ARM session when they are posted. I had to bolt to go check out Dr. Anderson’s talk. :)