It’s a custom-designed ARM core (i.e., it uses the ARM instruction set, but is a new implementation) aimed at desktop, server, and supercomputer applications. You can imagine the ARM core(s) on the die playing the role of the PPE in the Cell architecture, with the GPU multiprocessors playing the role of the SPEs.
Warm up your ARM compilers: someone’s going to build a CUDA-cluster in the next 3 years with no x86 inside.
I would view this more as “NVIDIA jumps on the Fusion bandwagon, no x86 license in sight”. Of course, there is no particular reason why an ARM core needs to be less powerful than a low- to mid-range x86 core, and the resulting SoC should be useful in a wide range of applications. A Windows ARM port means that the Windows NT code base has now been ported to every major architecture of the last couple of decades (at least IA-32, x86-64, Alpha, PowerPC, Itanium, and MIPS, that I can remember). Few survived for long, so let’s see whether the ARM port struggles or not. My guess is that it will struggle, for the same reason all the other non-x86 Windows versions did: software incompatibility.
Yeah, I’m very curious to see what’s possible once you remove the latency and bandwidth restrictions of the PCIe bus.
If the ARM supervisory cores are wired into the CUDA L2 cache and GPU memory bus, it would lower the practical threshold for sending a piece of a calculation to the CUDA cores quite a bit. Host pointers and device pointers become the same thing, so there is no cudaMemcpy() required. I also imagine that with the CPU and GPU that close, you could implement some special handshaking allowing for very low latency CUDA launches. Depending on how the L2 caching is handled, recently written data would naturally be handed from CPU to GPU and back without having to hit main memory at all.
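The closest approximation available today is zero-copy mapped pinned memory, where the GPU dereferences a host allocation directly over PCIe; on a shared-die design the same code path would hit common DRAM (or the shared L2) instead. A minimal sketch, assuming a Fermi-class 64-bit system with unified virtual addressing (kernel and buffer names are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;   // GPU reads/writes the host allocation in place
}

int main() {
    // Allow the device to map pinned host allocations into its address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 1024;
    float *h;                   // host pointer
    cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;                   // device view of the same memory
    cudaHostGetDevicePointer((void **)&d, h, 0);
    // With unified virtual addressing, d == h: one pointer, no cudaMemcpy().

    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    printf("h[0] = %.1f\n", h[0]);  // CPU sees the GPU's writes directly
    return 0;
}
```

On a discrete card every access in `scale` crosses the PCIe bus, so this only pays off for small or once-touched data; with the ARM cores on the GPU's own memory bus, that penalty would largely disappear.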
At that point, GPU utilization of an individual kernel doesn’t really matter; pretty much any data-parallel section in the code would be handled more efficiently by the CUDA cores. You can then achieve full utilization by firing off a lot of small kernel launches (even 1 or 2 blocks) in your host code as early as possible and then blocking on their completion just before you need to use the results. Basically, the same as CUDA streams now, but a lot more of them. Hopefully someone is thinking of extensions to CUDA C to make this style of programming easier to read.
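The fire-early/block-late style described above can already be expressed with one stream per small task; the sketch below (kernel and buffer names are hypothetical) launches many tiny one-block kernels up front and only synchronizes when results are needed:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stage(float *buf, int n) {
    // A hypothetical small data-parallel section of the host program.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const int kTasks = 16, n = 256;
    cudaStream_t streams[kTasks];
    float *bufs[kTasks];

    for (int t = 0; t < kTasks; ++t) {
        cudaStreamCreate(&streams[t]);
        cudaMalloc((void **)&bufs[t], n * sizeof(float));
        // Fire off each tiny kernel (a single block here) as early as possible.
        stage<<<1, n, 0, streams[t]>>>(bufs[t], n);
    }

    // ...independent CPU work would go here...

    for (int t = 0; t < kTasks; ++t)
        cudaStreamSynchronize(streams[t]);  // block only when a result is needed
    printf("all %d tasks done\n", kTasks);
    return 0;
}
```

Today the per-launch overhead over PCIe makes one-block kernels a bad deal; the whole point of the on-die ARM cores would be to make launches this small actually worthwhile.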
I know that people in the gaming industry are also extremely happy about this, because they expect to finally have low-latency communication between the CPU and GPU - feasibly having great physics interacting directly with the player, etc. This is important for us working in the HPC segment; we are riding piggyback on the gamers, who are the main market drivers!
You are right that it should now become a good option to hand over even small data-parallel jobs to the GPU. I think the CUDA C framework need not undergo huge changes on the surface, but I guess the middleware APIs will need an overhaul to be adequate.
When can we expect the first such CPU+GPU? Already in Kepler? Maxwell? :D
It’ll be fantastic to have a shared memory for the CPU and GPU, and all the other benefits mentioned above, but since the space allocated for the CPU is much smaller, I’m wondering how powerful the GPU section of such a combined chip would be. Any thoughts on that?
This is one benefit to going with ARM instead of x86. ARM cores tend to be pretty small in area (since more area means more power usage) compared to x86. A little Google-research (which is arguably going to be low accuracy) suggests that the ARM Cortex A9 in the Tegra 2 is way smaller than a CUDA multiprocessor. Assuming a beefed-up ARM with a decent local cache, I would hope you could trade a CUDA multiprocessor for 2 ARM cores on the die. That’s not so bad.
Another possibility would be binary compatibility between ARM cores and CUDA cores - that would make CUDA kernel launches go away. There’d be no need for a separate CUDA binary hidden somewhere in the EXE.
The EXE itself would be made up of serial and parallel sections; serial sections go to the ARM cores and parallel sections go to the CUDA cores automatically.
That would be cool!
It would be nice, but it would also be hard to make happen for licensing reasons. I think ARM gets a cut for each core that licensees (NVIDIA, Qualcomm, TI, etc.) ship implementing their ISA; giving them more cores gives them more money. With a fancy compiler you could do something similar by just compiling everything into some IR (PTX) and then jitting for ARM or the GPU cores on demand.
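The GPU half of that “compile to IR, JIT on demand” scheme already exists via the driver API, which JIT-compiles PTX for whatever device is present. A minimal sketch (the PTX string and kernel name are illustrative; link against the CUDA driver library):

```cuda
#include <cstdio>
#include <cuda.h>

// A trivial PTX module standing in for the IR; in the scheme above, the same
// IR would be jitted for the GPU or translated for the ARM cores as needed.
static const char *ptx =
    ".version 3.0\n"
    ".target sm_20\n"
    ".address_size 64\n"
    ".visible .entry nop_kernel()\n"
    "{\n"
    "    ret;\n"
    "}\n";

int main() {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoadData(&mod, ptx);           // JIT-compiles the PTX for this GPU
    cuModuleGetFunction(&fn, mod, "nop_kernel");
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, 0, NULL, NULL);
    cuCtxSynchronize();
    printf("jitted and launched\n");
    return 0;
}
```

The missing half is an ARM (or x86) backend for the same IR, which is exactly where the licensing question above comes in.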
Here’s an interesting question. Given a compiler that could target both, which would be faster/more efficient, a CUDA multiprocessor or an ARM core? How big would the difference be? Looking ahead 5-10 years, what will happen to this difference?
I really like the idea of adding an ARM core on-die in the next few years because it cuts out a large amount of unnecessary components from the system, namely the ‘discrete’ CPU, an additional memory hierarchy, and the board to board interconnect. Sitting on-die, all the ARM core needs to do is run the OS, system software, and sequential portions of applications.
I think that the next big challenges in GPU computing involve moving all three of these components onto the GPU cores and removing the CPU cores completely. How much of operating systems and systems software can be parallelized efficiently and how many sequential applications can be re-formulated using more parallel algorithms are still very much open questions. There are big performance and publicity gains to be had for whoever is able to successfully do these first.
The JIT idea looks cool. Maybe the IR could be a modified form of PTX, predicated with “single-thread execution” (STX).
So the IR could be [STX + PTX].
The ARM license thing is new to me. But then, I am not a license guru… Thanks for bringing up the point.
If GPUs can share the address space of an application, can they still retain their super-high memory bandwidth (like 140 GB/s)?
To this day, I have no idea why the host PC’s memory bandwidth is about 10x lower than the GPU’s (maybe because of caches, or because of slow peripherals?)
It’s a combination of a few factors: mainly cost (CPU memory is cheaper), memory channels (six on Fermi, three on Sandy Bridge), and the fact that GDDR is optimized for bandwidth whereas DDR is optimized for latency (GDDR5 always transfers 32 bytes per transaction, whereas DDR3 transfers at most 8 bytes per transaction, so at the same clock rate GDDR5 devotes more pin bandwidth to data rather than commands).
Thanks! Probably that’s one reason why the system bus is clocked lower, too (because the memory itself can only deliver data slowly). But what prevents a CPU vendor from embracing GDDR5? Just the cost?
Thinking along those lines, I stumbled on an interesting thought. Maybe the GPU can still retain its piece of global (video) memory - which, however, could be mapped directly into the process’s address space.
The application could then define data that is accessed by both CPU and GPU code seamlessly. CPU and GPU could even synchronize through this memory via “memory barriers” that flush the CPU cache to the GPU’s memory.
“cudaMalloc” can still be used to allocate dynamic GPU memory, just like malloc(). The good point is that the resulting pointer could then be dereferenced by both CPU and GPU code.
cudaMemcpy would become redundant - a plain “memcpy” would do. But cudaMemcpy may still be needed to take care of copying non-cached GPU memory locations.
(GPU memory locations may not support ARM’s streaming load operations, for example.)
GPU can still retain cudaArrays, texture filtering and so on.
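Part of this handshaking can already be sketched today: `__threadfence_system()` acts as the “memory barrier” on the GPU side, ordering writes to mapped memory before a flag that the CPU polls with plain loads (no cudaMemcpy involved). A minimal producer/consumer sketch, assuming a Fermi-class device and zero-copy mapped memory (names are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// GPU side of the handshake: write the data, fence, then raise the flag.
__global__ void produce(volatile float *data, volatile int *flag) {
    data[0] = 42.0f;
    __threadfence_system();  // the "memory barrier": data is visible to the
                             // CPU before the flag is
    *flag = 1;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);
    float *data; int *flag;
    cudaHostAlloc((void **)&data, sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&flag, sizeof(int), cudaHostAllocMapped);
    *flag = 0;

    float *ddata; int *dflag;
    cudaHostGetDevicePointer((void **)&ddata, data, 0);
    cudaHostGetDevicePointer((void **)&dflag, flag, 0);

    produce<<<1, 1>>>(ddata, dflag);       // launch returns immediately
    while (*(volatile int *)flag == 0) {}  // CPU spins on an ordinary load
    printf("data = %.1f\n", *data);

    cudaDeviceSynchronize();
    return 0;
}
```

With the ARM cores on the same memory bus, this kind of flag exchange would go through shared DRAM (or L2) instead of across PCIe, which is exactly the low-latency handshaking discussed above.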
Current ARM cores are low-power, embedded designs.
Project Denver aims at the high-performance CPU market (from Bill Dally’s blog post: “from personal computers to servers and supercomputers”).
From my understanding, this means going up against Intel Sandy Bridge and AMD Bulldozer, or at least providing something more powerful than Atom or Bobcat, which are both already much bigger than a Cortex A9.
For such a big, high-performance, out-of-order CPU, the power and area overhead of x86 instruction decoding becomes swamped in the noise. Whether it is x86 or ARM should not make much difference at that point.
I think there are two types of products possible by “Denver”:
1.) A CPU with an integrated graphics core as direct competition for AMD Fusion/Intel Sandy Bridge (basically a big Tegra 2).
2.) A very flexible GPU used as a classic graphics/compute accelerator board, still controlled by a (x86?) host processor, but able to efficiently process mixed serial+parallel workloads, in contrast to current GPUs.
Option 2 is the more evolutionary step, easily fitting into the traditional PC (gaming/compute) market.
For Option 1, I think the lack of x86 compatibility is a problem, at least initially.
Who is going to port old games to the ARM architecture, and how many people would buy a computer with high-performance graphics that can only run a small fraction of the available games?
It is a great architecture for HPC, of course, but I doubt that market is big enough. Maybe it will be used in a console, like Cell in the PS3?