(Apologies for this post veering into discussion of Cell, but I think it is relevant for understanding Larrabee.)
Last week I was in some classes on how to program the Cell, and I have to say that Larrabee looks a lot like the next generation of the Cell processor. Both share many features:
- Many in-order processing cores with SIMD instructions operating on wide vector registers
- A generous ~256 kB of local storage per core (in different forms)
- Fast bidirectional ring bus connecting cores to each other, and to a DRAM memory controller
The enhancements of Larrabee make sense if you are willing to spend the extra transistors, and your initial target market is graphics rendering:
- Convert Cell's local storage from a software-managed scratch space into a real L2 cache, coherent across cores. This simplifies coding a bit.
- Make the SIMD operands wider, going from 4 single precision floats in Cell to 16.
- Drop the Cell concept of a supervisory general-purpose core (the PPU), now that the data-processing cores are general purpose enough on their own, and spend those transistors on more data-processing cores.
As mentioned, it’s hard to compare to CUDA without knowing what Larrabee and GT300 will look like in 2010. Comparing today’s CUDA devices to Cell, I’ve come to appreciate that while the Cell is an impressive chip, it’s not superior to CUDA in a general sense, just different. Cell has a lot of flexibility to support different parallel workloads (like task pipelining), while CUDA does a few things really well:
- Much higher device memory bandwidth. The GTX 285’s theoretical memory bandwidth of 159 GB/sec is mindblowing when you realize that the internal ring bus on the Cell has a bandwidth of ~200 GB/sec, and the off-chip memory bandwidth is 25.6 GB/sec. For very large streaming workloads, CUDA can’t be beat.
- More floating point units. The GTX 285 can complete 240 MAD instructions plus 240 MUL instructions per clock. The Cell’s 8 SPUs (assuming you aren’t trapped on a crippled PS3) can each complete 4 MADs per clock, for a total of only 32 MADs per clock. The Cell’s clock rate is roughly double the GTX 285’s shader clock, so that’s effectively 64 Cell MADs per GTX clock, still far fewer than the GTX 285’s 240. Not to mention that the Cell doesn’t appear to have anything like the special function units on the CUDA chips.
- Price. Yeah, a PS3 is about the same cost as a GTX 285, but the Cell processor is quite handicapped by that environment. One SPU is disabled to improve chip yield, and another is taken over by a hypervisor to keep you from directly accessing some of the PS3 hardware. Only 256 MB of memory is available, and if you use the PS3 as an accelerator for a normal computer, your interconnect is a gigabit ethernet link. To do better than that, you have to buy a Cell blade or a Cell PCI-Express accelerator card, which start at $6k and go up from there. Even the extra-fancy PowerXCell 8i variant of the Cell, which has much improved double precision performance, only slightly beats the double precision performance of the GTX 285, at a massive price premium. All of this still holds even if you compare the “enterprisey” Tesla cards with Cell.
- Simpler programming model. This is a matter of taste, but I think the SIMT programming model of CUDA is pretty ingenious and vastly simplifies the problems of dealing with vector hardware and hiding memory latency (once you stop trying to pretend CUDA is pthreads, of course). On Cell, a lot of coding effort appears to go into manually interleaving DMA requests to off-chip memory with computation in order to hide the transfer latency. Massively oversubscribing the compute units with zero-overhead threads seems like a much more elegant solution for data-parallel problems, not to mention that you don’t have to fuss with vector registers and SIMD intrinsic functions. (A minimal sketch of what I mean follows below.)
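To make that contrast concrete, here’s a minimal sketch of the SIMT side (my own toy example, not anything from the documentation): a trivial SAXPY-style kernel. Each thread handles one element, and device memory latency gets hidden simply because tens of thousands of threads are in flight, with no DMA lists or double buffering anywhere in sight.

    // Toy SAXPY kernel: y[i] = a*x[i] + y[i]. One element per thread; memory
    // latency is hidden by oversubscription rather than hand-scheduled DMA.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host side: launch far more threads than there are FP units, e.g.
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);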
To come back on topic to Larrabee, it sounds like Intel is poised to close the first three of these gaps and make a Cell-like architecture competitive with CUDA on data-parallel tasks. Certainly Larrabee as described will have a large number of floating point units (a hundred or more?), and to compete in the GPU market, it will need to be sub-$500. As I said, the programming model is a matter of taste, and Larrabee seems to be aimed at continuing the Cell model. However, as that paper points out, it would not be hard to map a CUDA-style model onto the Larrabee hardware (a rough guess at what that mapping might look like is sketched below).
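For what it’s worth, here’s that rough guess (entirely mine, not from the paper): treat each Larrabee core as the home of one CUDA-style block, and have it step its 16-wide vector unit across the block’s threads in warp-sized chunks. In plain scalar C it would look something like this:

    #define SIMD_WIDTH 16   /* Larrabee's vector width in single-precision floats */

    /* Scalar body of a hypothetical CUDA-like kernel, parameterized by thread index. */
    static void kernel_body(int tid, float a, const float *x, float *y)
    {
        y[tid] = a * x[tid] + y[tid];
    }

    /* One core runs one "block": the outer loop strides by the vector width, and
       the inner loop stands in for the 16 lanes that would execute in lockstep. */
    static void run_block_on_core(int block_dim, float a, const float *x, float *y)
    {
        for (int base = 0; base < block_dim; base += SIMD_WIDTH)
            for (int lane = 0; lane < SIMD_WIDTH && base + lane < block_dim; ++lane)
                kernel_body(base + lane, a, x, y);
    }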
IMHO, the biggest CUDA shortcoming compared to Cell/Larrabee (as long as NVIDIA can keep the memory bandwidth/FLOPS edge into the next generation) is the shared memory/caching situation. The relatively tiny amount of shared memory and the complete lack of caching for device memory (aside from the texture cache, which is very small and read-only) mean that CUDA programmers have to think really, really hard about memory access patterns. Cell has 256 kB of something very much like shared memory, whereas Larrabee devotes that space to a real L2 cache instead. I could see either of these helping CUDA-style code work more easily with varied data structures beyond flat arrays of float/float2/float4. (A small example of the kind of access-pattern gymnastics I mean is below.)
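As a concrete example (again, just my own toy code): reversing each 256-element chunk of an array. Reading straight from device memory in descending order risks uncoalesced accesses, so the usual trick is to stage a tile in shared memory first; with a real cache or a larger local store, you’d worry about this sort of thing far less.

    // Reverse each block-sized chunk of d_in into d_out. The tile is staged in
    // shared memory so that both the global load and the global store stay
    // coalesced; the reversal itself happens in fast on-chip memory.
    __global__ void reverse_chunks(const float *d_in, float *d_out)
    {
        __shared__ float tile[256];                   // assumes blockDim.x == 256
        int base = blockIdx.x * blockDim.x;

        tile[threadIdx.x] = d_in[base + threadIdx.x]; // coalesced load
        __syncthreads();

        int r = blockDim.x - 1 - threadIdx.x;         // reversed index within the tile
        d_out[base + threadIdx.x] = tile[r];          // coalesced store
    }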
(Anyway, enough rambling… Unfortunately I happened to be thinking about this issue a lot this weekend while the Cell documentation was percolating into my brain. :) )