Larbee description

Some recent posts about Larbee -…rabee_paper.pdf

First, they are now talking about mid-2010 at the earliest for a 32-core machine. Still, I’d be happy
to hear your thoughts about Larbee vs. GPU+CUDA.
I know the GPU is here and now; nVidia will probably release a new 300 series by the middle of next year and
continue to give us great performance in current and near-future GPUs.
However, I still hear people here and there in the industry/company who favor Larbee or a more “standard”
approach (as opposed to the GPU) like Larbee.
I think the learning curve of Larbee/Ct/OpenCL will be hard for newbies anyhow (maybe even harder than learning

Any thoughts are more than welcome :)


It’s larrabee I think. Also, even though there are “32 cores”, each can execute vector instructions which operate on 16 data elements. I’m going to avoid excessive speculation, but I personally prefer CUDA’s SPMD style versus the SIMD SSE confusion.

However, I think Larrabee’s use of cache versus explicitly shared memory could potentially make it much easier to achieve good performance in practice (even if the 16-operand SSE gets annoying). Of course, if nVidia could add more software to make achieving decent performance possible (and remove weird performance cases like shared bank conflicts, partition camping, which may be possible with hashing), CUDA could be as compelling.
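
To make the bank-conflict complaint concrete, here is a minimal sketch of how a strided shared memory access maps onto banks. It assumes the compute capability 1.x layout (16 banks of 32-bit words, conflicts resolved per half-warp of 16 threads). `conflict_degree` is an illustrative name, not part of any CUDA API, and the same-address broadcast case is not modeled.

```python
from collections import Counter

NUM_BANKS = 16   # assumption: compute capability 1.x shared memory
HALF_WARP = 16   # assumption: conflicts are resolved per half-warp


def conflict_degree(stride_words, threads=HALF_WARP, banks=NUM_BANKS):
    """Worst-case number of threads that hit the same bank when thread t
    reads the 32-bit word at index t * stride_words.
    1 means conflict-free; N means the access is serialized N ways."""
    hits = Counter((t * stride_words) % banks for t in range(threads))
    return max(hits.values())


# Odd strides spread over all banks; power-of-two strides pile up.
for stride in (1, 2, 3, 8, 16, 17):
    print(f"stride {stride:2d}: {conflict_degree(stride)}-way")
```

This is why padding a 16-word row to 17 words is a common trick: the stride becomes odd and every thread lands in its own bank.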

I would imagine the 300 series is going to be awesome, and heading in the right direction (away from the CPU, towards massive parallelism). However, going from rumored specs alone, it isn’t improving the memory bandwidth to number of processors ratio.

As far as architecture goes, I think the major steps will be better synchronization and communication. CAS is theoretically powerful (as in capable of doing a lot), but it isn’t fast; other things like atomicAdd are quite restricted. Having a multi-teraflop processor is exciting, but it’s frustrating when one can’t use it.

The problem with Larrabee is that most of the information about it is under restricted NDA. From what I know, it looks much better than the current generation of GPUs. But it will compete with GT300, which we know almost nothing about…

AFAIR Intel is not really concerned about GPGPU at the moment; they’re busy helping game devs port/adjust their games for Larrabee.

Yeah I think the best the HPC community can hope for in the short term from Intel with Larrabee is a functional OpenCL implementation. I don’t doubt that in time there will be an MKL release which is “optimized” for Larrabee, but right now everything is completely speculative and tied up in NDAs. The reality is that Intel are now about where NVIDIA were at the beginning of 2006. Irrespective of the relative merits of the hardware, it is going to be a while before Larrabee has anything like the HPC ecosystem that CUDA has. Gathering that sort of momentum won’t happen overnight.

That brings up the question of what nVidia can do to compete with Intel.

I also think that in mid 2010 Intel might be in the same place nVidia was 2-3 years ago; however, I guess they are big enough to catch up fast.

The question is whether CUDA+GPU will be available for the next 2-3 years and then be replaced by OpenCL+Larrabee ???

Where does this put us developers and companies?

Those are just doubts, not evil talk, BTW :) I’m greatly impressed by CUDA every time I write a new kernel :) I truly hope this is the future :)


(Apologies for this post veering into discussion of Cell, but I think it is relevant for understanding Larrabee.)

Last week I was in some classes on how to program the Cell, and I have to say that Larrabee looks a lot like the next generation of the Cell processor. Both have many common features:

  • Many in-order processing cores with SIMD instructions operating on wide vector registers
  • Generous ~256 kB local storage per core (different forms)
  • Fast bidirectional ring bus connecting cores to each other, and to a DRAM memory controller

The enhancements of Larrabee make sense if you are willing to spend the extra transistors, and your initial target market is graphics rendering:

  • Convert the local storage of Cell from software managed scratch-space to a real L2 cache, coherent across cores. Simplifies coding a bit.
  • Make the SIMD operands wider, going from 4 single precision floats in Cell to 16.
  • Drop the Cell concept of a supervisory general purpose core (the PPU) now that the rest of the data processing cores are general purpose enough anyway. Spend those transistors on more data processing cores.

As mentioned, it’s hard to compare to CUDA without knowing what Larrabee and GT300 will look like in 2010. Comparing today’s CUDA devices to Cell, I’ve come to appreciate that while the Cell is an impressive chip, it’s not superior to CUDA in a general sense, just different. Cell has a lot of flexibility to support different parallel workloads (like task pipelining), while CUDA does a few things really well:

  • Much higher device memory bandwidth. The GTX 285’s theoretical memory bandwidth of 159 GB/sec is mindblowing when you realize that the internal ring bus on the Cell has a bandwidth of ~200 GB/sec, and the off-chip memory bandwidth is 25.6 GB/sec. For very large streaming workloads, CUDA can’t be beat.

  • More floating point units. The GTX 285 can complete 240 MAD instructions as well as 240 MUL instructions per clock. The Cell’s 8 SPUs (assuming you aren’t trapped on a crippled PS3) can each complete 4 MADs per clock, for a total of only 32 MADs per clock. The Cell clock rate is double the GTX 285, so that’s effectively 64 Cell MADs per GTX clock, still much less than the GTX 285. Not to mention that it doesn’t look like the Cell has anything like the special function unit on the CUDA chips.

  • Price: Yeah, a PS3 is about the same cost as a GTX 285, but the Cell processor is quite handicapped by the environment. 1 SPU is disabled to improve chip yield and another SPU is taken over by a hypervisor to keep you from directly accessing some of the PS3 hardware. Only 256 MB of memory is available, and if you use the PS3 as an accelerator to a normal computer, your interconnect is a gigabit ethernet link. To do better than that, you have to buy a Cell blade or a Cell PCI-Express accelerator card, which start at $6k and go up from there. Even the extra fancy PowerXCell 8i variant of the Cell, which has much improved double precision performance, only slightly beats the double precision performance of the GTX 285 at a massive price premium. This all is still true even if you compare the “enterprisey” Tesla cards with Cell.

  • Simpler programming model. This is a matter of taste, but I think the SIMT programming model of CUDA is pretty ingenious and vastly simplifies the problems of dealing with vector hardware and hiding memory latency (once you stop trying to pretend CUDA is pthreads, of course). In Cell, there appears to be a lot of coding effort put into manually interleaving DMA requests to the off-chip device memory around computations to hide the memory transfers. Massively oversubscribing the compute units with zero-overhead threads seems to be a much more elegant solution for data-parallel problems, not to mention that you don’t have to fuss with vector registers and SIMD intrinsic functions.
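
The per-clock and bandwidth comparison in the first two bullets can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted above (including the rough "Cell clock is double" approximation); the variable names are illustrative.

```python
# Figures as quoted in the bullets above.
GTX285_MADS_PER_CLOCK = 240
CELL_MADS_PER_SPU = 4
CLOCK_RATIO = 2          # Cell clock taken as roughly 2x the GTX 285 clock

GTX285_BW_GBS = 159.0    # theoretical device memory bandwidth
CELL_BW_GBS = 25.6       # off-chip memory bandwidth


def cell_mads_per_gtx_clock(spus):
    """MADs the Cell completes in the time of one GTX 285 clock."""
    return spus * CELL_MADS_PER_SPU * CLOCK_RATIO


full_cell = cell_mads_per_gtx_clock(8)   # all 8 SPUs available
ps3_cell = cell_mads_per_gtx_clock(6)    # PS3: one disabled, one taken by the hypervisor

print("MAD advantage vs full Cell:", GTX285_MADS_PER_CLOCK / full_cell)
print("MAD advantage vs PS3 Cell: ", GTX285_MADS_PER_CLOCK / ps3_cell)
print("Bandwidth advantage:       ", round(GTX285_BW_GBS / CELL_BW_GBS, 1))
```

The MUL units and special function units on the GTX 285 are left out, so this understates the gap if anything.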

To come back on topic to Larrabee, it sounds like Intel is poised to fix the first three of these problems to make a Cell-like architecture competitive with CUDA in data parallel tasks. Certainly Larrabee as described will have hundreds (maybe ~100?) of floating point units, and to compete in the GPU market, it will need to be sub-$500. As I said, the programming model is a matter of taste, and Larrabee seems to be aimed at continuing the Cell model. However, as that paper points out, it would not be hard to map a CUDA model to the Larrabee hardware.

IMHO, the biggest CUDA shortcoming compared to Cell/Larrabee (as long as NVIDIA can keep the memory bandwidth/FLOPS edge into the next generation) is the shared memory/caching situation. The relatively tiny amount of shared memory, and complete lack of caching for device memory (texture cache aside, which is very small and read-only) means CUDA programmers have to think really, really hard about memory access patterns. Cell has 256 kB of something very much like shared memory, whereas Larrabee devotes that space instead to a real L2 cache. I could see either of these things being helpful to allowing CUDA to work more easily with varied data structures besides flat arrays of float/float2/float4.

(Anyway, enough rambling… Unfortunately I happened to be thinking about this issue a lot this weekend while the Cell documentation was percolating into my brain. :) )

Wow, interesting analysis and details (I am surprised by the low ring bandwidth on Cell).

I can absolutely agree with you that the CUDA shared memory model is indeed a limitation in many applications. Even the simplest data compression uses local references to 32K of recent data, and that won’t fit into CUDA shared memory, making such algorithms quite painful. But also that shared memory is explicit, and the compiler and hardware don’t use the CPU/Larrabee method of degrading registers to (fast) local L1 memory and back (and from there to L2 or main memory.) There is no “register count” for a CPU or Larrabee app, nor is there an explicit maximum memory use… it just degrades pretty transparently as your scope expands past the cache hierarchy levels.

But of course the CUDA memory design is a tradeoff. It’s likely much much much simpler on the hardware end, and those saved transistors can go towards even more processing cores. Offer me a GPU similar to a GTX285 but with L1 cache like behavior (and no register limits) and I’ll do a happy dance. But if that’s at the expense of having only 15 SMs instead of 30, then I have a dilemma indeed! We have no idea what NV’s exact tradeoffs were in the G200 design, but the hardware and software guys there have proved they’re awfully smart so I can’t question their choice of memory system tradeoffs.

You’re completely exactly right that CUDA programmers think really really hard about memory access patterns, and that’s a reaction to the CUDA memory design itself. But in some ways that’s just part of the challenge, too, and I find it quite fun to redesign algorithms to fit the CUDA model… it’s sort of like old-school wild days when nobody knew the “right way” to do things so you just have to experiment and think a lot. Even the simple tasks of sorting or FFTs must be completely reconsidered to get best performance, and when successful, NV GPUs just walk all over the CPUs.

With regards to Larrabee, even with Intel’s paper, there’s lots of speculation. The one fact that bugs me, endlessly, is how they’ll deal with latency hiding. They have 4 active threads per core, so when one thread hits a memory stall, the other 3 can run. (This is basically 4-way hyperthreading… P4 and i7 CPUs have 2-way hyperthreading). But only 4 threads? What if they all stall? Then you wait. The equivalent to Larrabee’s 4 threads in CUDA is warps. But an SM can have 32 warps active… giving you such versatile latency hiding that it’s rarely an issue in CUDA. Now Larrabee’s caches will significantly reduce the need for high latency device memory access, but that 4:32 ratio is still pretty dramatic and I bet a lot of applications are going to have issues with that in Larrabee.
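
A toy model helps show why the 4:32 ratio matters and why caches change the picture. It assumes every hardware context (a Larrabee thread or a CUDA warp) alternates a fixed burst of work with a fixed stall; the cycle counts below are hypothetical placeholders, not measurements of either chip.

```python
import math


def utilization(contexts, work_cycles, stall_cycles):
    """Fraction of cycles the core is busy when each context alternates
    work_cycles of issue with stall_cycles of waiting on memory."""
    return min(1.0, contexts * work_cycles / (work_cycles + stall_cycles))


def contexts_needed(work_cycles, stall_cycles):
    """Smallest number of contexts that keeps the core fully busy."""
    return 1 + math.ceil(stall_cycles / work_cycles)


WORK = 20          # hypothetical: cycles of useful issue between stalls
DRAM_STALL = 400   # hypothetical: uncached device memory latency
CACHE_STALL = 60   # hypothetical: average stall when most accesses hit cache

print("contexts needed, uncached:", contexts_needed(WORK, DRAM_STALL))
print("4 contexts, uncached:     ", round(utilization(4, WORK, DRAM_STALL), 2))
print("32 contexts, uncached:    ", utilization(32, WORK, DRAM_STALL))
print("4 contexts, cached:       ", utilization(4, WORK, CACHE_STALL))
```

Under these made-up numbers, four contexts only keep the core busy if the cache brings the average stall down far enough, which is exactly the open question raised above.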


Thanks a lot for the information. One thing I don’t understand (BTW my hardware skills are not that great, to say the least :) ):

how painful is it for nVidia to add more registers or increase the shared memory size?

Is it a matter of money? Hardware limitations? Power/physics issues? What??

What if nVidia built another line of products? Just as the C1060 has 4GB of RAM versus 1GB for the GTX line, what if there were another line with a much higher register count, much more shared memory, and bigger constant memory? (Constant memory, BTW, is something I find not very useful; it’s just too restricted and too small for most of the constant data I have.)

What would such a thing mean? An additional XXX$? I’d take it with both hands… I think :)



It mostly comes down to die size. Successive generations of NVIDIA GPUs have been hovering around the upper limit of what could be considered the largest economically feasible die size and power/thermal envelope, and I doubt that will change. Today, the biggest NVIDIA GPU dies (and the GTX200 is enormous by modern standards) are being fabbed using TSMC’s 55nm process. At roughly 500 square millimetres maximum die area, that really defines the total transistor count. There is no free lunch. If you want more transistors dedicated to stream processor level resources or cache, then you will have to live with fewer stream processors or some other architectural compromise.

AMD have just brought their first GPU fabbed with TSMC’s new 40nm rule to market, and I expect NVIDIA won’t be too far behind, although whether we see die-shrunk compute 1.2 capable designs on 40nm first and the next “big thing” later is anyone’s guess. In either case, the next “big” GPU will probably be about the same die size and have considerably more transistors than the GTX200 does. What their architects choose to do with those additional transistors is pure speculation at this point.

In one respect, Intel and IBM have something of an advantage over NVIDIA, in that NVIDIA are fabless and totally at the whim of their fab partner(s) to deliver process and yield improvements which let them up the transistor count of their designs. On the other hand, being fabless releases a hell of a lot of capital that would otherwise be tied up in fabs and process R&D.

As avidday says, die space is pretty much it. The G200 die is already huge, 576mm^2, while G92 was 230mm^2 and an Intel Core2 Duo is 143 mm^2. And believe me, they used every last millimetre :)

Making huge dies is hard and expensive (fewer chips from a silicon wafer), so there’s little room to make bigger chips. Here’s where the “die real estate” term comes in - you have a fixed amount of space available and you need to put processing units, caches, memory controllers etc. there. So there are trade-offs - do we spend more space on cores or on cache? Here’s a GT200 die - the pink “Texture” boxes are, I believe, L2 texture cache - notice how huge they are. NVidia already spends considerable space on cache.
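
To put a rough number on "fewer chips from a silicon wafer", the sketch below applies a common textbook approximation for gross dies per wafer to the die areas quoted above. The 300 mm wafer diameter is an assumption, and yield, scribe lines, and edge exclusion are all ignored, so treat the results as order-of-magnitude only.

```python
import math


def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """Approximate gross die count: wafer area divided by die area,
    minus a correction for partial dies lost around the edge."""
    gross = math.pi * (wafer_diameter_mm / 2) ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(gross - edge_loss)


# Die areas as quoted in the post above (mm^2).
for name, area in (("G200", 576), ("G92", 230), ("Core2 Duo", 143)):
    print(f"{name:10s} {area:4d} mm^2 -> ~{dies_per_wafer(area)} dies per wafer")
```

Even before defects are considered, the big die yields several times fewer candidates per wafer, and a larger die is also more likely to contain a defect.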

Problem with cache (and our shared memory, generally any super fast kind of memory) is that it takes up a lot of space. Caches are built differently than RAM and are not as easily miniaturized. Consequently, it’s much easier to add RAM to a given piece of hardware, especially since it resides off-chip. Cache must reside on chip or else it would be as limited as RAM in terms of latency.

Now as for Larrabee, I’m a bit sceptical about using x86. It’ll allow current software to run, sure, but is it the most effective ISA for doing graphics?

The point about die area is good: The Cell is a much smaller chip than any GPU. Even with the 90 nm process, the chip area was 235 mm^2, and now with the 65 nm process, it is more like 120 mm^2. In terms of FLOPS per square millimeter, GT200 and Cell are very similar.

This also brings up another issue. I’m excited about Larrabee because by entering the GPU market, Intel is getting on a fast train. PC graphics is a very large and competitive market, which keeps prices low and updates fast. Cell lives in completely different markets. Game consoles, by design, only update every 5 years, and the “accelerator board” market has a handful of devices all priced in the $5k and up range. So despite my interest in Cell, this makes it a non-starter for “small scale supercomputing” of the sort I usually do. Larrabee however is going to need to be fast and cheap to compete with NVIDIA and ATI, so I look forward to being able to afford it. :)

As for the cache question on CUDA, I definitely agree that 256 kB is probably not required given the tradeoff in chip area. But 32 or 64 kB would be useful for some block-based workloads (like compression, as was mentioned). Or perhaps it is better for Larrabee and CUDA to take different tradeoffs, so we can pick whichever card is best for our application. For a few hundred dollars each, I don’t need either device to be the be-all-end-all of parallel computing.

(Re: the question about the Cell interconnect bus. Although it sounds low, I’m told it’s high enough compared to the bandwidth of each SPU that in practice it will never be a bottleneck.)

So how is nVidia going to compete with Intel? What does it mean for CUDA in 2-3 years (presumably Intel may catch up with nVidia by then??)

Will we all move to OpenCL and run on Larrabee? Besides the fact that CUDA is here now and Larrabee is not, it’s a bit of a troubling thought.

I wonder what the official nVidia position on this issue is…

Indeed a very fast train… that’s the way most of us like it, I guess ;)


One thing that just occurred to me about Larrabee.

I think the potential ace that Intel have up their sleeve with Larrabee is the possibility to have the “GPU” hanging directly off the Quickpath interconnect in a NUMA style arrangement, which jettisons all PCI-e bottlenecks and potentially allows much greater coherency between cpu cores and gpu cores. AMD and partners have shown the enormous potential of Hyper Transport when things like shared memory interconnects and infiniband adaptors can sit coherently with the CPU in HPC applications. I don’t see any reason why Intel couldn’t potentially do the same with Larrabee.

Intel have a history of rather jealously guarding their cpu interconnects and providing rather strict interpretations of what third party licensees can and cannot do. This might well be another battle ground with certain third party rivals in the hotly contested chipset and gpu segments…

Oh yeah, this would be great (well, not so much for NVIDIA)! I can only imagine how awesome zero copy would be if you could plug your GTX 285 directly into the QPI/HT link.

I can guarantee that you’ll never find out the answer to that until shortly before new cards hit the stores. :)

I’m not sure what you’re asking–if you’re worried that we’re going to kill CUDA because OpenCL exists and offers some similar functionality, we’re not. There are a lot of things we can do with CUDA to make it better (for end users, application developers, system administrators–everyone, really), so we’re going to continue to improve it. OpenCL is one particular vision of how massively parallel computation should be done, but there is plenty of room for other ideas.

I think he’s asking what NVIDIA is planning to do in future chips to compete with Larrabee, which is why I don’t think he’s getting an answer. :)

If you’re in a position where you can easily generate 32 warps per SM, you’re very lucky. For me, latency is often a problem because I just can’t create enough threads. Even 10 warps per SM requires about 9600 threads on a GTX280. A lot of my algorithms just don’t have that much parallelism.

So an architecture like Larrabee sounds very appealing: 32 cores * 4 threads * 16 way SIMD gives only 2048 threads required to completely saturate the compute units. That assumes, of course, that they’ve built their memory architecture in a way that four threads really is enough to hide the latency. As you point out, that’s a big if. But it pretty clearly indicates that they’re not expecting programmers to generate the massive number of threads needed to hide latency with CUDA.
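
The two thread counts above come from simple multiplication. The sketch below reproduces them, assuming the GTX 280's 30 SMs and 32-thread warps and the 32-core, 4-thread, 16-wide Larrabee configuration discussed in this thread; the function names are illustrative.

```python
WARP_SIZE = 32
GTX280_SMS = 30


def cuda_threads(warps_per_sm, sms=GTX280_SMS, warp_size=WARP_SIZE):
    """Threads that must be in flight to give every SM this many warps."""
    return warps_per_sm * warp_size * sms


def larrabee_lanes(cores=32, threads_per_core=4, simd_width=16):
    """Scalar work items needed to fill every SIMD lane of every
    hardware thread (the speculative configuration from this thread)."""
    return cores * threads_per_core * simd_width


print("GTX280, 10 warps/SM:", cuda_threads(10))
print("GTX280, 32 warps/SM:", cuda_threads(32))
print("Larrabee, all lanes:", larrabee_lanes())
```

So filling a GTX280 to its maximum of 32 warps per SM needs fifteen times as many independent work items as filling the speculated Larrabee.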


That’s exactly right :) nVidia has already said in the past that CUDA will be supported in the coming years.

The future of GPU hardware, however, is far more interesting and vague :)

My company has some difficulty deciding which way to go when we don’t know what to expect in the next 2-3 years.

We have already spent ~1 year on CUDA+GPUs; it wouldn’t have been productive if next year, for example, Larrabee and nVidia show the same performance.

Well, for me as an individual, this past year has been amazing, playing with CUDA and GPUs and seeing the results - and the future !! :)

However I understand the management fears/doubts.

I think nVidia should somehow help those management guys make the right decision (choose GPUs ;) )

My 1 cent…


You speak mainly about the size of the shared memory, but the register file is very small too!
Tell me if I am right, but if you want to work with 200 doubles per thread, for example, you can have only 40 threads (40*200=8000) active
per multiprocessor.
Secondly, it is not possible to define arrays that live in registers, which is very, very uncomfortable…
Arrays have to go in shared memory, but if all the threads access them a lot you will have bank conflicts, so even if
the size were bigger it would not necessarily be a solution.
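
The 40-thread figure can be reproduced with a quick calculation. The sketch assumes a GT200-class multiprocessor with 16384 32-bit registers and that each double occupies two of them. It ignores registers needed for addresses and temporaries and any per-thread register cap imposed by the compiler, so the real limit would be lower.

```python
REGS_PER_SM_GT200 = 16384   # assumption: 32-bit registers per multiprocessor (GT200)
REGS_PER_SM_G80 = 8192      # assumption: earlier G80/G92-class parts
REGS_PER_DOUBLE = 2         # a 64-bit double takes two 32-bit registers


def max_threads(doubles_per_thread, regs_per_sm=REGS_PER_SM_GT200):
    """Upper bound on resident threads per multiprocessor if every thread
    keeps doubles_per_thread doubles in registers and nothing else."""
    return regs_per_sm // (doubles_per_thread * REGS_PER_DOUBLE)


print("GT200:", max_threads(200), "threads")
print("G80:  ", max_threads(200, REGS_PER_SM_G80), "threads")
```

Forty threads is barely more than one warp, far short of the 32 warps a multiprocessor can otherwise keep resident.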

I hope that the new generations will provide a solution to that. It does not seem so difficult, does it?