I’ve just found some time to read through the SIGGRAPH paper on Larrabee. It’s definitely an interesting concept, but I’m having a hard time believing it will actually work well as a GPU. The performance benchmarks they give aren’t very optimistic either.
Making F.E.A.R. (a 2005 video game) run at a predicted* 60 fps requires 25 “Larrabee units”. A Larrabee unit is the performance of a single Larrabee core running at 1 GHz, a metric they use under the assumption that everything scales linearly with clock speed and core count. That works out to roughly 8 cores at 3 GHz. So running a 2005 game at playable framerates requires the basic Larrabee. A GeForce 9800 GX2 reaches twice this speed, and even the lowly 9800 GTX (a rebranded 8800 GTS) outperforms this basic Larrabee.
We’re told they will go up to 32 cores. Assuming performance scales linearly, quadrupling the framerate, the best Larrabee gives us 240 fps. That’s exactly what a GTX 295 gets you today.
(all benchmarks for 1600x1200, 4x FSAA)
I predict discrete Larrabee GPUs (on PCIe cards) will be a failure. Perhaps there’s a market in replacing today’s crappy IGPs with Larrabees that double as CPUs, if they turn out to be as power-efficient as we’re told. This makes me worry about its future: Intel has bet on either getting deep into the high-performance GPU market or losing the game and suffering Cell’s fate.
As for programming them for HPC, they seem quite similar to Cell in this regard. Even with the coherent cache, one of the extensions Intel introduces to x86 is a set of intrinsics for manual cache management. Whether this is something the programmer will have to deal with to get good performance (as they did with Cell), I don’t know. The C++ compiler is said to be all nice and friendly, with OpenMP, pthreads and auto-vectorization. Worth noting: Cell also had such a compiler (CellSs), where parallelism was expressed through short pragmas and memory management was implicit. It didn’t produce fast code, however. And memory coherency under concurrent access is basically solved with lots of locks, which means truly random access might not be so pretty after all. Hopefully the performance hits will be less severe than CUDA’s uncoalesced accesses.
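For a sense of what that friendly path looks like, here’s the kind of loop such a compiler is supposed to handle on its own. This is a generic sketch, not Larrabee code; the cache-management intrinsics Intel mentions would be an extra, explicit layer on top of something like this:

```c
#include <stddef.h>

/* A saxpy-style loop: the best case for the advertised toolchain.
   The OpenMP pragma splits iterations across cores, and the flat,
   unit-stride access pattern is what an auto-vectorizer can pack
   into wide vector ops. Note that no explicit cache management
   appears anywhere -- whether that's enough for good performance
   is exactly the open question. */
void saxpy(float a, const float *x, float *y, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Compiled without OpenMP support, the pragma is simply ignored and the loop runs serially, which is part of the appeal Intel is selling.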
A nice thing that makes Larrabee different from Cell (or a GPU) is that there are separate, possibly concurrent pipelines for “normal” x86 ops (branching, pointer chasing, scalar math, etc.) and for vector ops. According to Intel, this makes dynamic and irregular structures much friendlier than they are on under-the-hood SIMD architectures. On the other hand, since all the computational power comes from 16-word-wide vectors and in-order execution, I expect pointer chasing will still be something to think twice about before implementing.
* predicted: in this benchmark they actually tested a single core and assumed performance would scale linearly.