Larrabee description

Yes, you’re correct :) Any enhancements to the hardware limitations (local mem, shared mem, registers, constant… ) are more than welcome :)

I’m not sure I understand. Do you want arrays of registers? As far as I know (I played with it a bit as well) you can define:

__global__ void MyKernel( ... )

{

  float fRegArr[ 10 ];   // per-thread array of 10 floats

  ...

}

That would translate into a local array of registers per thread.

eyal

Right now there’s no real alternative to CUDA.

ATI Stream is not really ready for production development and Larrabee is planned for 2010 Q1.

There’s a good chance that Larrabee will support OpenCL. So unless you rely on CUDA-specific features like zero-copy memory access (and if you do, you have to stay with CUDA anyway), you’ll probably be able to switch from CUDA to OpenCL at that time without much pain.

Only if every index you’re accessing is known at compile time. If it’s not, the array will be allocated in local memory.
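To illustrate the difference, here is a minimal sketch (the kernel names are made up, and the exact outcome depends on the compiler version and optimization level):

// Indices known at compile time: the compiler can fully unroll the loop
// and keep the four elements in registers.
__global__ void StaysInRegisters( float* out )
{
    float acc[ 4 ];
    #pragma unroll
    for( int i = 0; i < 4; ++i )
        acc[ i ] = i * 2.0f;                        // constant indices after unrolling
    out[ threadIdx.x ] = acc[ 0 ] + acc[ 1 ] + acc[ 2 ] + acc[ 3 ];
}

// Index depends on runtime data: registers are not indexable, so the
// array is placed in (slow, uncached) local memory instead.
__global__ void SpillsToLocal( const int* idx, float* out )
{
    float acc[ 4 ];
    #pragma unroll
    for( int i = 0; i < 4; ++i )
        acc[ i ] = i * 2.0f;
    out[ threadIdx.x ] = acc[ idx[ threadIdx.x ] ]; // dynamic index
}

You can check which case you hit by looking at the lmem figure reported with nvcc --ptxas-options=-v.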

You can put an array in constant memory and it is very fast, but when I define an array in a device function, I see a big drop in performance.
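For example, something along these lines (the names are just illustrative):

// Array in constant memory: cached, and broadcast when all threads of a
// warp read the same element. Filled from the host with cudaMemcpyToSymbol().
__constant__ float c_coeffs[ 16 ];

__global__ void UsesConstantArray( const int* idx, float* out )
{
    out[ threadIdx.x ] = c_coeffs[ idx[ blockIdx.x ] ];
}

// The same array declared inside the kernel: with a runtime index it ends
// up in local memory, and this is where I see the slowdown.
__global__ void UsesLocalArray( const int* idx, float* out )
{
    float coeffs[ 16 ];
    for( int i = 0; i < 16; ++i )
        coeffs[ i ] = (float)i;
    out[ threadIdx.x ] = coeffs[ idx[ blockIdx.x ] ];
}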

The drop happens even with small arrays, so there must be something special about arrays that you expect to live in registers.

In the forum, one person says that “big” arrays (20 floats…!!!) will be put in local memory; another says that registers cannot hold arrays at all. I suspect the latter is right.

See this discussion: http://forums.nvidia.com/index.php?showtop…ray+in+register

If someone is totally clear about this, I would be happy to get a confirmation, or to learn that arrays can in fact live in registers.

Thanks.

I am positive about this too.

Registers are not indexable by variables on G80-GT200 architectures.

Indexable registers are a hardware engineer’s nightmare, especially in the SIMT model, where every thread in a warp can potentially request a different address. Dependency tracking gets, hum, intractable…

I don’t know of any modern CPU that supports it.

Some GPUs do, for what I believe are legacy reasons. AMD does, but their initial implementation had bugs (R6xx_R7xx_3D.pdf, section 6.1.4), which is evidence of how difficult it is to do it right.

With regard to the increasingly CUDA-specific comments, it’s worth pointing out that people have suggestions for CUDA because it’s actually out (cf. Larrabee). The array access thing could change. Sylvain: out of curiosity, what do you mean by “dependency tracking”?

I meant how the instruction scheduler detects read-after-write, write-after-read and write-after-write dependencies between instructions.

Consider for example:

MOV R1, g[R0]   // load from global memory

ADD R3, R2, R0

Here the ADD does not depend on the MOV and can be issued while the MOV is still pending.

If you have instead:

MOV R1, g[R0]

ADD R3, R2, R[A0]  // indirect register read

Here the hardware cannot assert that A0 != 1, so the instructions may or may not be dependent.

Of course you can always conservatively assume a dependency exists, and rely on other threads to hide the MOV latency, but you will lose potential performance.

It also breaks most hardware optimizations used in superscalar processors such as bypass networks (see AMD’s bug), register renaming…

A more general solution would be to turn the shared memory into a read-write cache, and map both local and shared memory to this cache.

It need not be coherent with caches of other multiprocessors, saving a lot of headaches.

This can benefit other applications including graphics, and it does not require changing the compiler toolchain (especially PTX).

I’ve just found some time to read through the SIGGRAPH paper on Larrabee. While definitely an interesting concept, I’m having a hard time believing it will actually work great as a GPU. And the performance benchmarks they gave aren’t very optimistic either.

Making F.E.A.R. (a video game) run at a predicted* 60 fps requires 25 “Larrabee units”. A Larrabee unit is equivalent to a single Larrabee core running at 1 GHz; that’s a performance unit they use with the assumption that everything will scale up linearly with clock speed and core count. 25 core-GHz works out to roughly 8 cores @ 3 GHz. So, running a 2005 game at playable framerates requires the basic Larrabee. A GeForce 9800 GX2 reaches twice this speed. A lowly 9800 GTX (a rebranded 8800 GTS) outperforms this basic Larrabee.

They will go up to 32 cores, as we’re told. Assuming performance scales linearly, quadrupling framerates, we’ll get 240 fps from the best Larrabee. That’s exactly what a GTX 295 gets you today.

(all benchmarks for 1600x1200, 4x FSAA)

I predict discrete Larrabee GPUs (PCI-e mounted) will be a failure. Perhaps there’s some market in replacing today’s crappy IGPs with Larrabees that double as CPUs, if they turn out to be as power efficient as we’re told. This makes me worry for its future: Intel committed to either getting deep into the high-performance GPU market or losing the game and suffering Cell’s fate.

As for programming them for HPC, it seems they will be quite similar to Cell in this regard. Even with the coherent cache, one of the extensions Intel introduces to x86 is a set of intrinsics for manual cache management. Whether this is something the programmer will have to deal with to get good performance (as was the case with Cell), I don’t know. The C++ compiler is said to be all nice and friendly, with OpenMP and P-threads and auto-vectorization. Worth noting is that Cell also had such a compiler (CellSS), where parallelism was expressed through short pragmas and memory management was implicit. However, it didn’t produce fast code. And memory coherency with multi-access is basically solved through tons of locks, which means truly random access might not be so pretty after all. Hopefully the performance hits will be less severe than CUDA’s uncoalesced accesses.

A nice thing that makes Larrabee work differently from Cell (or a GPU) is that there are separate, possibly concurrent pipelines for “normal” x86 ops (branching, pointer chasing, scalar math etc.) and vector ops. This, according to Intel, makes dynamic and irregular structures much friendlier than they are on under-the-hood SIMD architectures. On the other hand, since all the computational power comes from 16-word-wide vectors and in-order execution, I estimate pointer chasing will still be something to think through twice before implementing.

  * predicted: in this benchmark they actually tested a single core and assumed performance would scale linearly.

Hi,

Thanks for the detailed answer :). If Larrabee only matches a GTX 295, why is that bad? I guess that unless nVidia’s 300 line is much faster than Larrabee, they will be roughly the same speed then, no? Or do you mean that, currently, we can put 3-4 such GTX 295s in a machine, so basically it will be 4x faster than Larrabee??

BTW, regarding Intel’s compiler, as far as I remember it will be based on a new language, Ct (as opposed to CUDA, which is C), and the syntax/code/ease of use is horrible. Most users will probably find it much, much harder to get used to than CUDA or even OpenCL.

eyal

Well, Intel’s being very FUD-like with its hints and partial reveals and marketing. It really is hard to tell whether they’ve pulled off something amazing, a joke, or something in between until it’s actually in people’s hands. As a guess, I think it will be an “OK” graphics card but too expensive compared to an NV/ATI card (but Intel will find a game or two to “prove” otherwise by a large margin). For GPGPU, Larrabee will likely be subpar to the next-generation G300 and ATI boards in many ways, though L’s unique architecture will make it completely crazy good in some other subfields. It will likely be all over the map, depending on each specific app’s memory use and particular behavior. L’s ring-bus cache will be the biggest differentiator… it does things which are very, very awkward on an NV card… but conversely that cache is a huge waste and even a bottleneck for apps that don’t need it.

As for L’s most popular API, I get the feeling that most developers will want to use OpenCL for it and avoid the raw LRBni Larrabee intrinsics. That’s just a guess on my end, based on the instinctual wincing when new redundant APIs are announced. OpenCL has its own big problem of being a lowest common denominator, but that’s also given developers a nice safe baseline, and it will be awkward for them to crawl out of it to gain any extra power. Look at SSE on the CPU. After 10+ years, it’s STILL not something you start using initially; you stick with plain FPU code and only start working with SSE when you really need to. LRB will have the same problem: why port from OpenCL to L’s intrinsics when it works “good enough” in OpenCL?

What I mean is: if Larrabee can barely compete with today’s GPUs, I don’t see how it will manage to be competitive by the time it’s released in 2010 (hopefully). By then, NV and ATI will have put out a new generation of cards. And knowing we’re comparing the strongest, and thus the most expensive, L here, I don’t expect it will be a great winner when it comes to performance per buck. If the price of the i7 is any indication of how much Ls might cost at release, Intel will have to pull off some really fancy marketing tricks to stay afloat. The strongest i7 costs about twice as much as a GTX 295 today; let’s hope the strongest Larrabee won’t cost twice as much as the then-available G300s (or whatever line of GPUs NV and ATI will ship).

This is getting kind of tangential, but since we mentioned the question of GPUs connected more directly to CPUs:

Anandtech has a preview up of Lynnfield, the stripped-down Core i7 processor appearing in the “near” future. To make it cheaper, they drop the third DDR3 channel, and replace the fast QPI link with a slower link that isn’t suitable for graphics. However, to work around that, they put the PCI-Express 2.0 x16 link directly on the CPU. I’m very curious to see how this will affect host-to-device/device-to-host latency in CUDA. (I’m thinking by analogy to the latency benefits AMD saw when they first moved their memory controllers onto the CPU.) This could be very interesting in combination with zero-copy.

http://www.anandtech.com/cpuchipsets/showd…?i=3570&p=2

Anyway, this is just the first step in bringing the GPU closer to the CPU. Hopefully we’ll see more of this in the coming years…
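(For reference, zero-copy is exposed in CUDA through mapped pinned memory. A rough sketch, with N, blocks, threads and MyKernel as placeholders and error checking omitted:)

// The kernel dereferences host memory directly over PCI-Express,
// with no explicit cudaMemcpy, so link latency matters a lot here.
float *h_data, *d_data;

cudaSetDeviceFlags( cudaDeviceMapHost );               // before the context is created
cudaHostAlloc( (void**)&h_data, N * sizeof(float),
               cudaHostAllocMapped );                  // pinned, mapped host allocation
cudaHostGetDevicePointer( (void**)&d_data, h_data, 0 );

MyKernel<<< blocks, threads >>>( d_data );             // reads/writes host memory in place
cudaThreadSynchronize();

A lower-latency PCI-Express link hanging directly off the CPU should make exactly this kind of access cheaper.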

What Big_Mac says seems quite reasonable. I’ve done some estimates of Larrabee performance (assuming 16 Larrabee cores @ 2 GHz) for tasks in my domain (crypto & password recovery), and it comes out only slightly faster than a GTX 295. But that’s for simple tasks. As the function gets more complex, GPU efficiency degrades, and Larrabee should not exhibit such behaviour, but its performance is still not very impressive to me. All of this, however, is speculation, because there’s no official information about actual clock frequencies and/or number of cores.
It seems that Larrabee may be a good competitor for current GPUs, but it will probably have a hard time competing with GT300…

I still don’t understand what you are comparing to. If you compare Larrabee to a single GTX 295, then I can, basically, buy an i7 now, put 4 GTX 295s in it, and it will run 4 times faster than a single Larrabee machine, no?

Also, according to your calcs (I know these are educated guesses, but nevertheless… :) ), Larrabee II will probably have similar perf to GT400…? Is that a reasonable assumption? That would mean that in 2-3 years they will be equal…

guesses and guesses and guesses … :)

eyal

I compare a hypothetical Larrabee with 16 ‘cores’ (each of which is a P54C CPU with a 512-bit vector unit) running @ 2 GHz to a single GTX 295. Also, you’ll probably be able to install 2 or 4 Larrabees in a single machine as well, so it makes sense to compare at the device level, not on overall system performance, IMHO.

And I have no idea what Larrabee 2 or GT400 will look like :-) What I meant to say is that Larrabee’s performance is not as ground-breaking as Intel tries to tell us.

That’s an interesting post… I guess it is relevant to Larrabee as well…
[url=“http://www.ddj.com/go-parallel/blog/archives/2009/06/david_a_bader_e.html;jsessionid=UECPRBAW0UL2GQSNDLOSKH0CJUNN2JVN”]http://www.ddj.com/go-parallel/blog/archiv...LOSKH0CJUNN2JVN[/url]

read from there down…

DB: The problem we face today is we haven’t given our programmers the tools to understand data movement and locality. Since the 1960s, we teach students that a compute operation, a multiply, an addition, is expensive but all memory accesses come for free. That was true through the mid 1990s, but as memory access started to become expensive, we introduced caches, prefetching, and other techniques to maintain the illusion that memory is fast and close.

We’re still not teaching students at most schools in most programs that the expensive part is the data movement.

DD: Even cache access is painful?

DB: The cache gives the illusion that we have a very large memory running at processor speed. As our processors have gotten faster, memory has gotten faster, but not at the same rate.

As we go to multicore, we have added pressure on the cache. After two to four cores sharing a cache, we find that they start to stomp on each other.

DD: Cache architecture wasn’t really designed for multicore.

DB: Correct, nor was it designed for today, when processor speeds are so much faster than memory speeds. Yet we haven’t taught our students and our programmers to exploit locality, which is needed to get good performance from cache.

And then there are applications which don’t perform well with cache. For instance, if you have a business application which just jumps around in memory …

DD: … doing lookups …

DB: … you end up pulling in a full cache line even though you only need a byte or a word.

DD: So the caches slow things down.

DB: Yeah. I sense that I’m preaching to the choir.

Nothing really performs that well when there’s random memory access. nVidia could improve it, but they probably don’t have that much motivation for their primary target / revenue source (video games). I assume Larrabee will have instructions to skip the cache (it is x86), for those few scenarios. Overall, I like the idea of hardware doing memory management – it definitely makes programming easier.

Also, as far as education, it depends where you go, what class you’re in, and who’s teaching. And for some things, e.g. digital circuit design, arithmetic ops are expensive.

Which, unfortunately, may be Larrabee’s fatal weakness in those cases where you have random access in a large memory space (that can’t be cached).

This is a huge problem since you have tons of compute power that may be idled by latency.

Much of NVIDIA’s hardware design is devoted to keeping massive thread counts in flight to hide these latencies. It works really well… I have never had a latency-limited kernel, since CUDA can have so many warps (up to 32 per multiprocessor!) to choose from if some (or most!) are waiting for memory.

Larrabee, however, doesn’t have this. It has 4 hyperthreads to choose from… not 32 warps. It may be really easy to stall all 4 threads, and if so, the poor Larrabee core just has to idle itself. So memory latency will be a real issue for some apps.

For apps that fit into the cache, it’s likely no problem though.

It’s a big (and interesting) tradeoff… and for some apps it will make a huge difference.

Maybe your app doesn’t require so much bandwidth that it needs memory coalescing (or has so much computation that the memory access is hidden)? Scoreboarding only works up to a point: when all blocks require memory, they will definitely saturate the bandwidth, and if the accesses can’t be coalesced, it could be many times slower, no?
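To make the coalescing point concrete, a minimal sketch (kernel names are made up; when bandwidth-bound on G80/GT200, the strided version can be several times slower):

// Coalesced: consecutive threads read consecutive words, so the accesses
// of a half-warp are merged into a few wide memory transactions.
__global__ void CoalescedCopy( const float* in, float* out )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[ i ] = in[ i ];
}

// Uncoalesced: a large stride scatters the half-warp across many memory
// segments, so each thread ends up costing its own transaction.
__global__ void StridedCopy( const float* in, float* out, int stride )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[ i ] = in[ i * stride ];
}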

I agree the gt200 is faster than the CPU though.

Intel have just announced they have cancelled the first iteration of Larrabee as a commercial product, apparently because of lateness caused by both software and hardware problems, and less than spectacular performance.

More on that…

[url=“http://news.cnet.com/8301-13924_3-10409715-64.html”]http://news.cnet.com/8301-13924_3-10409715-64.html[/url]

(S/A) Intel kills consumer Larrabee, focuses on future variants