WinRAR, WinZip or 7-Zip on the GPU, like the topic says.

Nope, just 4 threads. This makes sense, because the x86 compute model has quite a few fixed registers plus the new fat SSE-like math registers. For hyperthreading you have to be able to store those natively (ie not on a stack) and that takes transistors.

Nope, Larrabee isn’t very superscalar, but it DOES have two execution pipes: one is the full x86 stream and can do anything, the other does just the common math and memory ops.

The texture caches are coherent only because they’re read-only! That means you could never have cores get different values anyway. This is also why CUDA only lets you define textures between kernels (there’s no such thing as creating a texture now and then declaring it to be “cacheable” later). It’s really likely the GPU hardware would have no problem with you writing to a texture, it’s just that you’d get undefined results when reading those values unless you’ve never read them before.

To be honest, exposing that “write to texture” would STILL be useful, especially for local-memory-like uses. Yes, other SMs would not be able to read that data reliably, but that’s fine; if it’s used locally only, it doesn’t matter. NV could expose this with a simple, crude __flushcache() kind of opcode, meaning “assume all your texture cache for this SM is dirty, reload things as you re-read them now.” So you might create a small precompute table in global memory, call __flushcache(), then repeatedly use that table in your block. It’d be even better to be able to invalidate just certain ranges of the texture cache, but even an all-or-nothing option is useful.
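To make that concrete, here’s a rough sketch of what that usage pattern might look like. Everything here is hypothetical: __flushcache() does not exist in CUDA, tableTex/expensive_precompute are made-up names, and today reading freshly written global memory through a texture gives undefined results, which is exactly the point of the proposal.

```cpp
// HYPOTHETICAL sketch only: __flushcache() is not a real CUDA intrinsic, and
// reading just-written global memory through a texture is undefined today.
#define TABLE_SIZE 256

texture<float, 1, cudaReadModeElementType> tableTex;   // assumed bound to d_tables by the host

__device__ float expensive_precompute(int i)           // stand-in for real work
{
    return __sinf(i * 0.01f) * __expf(-i * 0.001f);
}

__global__ void blockLocalTable(float *d_tables, float *d_out, int n)
{
    // Each block owns its own TABLE_SIZE slice of global memory.
    int tableBase = blockIdx.x * TABLE_SIZE;

    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        d_tables[tableBase + i] = expensive_precompute(i);
    __syncthreads();              // the whole block has finished writing its slice

    __flushcache();               // HYPOTHETICAL: mark this SM's texture cache stale

    // Now hammer the table through the texture path. Only this block reads its
    // own slice, so cross-SM coherency never matters.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        d_out[gid] = tex1Dfetch(tableTex, tableBase + (threadIdx.x % TABLE_SIZE));
}
```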

Of course what we want is not just more control over the cache, but also to make it BIG… even if that means it’s off-die and higher latency. Because shared memory is so small, it’s awkward to process data bigger than a few KB right now, so even 128 KB per SM would be very comforting headroom.
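For contrast, here’s roughly what you have to do today when the working set is bigger than shared memory: stream it through in tiles. The kernel and the math in it are made up purely to illustrate the reload-and-sync overhead being complained about here.

```cpp
#define TILE 2048   // 2048 floats = 8 KB, half of one SM's 16 KB shared memory

// Made-up kernel: every thread needs to scan a big table that won't fit in
// shared memory, so the table is staged through it one tile at a time.
__global__ void scanBigTable(const float *d_table, int tableLen,
                             float *d_out, int n)
{
    __shared__ float tile[TILE];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int base = 0; base < tableLen; base += TILE) {
        int count = min(TILE, tableLen - base);

        // Reload the next tile from DRAM -- this happens on every pass,
        // which is exactly the traffic a big read/write cache would absorb.
        for (int i = threadIdx.x; i < count; i += blockDim.x)
            tile[i] = d_table[base + i];
        __syncthreads();

        for (int i = 0; i < count; ++i)      // reuse the tile from fast memory
            acc += tile[i];                  // placeholder math
        __syncthreads();                     // don't overwrite while others still read
    }
    if (gid < n) d_out[gid] = acc;
}
```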

I’m dumbfounded. I had been duped into thinking Intel was making a GPU… you know, that completely different architecture which complements a serial processor and will become mainstream in the future. Instead it’s pulling a Cell. A half-assed hybrid between serial and parallel that sucks at both and doesn’t even understand the discussion.

When I first learned about GPU architectures, years ago, before CUDA, I was in love with the idea. Especially the hyperthreading. Instruction latency, memory latency, pipelining, branch prediction, instruction reordering: all the modern problems just poof disappear. With SIMT, all the scheduling and decoding logic disappears too, and you can dedicate silicon to pure unadulterated FLOPs. It was also so cool a couple of years ago when NVIDIA and ATI both realized they could do away with vector code too, since SIMD folds into SIMT.

Decades of Moore’s law-driven crud and hacks melt away, revealing elegance.

You lose some freedom along with that too, of course. Needing thousands of threads and forcing many of them to march in step potentially sounds like a critical deficiency, if you’re an old-school chip designer looking at the drawing board and thinking inside the box. But we’ve all tried coding for warps and blocks here, and it’s not so bad. (There are tricky things like coalescing and bank conflicts and other nuances of the NVIDIA arch, but the core principles of SIMT, the warps and blocks and grids, go over pretty smoothly.) This paradigm is a winner.

I’m astounded. Just in utter disbelief. That after seeing the future, along with the rest of the industry, of the coming of these massively parallel processors (AMD nearly killed itself swallowing ATI!), that they’d take their billions of dollars, their 20x market cap, their incredible resources, and produce this. Wow.

And I bet it won’t even run Crysis.

Yeah, that’s what I was thinking. A writeable texture cache without the extra syntax. (Just a regular pointer and cudaMalloc, or some “cudaMallocCached” variant.) Thing is, coherency between MPs isn’t even important. You mostly can’t coordinate between blocks anyway (__syncblocks() is conceptually impossible), and you can’t write algorithms that share global memory between blocks (even though it’s perfectly “coherent”), except for special cases. (Like in your example: you’d have to compute the table in one kernel, then use it in the next, no matter what. Otherwise, which block initializes the table? How do the other blocks know the table is ready?)
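For what it’s worth, here’s a bare-bones sketch of that two-kernel pattern (the kernel names and the math inside them are made up for illustration). The kernel boundary is the only global barrier we get, and it’s what makes the table safe for every block to read:

```cpp
#include <cuda_runtime.h>

#define TABLE_SIZE 1024
#define N (1 << 20)

// Made-up kernels, just to show the two-launch pattern.
__global__ void buildTable(float *table)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < TABLE_SIZE) table[i] = i * 0.5f;       // stand-in for real precompute
}

__global__ void useTable(const float *table, float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += table[i % TABLE_SIZE];   // every block reads the full table
}

int main()
{
    float *d_table, *d_data;
    cudaMalloc((void**)&d_table, TABLE_SIZE * sizeof(float));
    cudaMalloc((void**)&d_data,  N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    buildTable<<<(TABLE_SIZE + 255) / 256, 256>>>(d_table);
    // The kernel boundary is the global barrier: useTable can't start until
    // buildTable has finished, so the table is complete for every block.
    useTable<<<(N + 255) / 256, 256>>>(d_table, d_data, N);

    cudaThreadSynchronize();   // 2008-era sync call
    cudaFree(d_table); cudaFree(d_data);
    return 0;
}
```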

Now… having off-die memory is unfortunately a no-go. The problem’s not latency; the problem is bandwidth. It can’t be any better than DDR, since you’ve got to use pins and traces and all that nonsense. Speaking of which… the hypothetical writeable non-coherent cache would NOT be bandwidth-limited like the tex cache!

Actually, the hybrid feels pretty well designed, but you’re exactly right: spending transistors on a complex CPU limits your compute power… better to have two simple cores than one clever one. Intel’s done a lot of clever design with the core intercommunication and caches. It’s really nicely done, in theory.

In practice, nobody knows if it will really work well! The exact hardware details aren’t known (perhaps not even by Intel yet!), and it’s really unclear whether the market wants something in that design spot. And there are many markets, from games to servers to supercomputing… which is it aimed at?

If Larrabee isn’t good at games, it won’t get volume, and won’t be cheap to the end user, and will die as a niche product. (Think Itanium. A fine CPU, but the small niche made it expensive which made its niche even smaller.) It’s pretty amazing that NV’s produced a GPU design that does games AND general purpose stuff so well. [In fact G200’s changes were clearly biased towards optimizing CUDA, not gfx!]

Anyway, about Larrabee: read the SIGGRAPH paper; there are lots of interesting small clues in there, dropped as offhand single lines. I’m looking forward to playing with one eventually, but I have my doubts that it can outcompute or outgame an NV board… the KISS-simple GPU power is going to beat the expense of a general CPU (even though a CPU is a lot easier to code for).

http://softwarecommunity.intel.com/UserFil…ee_manycore.pdf

Readability is my main concern with CUDA. And if the upcoming “MCUDA” thing works well on Larrabee too, I guess people will just stick with CUDA :-)

Wow… that’s very interesting. CUDA could run on Larrabee.

The chip’s x86 compatibility is starting to make sense now. Still, it ain’t no GPU. It’s just a better multicore CPU. The pundits have it all wrong.

Yeah, I have a feeling that this will be a really big problem. Intel is marketing it for games, but it’s clear that it’s completely the wrong architecture for rasterization/DirectX. This will cause a huge backlash in public opinion. The architecture actually sounds very good for general multi-threaded x86 code, and it’s automatically compatible with all those apps that have already been written. That’s a great sell. But Intel’s not marketing it in that direction… at all. Of course, the number of these well-threaded apps is smaller than the number of games. But Intel is shooting itself in the foot big time with its BS. Sure… the stock market and everyone else are eager for Intel to attack the GPGPU model that will be so big in the future. But Intel’s not doing that. And its stock will especially plummet when people realize this.

Ahh, silly, silly Intel. It’s so big, it’s too big to change with the times. I bet the idea behind Larrabee is almost a decade old, and when management finally realized there was a new direction to be pursued, they just said “well… this other project we’re working on can sort of do that too…”

Just read through the Larrabee paper. Apparently Windows won’t be able to just run x86 code on Larrabee as if it were a collection of additional CPU cores?

Garbaaaaaaaaageeeee

Just like the Larrabee paper, sometimes you need to really study NV’s docs for one-sentence throwaway lines that matter.

Look at this one in the programming guide, 5.1.2.4.

Wow! This shows that some of my mental assumptions were very wrong. The texture cache is even less like a CPU cache than I thought… it’s not just one-way (read-only), but it also doesn’t handle any queries out of order (which is why there are no latency savings).

So when we say “if only the texture cache could be written to…” it’s really such a different kind of cache it’s not a small change.

This thread should be retitled, we left the original topic a long time ago. :-)

Oh yeah, the tex cache is weird. It doesn’t give you any better bandwidth than DRAM (although you can use it and access DRAM at the same time), and it doesn’t have low latency. I think the only advantage is that it doesn’t need coalesced accesses. (Although I’m not certain… I don’t use it myself.) You’re right, we probably want something altogether different. Of course, its real point is that it allows filtering, which can be a huge computation if we’re talking about 16x anisotropic trilinear (up to 128 values for a single pixel!).
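A minimal sketch of that “doesn’t need coalesced accesses” point, using the old texture-reference API bound to plain linear memory (buffer names and sizes here are arbitrary, and the buffers are just zero-filled so the sketch runs):

```cpp
#include <cuda_runtime.h>

// Old-style (2008-era) texture reference bound to a plain linear buffer.
texture<float, 1, cudaReadModeElementType> lutTex;

// Scattered gather: idx[] can point anywhere, so plain global loads would be
// badly uncoalesced; tex1Dfetch routes them through the texture cache instead.
__global__ void gatherViaTex(const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(lutTex, idx[i]);
}

int main()
{
    const int lutLen = 1 << 16, n = 1 << 20;
    float *d_lut, *d_out;
    int   *d_idx;
    cudaMalloc((void**)&d_lut, lutLen * sizeof(float));
    cudaMalloc((void**)&d_idx, n * sizeof(int));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemset(d_lut, 0, lutLen * sizeof(float));  // real code fills a real table
    cudaMemset(d_idx, 0, n * sizeof(int));         // and real (in-range) indices

    // Bind the texture reference to the linear buffer, then launch.
    cudaBindTexture(0, lutTex, d_lut, lutLen * sizeof(float));
    gatherViaTex<<<(n + 255) / 256, 256>>>(d_idx, d_out, n);
    cudaThreadSynchronize();
    cudaUnbindTexture(lutTex);

    cudaFree(d_lut); cudaFree(d_idx); cudaFree(d_out);
    return 0;
}
```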

Speaking of which… I wonder how Larrabee will handle aniso. I remember when the XGI Volari V8 came out and it couldn’t do it at all. Again, because of the difficulty of making an efficient texture cache. Poor XGI…

Definitely.

Yeah, it is weird. It’s kind of like a scratch lottery ticket: sometimes you get a payoff, sometimes you don’t, but there’s something useful in it. You know the rows-aligned-by-16 pattern with regular global memory? The texture cache is useful when your accesses tend to land near each other in memory, even if they don’t line up into that neat pattern. That’s the abstraction, and it turns out to be useful. It’s hard to express, but locality is different from coalescedness. Coalescing is assembler-flavored; locality is geometry-flavored. Makes for different abstraction styles.

It’s very much a legacy OpenGL optimisation artifact, but it’s useful.

Soooo.
Has anyone tried to make a simple compression routine to see how well it can compress files?

What made me originally ask the question is that Badaboom does the same kind of thing, doesn’t it?

It can take an MPEG-2 file and compress it to an AVI file… it’s all data compression in the end, isn’t it?

zeros and ones

So that is what made me originally ask if there was something like that for RARing or zipping.

To be fair to Larrabee, it would probably suck if they tried this: suddenly you’ve got completely different threading and memory access penalties, plus PCIe latency (or maybe not, if it’s in a CPU socket, but then memory access penalties are even WORSE), and it may or may not support SSE identically… yeah, it would be a nightmare.

(which is to say that Larrabee is not a magic bullet, and you’re still going to have to rewrite your apps and face a lot of similar problems to CUDA)

Okay, call me a sceptic, but I really don’t see CUDA’s future multicore support working on Larrabee, knowing how much Intel and NV hate each other ;)

Are you both being rude on purpose?
The topic is clearly stated.

Please don’t hijack.
Make a new thread if you want.

I, for one, would think it’s pretty cool and would even consider implementing it in my spare time just to see it. (That’s my general opinion of Larrabee–it’s a cool design, but I am very skeptical that it’s really practical.)

g000fy, the subjects of topics are allowed to evolve in this forum. Please do not attempt to stifle that.

goofy, no one has an answer for you. Btw, Badaboom’s video transcoding is very different from zip compression. Video transcoding is inherently very parallel, uses completely different algorithms, and has a lot of people interested. No one seems to care about ordinary compression. Sorry.

(I guess it’s because the big mainstream filesharing formats are either video, which is compressed its own way, or ISOs, which don’t get zipped at all. Of course lots of software uses compression, like file backup or application installers or whatnot. But it’s all fragmented and proprietary and can’t just be hitched onto CUDA. What did you have in mind to use this with?)

Larrabee as a GPU is a failure. It’s simply not a GPU! It’s very similar to what Sun is doing with Niagara. There are basically three different categories of processors: on one end, old-school single-threaded CPUs; on the other, radical GPUs; and in the middle, these multi-core CPUs with light cores. Now, it would make a lot of sense for Intel to go in that middle direction. Your PC would have two big, ubersuperscalar cores for all the single-threaded x86 software that will never die, and a bunch of little ones for the multi-threaded x86 software. These little cores are simply more efficient at executing well-threaded but traditional software. So there would be a great niche for Larrabee or something similar (slim, in-order, mildly hyper-threaded, nothing too radical) to be available to Windows. That would be a good trajectory. That would be an actual reason to use the x86 instruction set.

But Intel didn’t move in that direction at all. I don’t know what Intel is thinking. It’s this gigantic company with unimaginable resources, yet it’s as befuddled as ever.

Oh, I’m very sorry this is not evolving around my topic.
Don’t be rude.

I have asked you not to go off topic in the thread I created.

Respect that, please.

I’m sure the moderator of these forums is watching, and will be watching if I contact him.

You said transcoding is inherently multi-core… for all these years on the CPU we have been doing it on one core.

So the algorithm for it was changed for Badaboom.

And that is exactly what I am talking about with doing data compression on the GPU.

I have asked you politely over and over not to HIJACK this thread…
Please don’t do it.

Your off-topic conversation is very rude to the thread starter, which is me.
Larrabee has absolutely nothing to do with my topic…
My topic is about CUDA and NVIDIA GPUs being able to do data compression similar to Badaboom transcoding.

Unfortunately, tmurray is the closest thing we have to a moderator right now, and he was participating in the discussion on Larrabee.

I’m no expert in file compression myself, so I can’t offer you any great ideas on putting it on the GPU off the top of my head. But I’ll start with a few questions to get you thinking about where some bottlenecks might be.

Correct me if I’m wrong, but aren’t file compression routines basically disk I/O bound? What gain would you hope to see using CUDA? Or are you thinking of speeding compression on data that is already in memory? In this case, you are going to be limited by the PCIe bandwidth: copying the data out to the GPU and back again at ~4 GiB/s. Assuming you can do the compression on the GPU, the PCIe transfers are likely to be a big bottleneck.
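If you want to see where you stand, a quick timing sketch like the one below (the buffer size is arbitrary) measures the PCIe round trip; compare the number it prints against your disk’s sequential read speed and you’ll usually find the disk is the slower of the two.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Quick-and-dirty check of the transfer cost mentioned above: time a round
// trip of 256 MB over PCIe and compare to your disk's read speed.
int main()
{
    const size_t bytes = 256u << 20;              // 256 MB test buffer
    char *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, bytes);        // pinned memory for full PCIe speed
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);  cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // For comparison: a disk reading at ~80 MB/s needs ~3.2 s for the same 256 MB.
    printf("PCIe round trip: %.1f ms (%.2f GB/s effective)\n",
           ms, (2.0 * bytes / (ms / 1000.0)) / 1e9);

    cudaFreeHost(h_buf);  cudaFree(d_buf);
    return 0;
}
```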

Are you compressing many independent streams at once, or just one big stream? Thousands of streams in parallel could be done in CUDA without too much trouble. I believe someone pointed out on the first page that you could split a big stream into many sections and compress them all separately in parallel, which then digressed into texture caches and then finally got onto Larrabee… Still, no one here is going to do all your homework for you. And I haven’t seen any previous threads discussing zip-type compression algorithms in CUDA. Read up on existing parallel compression algorithms and start implementing one in CUDA yourself! When you’ve got specific questions about how best to get a particular data structure into the GPU’s memory, we can certainly help you out. Right now, it’s a bit hard to help, as you have asked a very broad question with thousands of possible answers.
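To give you a flavor of that chunking idea, here’s a toy sketch where each thread run-length encodes its own independent chunk into its own reserved output region. RLE is not what zip/RAR use (those are LZ77 plus Huffman, which is much harder to parallelize because of the sliding dictionary), so treat this only as an illustration of the decomposition, not a competitive compressor.

```cpp
#include <cuda_runtime.h>

#define CHUNK 4096   // each thread run-length encodes its own 4 KB chunk

// Toy illustration of "split the stream into independent sections":
// each thread RLE-encodes one chunk into its own reserved output region.
__global__ void rleChunks(const unsigned char *in, size_t inLen,
                          unsigned char *out, int *outLens)
{
    size_t chunk = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t begin = chunk * CHUNK;
    if (begin >= inLen) return;

    size_t end = begin + CHUNK;
    if (end > inLen) end = inLen;

    // Worst-case RLE doubles the data, so each chunk gets 2*CHUNK output bytes.
    unsigned char *dst = out + chunk * 2 * CHUNK;
    int o = 0;

    size_t i = begin;
    while (i < end) {
        unsigned char v = in[i];
        int run = 1;
        while (i + run < end && in[i + run] == v && run < 255)
            ++run;
        dst[o++] = v;                      // emit (value, run length) pairs
        dst[o++] = (unsigned char)run;
        i += run;
    }
    outLens[chunk] = o;                    // host stitches the chunks together later
}
```

On the host you’d launch one thread per CHUNK-sized piece of the input and then concatenate the per-chunk outputs using outLens. Whether that ever beats a CPU once you add the PCIe copies is exactly the open question from earlier in this post.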