CUDA optimization vs. CELL - For those who asked...

Hi all,

About a month ago I logged on and posted in a thread asking what people were using CUDA for. I can’t find the thread now, but a bunch of people asked me to post some information about how I went whilst porting one of the brute-force engines across to the G80 platform.

The engine in question performs RSPC followed by DC component analysis of the resulting signal.

Well, for those people, here are the salient details:

Intel Core 2 Quad @ 3.4 GHz (using 1 of 4 cores) = 376.87 million permutations/second.

nVidia GTS-8800 (320 MB) @ 1.2 GHz / 1.6 GHz DDR (using 96 of 96 cores) = 4.66 billion permutations/second.

nVidia GTX-8800 (768 MB) @ 1.35 GHz / 1.8 GHz DDR (using 128 of 128 cores) = 6.65 billion permutations/second.
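
For anyone wondering how a “permutations per second” figure like that is actually measured: I can’t post the engine itself (it’s closed source), but the general shape of such a kernel is roughly the sketch below. test_permutation() is a hypothetical stand-in for the real RSPC/DC-component check, and the rate is simply candidates tested divided by wall-clock time.

// Generic sketch only -- not the actual engine. test_permutation() is a
// hypothetical stand-in for the real per-candidate RSPC/DC-component check.
#include <cstdio>

__device__ bool test_permutation(unsigned long long candidate)
{
    // Placeholder check; the real engine tests encoder output against
    // known disc limitations here.
    return (candidate & 0xFFFFULL) == 0;
}

__global__ void brute_force(unsigned long long start,
                            unsigned long long count,
                            unsigned int *hits)
{
    // Grid-stride loop: each thread walks its own slice of the candidate
    // space, so the launch size is independent of the problem size.
    unsigned long long stride = (unsigned long long)blockDim.x * gridDim.x;
    unsigned int local = 0;
    for (unsigned long long i =
             (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
         i < count; i += stride)
    {
        if (test_permutation(start + i))
            ++local;                           // count in a register, not memory
    }
    hits[blockIdx.x * blockDim.x + threadIdx.x] = local;
}

int main()
{
    const int threads = 256 * 256;                 // 256 blocks of 256 threads
    const unsigned long long count = 1ULL << 30;   // candidates per launch

    unsigned int *d_hits;
    cudaMalloc((void **)&d_hits, threads * sizeof(unsigned int));

    brute_force<<<256, 256>>>(0ULL, count, d_hits);
    cudaDeviceSynchronize();       // permutations/sec = count / elapsed wall-clock time

    unsigned int *h_hits = new unsigned int[threads];
    cudaMemcpy(h_hits, d_hits, threads * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    unsigned long long total = 0;
    for (int t = 0; t < threads; ++t) total += h_hits[t];
    printf("hits: %llu\n", total);

    delete[] h_hits;
    cudaFree(d_hits);
    return 0;
}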

I will now rewrite the CELL engine to incorporate a number of new optimizations I have discovered whilst writing the CUDA version, but for the sake of reference, the previous CELL version running on 6 SPUs @ 3.2 GHz ran at about 5.25 billion permutations/second.

I’ll post updates for an apples-to-apples comparison in the near future - with a very interesting “bang for buck” and power consumption graph…

Hope this is what those who were asking were looking for.

RPS.

Thanks for posting your results; that’s not a bad speedup. Is RSPC some kind of Reed-Solomon error-correcting code?

It would be interesting to see price-performance versus Cell also.

Yes, correct.

I have actually increased the speed of the engine in the last few hours, so it’s now exceeding what I thought the limits for the card were (5 billion iterations/sec on the GTS == 70 GB/sec), so I will clean it up today and let it lie there for now.
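
(For context on that equivalence: 70 GB/sec spread over 5 billion iterations/sec works out to roughly 14 bytes of memory traffic per iteration, assuming the figure refers to effective device-memory bandwidth - which is how I’m judging how close the kernel is getting to the card’s limits.)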

Back to the CELL work after that, but I’ve got to admit, the cards did much better than I ever thought they would - hell, I’ve seen slower RSPC implementations in FPGAs… :)

RPS.

After seeing this I went Googling around to find where to get PCI-Express boards with a Cell on them. The press release from Mercury put them at $7999 in July 2006. It also looks like IBM sells blade servers with 2 Cell chips for $9995. For “small” problems, that definitely gives CUDA on the 8800 GTX (or Tesla) a big price-performance lead.

Hi

What is the application domain for this code? Is this for wireless codes?

Is your code open-source or something you are doing as part of work at a company?

Sumit

Except for one thing: The PS3 running Linux costs $650 Australian (~$400 US) - and there is a price drop coming in the next quarter… so that changes the “bang for buck” graph drastically… :)

If we’re doing military-hardened CC/TA stuff, we’d use Mercury - but for generic consumer/commercial applications, the PS3 (which only uses 6 SPUs instead of the Mercury’s 8) is more than good enough. Add to that the fact that it only pulls 120 watts (according to our measurements), and it’s even more damned impressive.

RPS.

Hi Sumit,

The domain for this code is actually a mastering system (DVD/BD) that we’re prototyping. It does full EFM, EFM+, ETM and 17PP - the brute-force portion is there to test the various outputs of the encoder against known disc limitations prior to master cutting.

It’s for a commercial enterprise and about as closed source as you can get. :)

RPS.

That’s a good point. It’s too bad there is such an enormous price difference between the PS3 and the Mercury PCI-Express cards. A coprocessor board would be much easier to integrate into our analysis tasks than having to recompile everything to run natively on the PS3. (Nice that the PS3 has a gigabit port, though. That at least makes NFS a decently fast option.)

Plus, there is always the fun part of trying to explain to your boss how a video game console really is “work related.” My Ph.D. advisor was skeptical at first that a graphics card targeted at gamers would actually be an incredible number cruncher. :)

The reason we stayed clear of the PS3/Cell was its crappy hardware OpenGL support under Linux. As we’re mainly doing visualization- and animation-oriented stuff, this was a big showstopper.

The PS3 has an attractive price, but its measly 256 MB of memory makes it pointless as a computing platform for our application.

Something that I don’t think has been brought up yet in the comparison of the GPU to the Cell is programming difficulty. In my experience, the Cell is much more difficult to program, especially if you’re trying to use the data-parallel instructions. Granted, one has more flexibility with the Cell than with the GPU: each SPE can be doing a completely different task. In situations where the Cell, the GPU, or a multi-core CPU with SIMD give similar performance, I’ll prefer CUDA every time. CUDA is just a lot easier to program.

Someone already mentioned the low amount of memory available on the PS3 (pretty painful if you’re already running Linux as well). Something else to be aware of is that one of the SPEs is disabled on the PS3 (and another is reserved by the hypervisor, which is why only six are available under Linux). You have to decide whether the Mercury card or IBM blade is worth its price to get the additional SPEs and more memory.

There is no doubt that NVIDIA cards give serious bang-for-the-buck and CUDA is what really makes that performance accessible.

Sorry, I am compelled to answer and state that I disagree entirely here. CELL is much, much easier to program efficiently than CUDA. It is amazingly straightforward, and I find it strange that anyone could say otherwise - unless they haven’t actually programmed on CELL and are just regurgitating what they’ve read “on teh internets”. Furthermore, there are none of the issues with respect to memory coalescing, warp management, slow host-to-device transfers, etc.
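
(For anyone who hasn’t run into the coalescing issue I’m referring to, here is a minimal, generic illustration - nothing to do with our engine. On the G80, neighbouring threads reading neighbouring words get their loads merged into a few wide transactions; a strided pattern does not, and bandwidth collapses.)

// Minimal, generic illustration of memory coalescing -- not from any real engine.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                     // thread k touches element k: accesses coalesce
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((long long)i * stride < n)
        out[i * stride] = in[i * stride];   // neighbouring threads hit addresses
                                            // 'stride' elements apart: transactions
                                            // serialize and effective bandwidth drops
}

Time the two over the same array and the difference is dramatic - and that’s the kind of layout work you simply don’t have to think about on the SPEs.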

The low memory is part of the architecture: smart design and programming (using streaming) will negate much of its impact, just like on any other platform. One of my implementations actually does real-time decompression and re-compression on the fly - there is more than enough power on tap.

True, the PS3 has access to only 6 SPEs. The Mercury cards feature access to all 8 SPEs and more accessible memory, but that comes at a significant cost. In my applications, I have found that 6 SPEs @ 3.2 GHz are more than a match for a GTX-8800.

Do the basic math: a single PS3 costs us ~$400 US. A single GTX-8800 is still more expensive, and then you have to add the PC required to host it (with its own PSU, RAM, CPU and all the ancillaries). Then factor in the power usage - it all very quickly becomes a no-brainer in my books.

Furthermore, as someone who has actually used both systems and is now pretty proficient with both of them, I can tell you that the memory subsystem on the CELL, its vectorised engines and the flexibility of the MFC are, combined, the essence of what makes it the brutal performer that it is.

No one said that they didn’t, but put them head to head against the PS3 option and compare “bang for buck” - not to mention the power requirements of each - and the game changes significantly.

Personally, I couldn’t care less what system I’m writing for, but I do care when people make what appear to be non-objective comments on an internet support forum that others will read in future - and may then base important decisions upon. I consider this quite important, as I recall looking through numerous forums, doing as much research as I could prior to starting this project, and this place gave me some very good insights. I’m only trying to give a little back for those who are in the same boat.

I like to think about what might have been if there were a CUDA-style interface for the G70 (RSX) - or for that matter, if a G80 were in the PS3 - just imagine what that little black box could do… :)

RPS.

There isn’t even a decent hardware-accelerated GL interface you can use from Linux on that device, let alone something more advanced.

Part of the architecture, right. Smart programming won’t help you if the dataset you have to process is larger than the memory you have. And streaming? What do you want to stream over, the network? That’s even worse than the host<->device interface via PCI-E, which isn’t slow but very fast (up to 3.3 GB/s - gigabytes, not gigabits, mind you).
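
If you want to check that figure on your own box rather than take my word for it, a rough timing sketch looks like the following. (Pinned host memory via cudaMallocHost is what gets you near the peak rate; the exact number will vary with chipset and driver.)

// Rough host->device bandwidth check: time one large copy from pinned memory.
#include <cstdio>

int main()
{
    const size_t bytes = 256 << 20;                 // 256 MB test buffer
    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, bytes);         // page-locked host memory
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host->device: %.2f GB/s\n", (bytes / 1.0e9) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}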

To obtain the full power of the CELL it is necessary to use its SIMD instructions (not doing so leaves performance untapped), and unless the programmer has experience with SIMD programming, that can make the CELL challenging, in my opinion. This requirement also produces code that looks significantly different from “normal” C code, forcing either new development or substantial alteration of existing systems.

In comparison, porting a C-based application to CUDA can be quite straightforward (see the sketch below). Yes, obtaining maximum performance requires some thought about how the problem is decomposed and aligned with the hardware, but the core of the code remains accessible to many programmers. And while the NVIDIA hardware is essentially vector hardware underneath, CUDA hides the work of converting to explicit SIMD instructions.

I write this without bias - I have tried both platforms and recognize at least some benefits of each - but the greatest need is for environments, hardware and software, that enable the greatest number of people to develop high-performance programs, and those environments must be evaluated in terms of both capital cost and programming effort.
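
To make the porting point concrete, the sketch below is a deliberately trivial, generic example (a SAXPY-style loop, not taken from anyone’s codebase). The loop body carries over into the CUDA kernel essentially unchanged, with no vector intrinsics anywhere in the source; on the CELL the same loop would typically be rewritten with vector intrinsics such as spu_madd() plus explicit DMA into local store, which is exactly the “looks nothing like the original C” effect I mean.

// Serial C version:
//
//     for (int i = 0; i < n; ++i)
//         y[i] = a * x[i] + y[i];
//
// CUDA version: one element per thread. The hardware groups threads into
// warps and issues them SIMD-fashion, so there are no vector intrinsics
// anywhere in the source.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];    // loop body carries over unchanged
}

// Typical launch (assuming x and y are already in device memory):
//     saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);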

At the risk of sounding like I’m ‘regurgitating what they’ve read “on teh internets”’, readers interested in comparisons may like to read the following IEEE and Dr. Dobb’s articles. Both are very informative and present the CELL in a positive light while also acknowledging some shortcomings. The IEEE article is especially interesting, as it compares the same function implemented on FPGAs, the CELL, a CPU, and a 7900 GTX (it would be nice if the authors updated their comparison).

http://doi.ieeecomputersociety.org/10.1109/FCCM.2007.43

http://www.ddj.com/hpc-high-performance-computing/197801624