CUDA vs OpenCL

What is the difference between OpenCL and CUDA? Why is NVIDIA (a member of the Khronos Group) backing OpenCL?

I know a bit about OpenCL, but have never tried it. It doesn’t seem to be far along yet. However, it and CUDA are fairly similar. They both use blocks, warps, shared memory, and other such things. From the Wikipedia entry, it looks like the programming interface is similar to CUDA’s Driver API, and less user-friendly than the wonderful Runtime API.

Another difference is that Apple’s OpenCL implementation uses LLVM and Clang, which are interesting low-level technologies used by compiler writers. It remains to be seen, however, how that will impact programmers. (LLVM is a compiler infrastructure whose intermediate representation allows things like on-the-fly recompilation and optimization. Clang is a compiler front-end that, among other things, gives clearer diagnostics and easier ‘IntelliSense’-style tooling than gcc. CUDA’s compiler, btw, is based on Open64 compiler technology.)

It’s good that there will be competition in this space. At the very least, if OpenCL supports both ATI and NVIDIA (which, with Apple’s backing, it should), then CUDA will surely start to support ATI cards too.

@ Alex,

Thanks for this info. With ATI putting out faster and faster graphics cards at lower prices than NV cards, OpenCL could be an NV nemesis unless NV comes up with faster graphics cards at a competitive price. (ATI has cards that are reportedly faster than GTX 280s and priced much lower.) I am sure the NV guys are prepared for this challenge.

I am only looking at memory bandwidth, and there they are still slower. And memory bandwidth is what most of us poor souls are bound by. Also, for ATI to really be competitive they need to come up with S1070-like solutions, which is where the money can be made and what large cluster owners (read: large customers) will buy.
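For anyone wanting to redo the bandwidth comparison themselves, here is a minimal sketch of the usual back-of-envelope formula: memory clock × transfers per clock × bus width. The function name and the card figures are my own assumptions (spec-sheet numbers quoted from memory, not measurements), and theoretical peak says nothing about what a real kernel achieves.

```python
def peak_bandwidth_gb_s(mem_clock_mhz, transfers_per_clock, bus_width_bits):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = mem_clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bytes_per_transfer / 1e9


# Assumed spec-sheet figures (hypothetical inputs, please double-check them):
# GTX 280: 1107 MHz GDDR3 (2 transfers per clock), 512-bit bus
# HD 4870:  900 MHz GDDR5 (4 transfers per clock), 256-bit bus
gtx280 = peak_bandwidth_gb_s(1107, 2, 512)
hd4870 = peak_bandwidth_gb_s(900, 4, 256)

print(f"GTX 280: {gtx280:.1f} GB/s")
print(f"HD 4870: {hd4870:.1f} GB/s")
```

With those assumed inputs the two cards land in the same ballpark, with the GTX 280 ahead on paper. Achieved bandwidth depends on access patterns, so treat this only as an upper bound.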

Thanks for your input. Yeah, the S1070 blade solution is a really innovative one from NVIDIA and can be integrated into the blade farms of any compute data center.

btw, I did not know about the memory bandwidth thing from ATI. Probably they have a cache to offset it? (just guessing)…

I have not peeked into their “Stream computing”. But there have been reports that ATI graphics cards are significantly faster than the GTX 280 and priced lower… Does that not mean that OpenCL applications would also run faster once it is supported?

thank you!

You don’t really know what the performance of ATI’s memory subsystem is. Theoretical numbers aren’t too different, and you have no idea what their coalescing rules, warp size, latency, etc. are. Latency, in fact, should be halved since their numerous SMs run much more slowly.

Also, I have a feeling off-the-shelf solutions like the S1070 will crop up soon. They’re actually pretty easy for a third party to make, and will have an enormous price advantage over the proprietary S1070 by being compatible with commodity GPUs. Plus, I think when quad PCIe x16 becomes more common in servers, people will just go for direct-mounted nodes. It’s a slightly less complicated, cheaper, and higher-performance arrangement.

From reading a post on gpgpu.org I found some interesting info:
http://www.gpgpu.org/forums/viewtopic.php?p=21154

A very interesting article discussing who is involved in OpenCL and where it came from can be found here:
http://www.hpcwire.com/blogs/OpenCL_On_the…k_33608199.html

It says also that a technical briefing for OpenCL is due on Monday 17th Nov 08:
http://www.khronos.org/news/events/detail/opencl_sc08/ (you can sign up for a newsletter here)

I imagine from reading the article that OpenCL is going to extend Apple’s work on OpenCU (since Apple handed it over to Khronos earlier in the year) for which a technical specification is available here:
http://www.wipo.int/pctdb/images4/PATENTSC…0/18/00f018.pdf (WIPO document PCT/US2008/004648)

It looks quite similar to the CUDA driver API. It would be wonderful if ATI were to support OpenCU, and if NVIDIA would provide an OpenCU abstraction layer for CUDA, then we’d all need to relearn very little for cross-platform GPU computing. This document doesn’t discuss anything like the CUDA runtime API, so maybe that would remain CUDA’s trump card for ease of use. But for those seeking portability, using an OpenCU abstraction and providing a different binary for each architecture from each vendor’s compiler would be the way to go.

Matt

Yeah, I also read that AMD is going to make their stream computing available on normal cards. And they will partner with some other company to build Tesla-like rack servers (although that one will be 4U for 6 cards, if I am not mistaken). Apart from that, there does not seem to be anything happening to update their SDK and so on; it looks like they just want to avoid being forgotten while they wait for OpenCL. Having once looked a bit at their stream computing SDK, I cannot imagine anyone using it when CUDA is also available.

Personally I have only used the runtime API, and will only switch to the driver API for use in production; the runtime API is just too convenient.

Thanks for the links, Matt. The OpenCL site says “It is open and royalty free…”, so I think there is no need to worry about IP…

I could not open this link though.

Best Regards,

Sarnath

AMD’s pretty good at fooling the press actually: http://www.theinquirer.net/gb/inquirer/new…ream-everywhere

Looks like some ignorant tech writer bought the press releases and doesn’t have a clue about real technology.

Another way to measure is to look at the AMD tech forums: http://forums.amd.com/devforum/categories.cfm?catid=328

They aren’t exactly busy, and the posts that are there aren’t very meaty.

Try this link instead:

http://forums.macrumors.com/showthread.php?t=588206

See also the discussion on OpenCL in this other thread:

http://forums.nvidia.com/index.php?showtopic=69731&st=0

A general overview is here:

http://s08.idav.ucdavis.edu/munshi-opencl.pdf

I follow Charlie’s articles at The Inquirer a bit… And there is a general disgust among readers for his anti-NVIDIA stance… Whenever he gets a chance, he bashes NV.

I think the patents would NOT be on OpenCL itself… but on Apple’s implementation (the run-time compiler, bla bla bla…).

Otherwise, we should just call it “ClosedCL” or “AppleCL”.

OK, both links are dead. The forum one somebody posted died ages ago (which is why I turned to WIPO), and mine seems to expire with a session timeout. For those interested:

Go to http://www.wipo.int
Resources->Search ip database->paste PCT/US2008/004648 and change to patents
Then search and you should be able to navigate to the document with the document tab (first result should take you to the right page), and download the PDF with 78 pages (first one in ‘related documents’ on my search).

Thank you Matt… That one worked!

A bit off-topic, but not really, since we’re talking about the OpenCL patents. I found an article in The Register a few days ago. Take a look at the advice Microsoft recently sent out to its developers regarding reading patents:

From: http://www.theregister.co.uk/2008/11/13/mi…rance_is_bliss/

I think whoever creates a decent compiler and a reliable scheduler first will win. The compiler is probably the biggest drawback of CUDA right now… I’ve got one kernel that in C uses 170k+ registers… but when I wrote the same thing in PTX it used 64 and fewer than 2/3 of the instructions. As for scheduling: fix the status and event query problems, and queue kernel executions in the runtime library so every application doesn’t need its own scheduler.

Just my two cents.

170,000 registers? I assume that’s a typo and you mean 170?

How are you measuring those? If you just emit PTX, you won’t get real assembly code. Real optimization occurs in translation from PTX to CUBIN via ptxas. Also, nvcc uses single-assignment when emitting PTX (a ‘register’ is never reused) so that ptxas will have an easier time optimizing. Decompile cubins using decuda to see what your kernel actually looks like after it’s compiled.
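To illustrate why counting ‘registers’ in single-assignment PTX overstates things, here is a toy sketch in Python. It is not how ptxas works internally; the instruction format, the names, and the example program are all made up for illustration. It compares the number of virtual registers in a straight-line single-assignment program with the maximum number that are live at the same time, which is what a register allocator actually has to fit into hardware.

```python
def virtual_register_count(program):
    """Number of distinct destination names (one per single-assignment def)."""
    return len({dest for dest, _ in program})


def max_live_registers(program, live_out):
    """Peak number of simultaneously live values in straight-line code.

    Walks the program backwards: a value stops being live above its
    definition and becomes live above each of its uses.
    """
    live = set(live_out)
    peak = len(live)
    for dest, sources in reversed(program):
        live.discard(dest)
        live.update(sources)
        peak = max(peak, len(live))
    return peak


# Hypothetical kernel body: sum four loaded values. Each tuple is
# (destination, [source registers]); loads have no register sources.
program = [
    ("t0", []),            # t0 = load a
    ("t1", []),            # t1 = load b
    ("t2", ["t0", "t1"]),  # t2 = t0 + t1
    ("t3", []),            # t3 = load c
    ("t4", ["t2", "t3"]),  # t4 = t2 + t3
    ("t5", []),            # t5 = load d
    ("t6", ["t4", "t5"]),  # t6 = t4 + t5
]

virtual = virtual_register_count(program)
physical = max_live_registers(program, live_out={"t6"})
print(virtual, physical)
```

In this toy case the single-assignment listing names seven ‘registers’, but no more than two values are ever live at once, so the gap between a PTX count and the real register count after allocation can be large.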

Moreover, the compiler actually makes a small difference in the scheme of things. OK, let’s assume you got a 2x boost from hand-coding assembly. But are you sure you have perfected coalescing, eliminated bank conflicts, serialization, and divergence, rewritten your algorithm to use shared mem effectively, and put critical arrays into registers via loop unrolling? Because if you missed even one of those things, then you’re worrying about the wrong thing.
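As a rough illustration of the bank-conflict item on that checklist, here is a small Python model of strided shared-memory access. It assumes 16 banks of 32-bit words and a half-warp of 16 threads (the G80/GT200-era layout), and it treats threads reading the same word as a conflict-free broadcast, which is a simplification. The function name is my own.

```python
from collections import defaultdict


def bank_conflict_degree(stride_words, num_banks=16, threads=16):
    """Worst-case number of serialized accesses for one half-warp.

    Thread t reads word index t * stride_words. Distinct words that map
    to the same bank are serialized; identical words are assumed to be
    broadcast (a simplification of the real hardware rules).
    """
    words_per_bank = defaultdict(set)
    for t in range(threads):
        word = t * stride_words
        words_per_bank[word % num_banks].add(word)
    return max(len(words) for words in words_per_bank.values())


for stride in (1, 2, 16, 17):
    print(stride, bank_conflict_degree(stride))
```

Under these assumptions a stride of 16 words serializes all 16 threads, while a stride of 17 is conflict-free, which is why padding a 16-wide shared-memory tile to 17 columns is a common trick for column-wise access.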