CUDA vs. OpenCL comparison (tags: cuda, opencl, nvidia)

I was wondering if anyone could recommend a good comparison of CUDA and OpenCL, and whether it is better to start a new project with OpenCL as opposed to CUDA. OpenCL seems to be more portable than CUDA but will probably lag behind in terms of advanced features. It seems that an application programmed with CUDA has the slight disadvantage of requiring the user to have access to NVIDIA hardware, whereas with OpenCL the end user could use almost anything. Is there a performance trade-off for the portability of OpenCL?

I have some code written in CUDA but I wouldn’t mind transitioning it to OpenCL if there are big advantages to doing so.

Best,

-Nachiket

Well, the OpenCL syntax is based on the CUDA driver API (CUDA has a low-level driver API and a high-level runtime API, and you can mix and match the two). OpenCL hasn’t been around as long as CUDA, so the compilers aren’t quite as optimized yet, meaning your code will run a bit slower than equivalent CUDA code. Also, since OpenCL doesn’t expose some platform-specific features, the code will always be a bit slower than CUDA because you won’t be able to tune it to the hardware as well.
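
To make the similarity concrete, here is a minimal vector-add kernel in both dialects (a sketch with arbitrary names; mostly the qualifiers and indexing built-ins differ):

    // Minimal CUDA vector-add kernel (illustrative; names are arbitrary).
    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // The equivalent OpenCL kernel differs mostly in the qualifiers and
    // the indexing built-ins:
    //
    // __kernel void vec_add(__global const float *a, __global const float *b,
    //                       __global float *c, int n)
    // {
    //     int i = get_global_id(0);
    //     if (i < n)
    //         c[i] = a[i] + b[i];
    // }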

It’s really just a matter of preference. You can write something in CUDA and get somewhat better performance (but be tied to nVidia hardware), or write it in OpenCL if you want the portability.

You need OpenCL for portability. If you are afraid that CUDA and NV might vanish tomorrow, you should have your production code in OpenCL.

In other words,

OpenCL - Choice of language for managers for their developers
CUDA - Choice of language for developers

My 2 cents (drawn from using CUDA on several medium-to-large projects, and OpenCL for a medium-size project):

    As mentioned above - from the programmer’s point of view, both APIs are rather similar, so for a novice programmer I’d say it takes a similar amount of effort to learn either one of them, and for an experienced CUDA programmer it should be rather easy to switch to OpenCL quickly.

    OpenCL has the advantage of supporting multiple platforms from the same code base, theoretically; but in my experience so far this is something of a myth - for maximum performance, you just have to tweak your kernel(s) for each platform specifically, and maintaining that kind of code over the longer term is not fun at all. I’d also say a related factor here is that OpenCL in reality doesn’t have that broad support: it seems to me that AMD is still not convinced about GPGPU, NVIDIA is obviously not that interested, Intel seems to be cooking up something (Ct) of its own, Microsoft has DirectCompute (a silly API, but it is certainly going to be used by a large number of MS-only programmers), there exist a significant number of efforts to build higher-level APIs (like Ct), etc.; practically, I’d say at the moment Apple is the only company strongly standing behind OpenCL, and admittedly that doesn’t mean that much.

    CUDA has the advantage of a much more mature implementation at the moment - the drivers are just better, and it’s much easier to get optimum performance from your CUDA code than is the case with OpenCL. CUDA also has a much more mature ecosystem built around it: more high-performance libraries are available for CUDA, accompanying tools like debuggers and profilers are more mature, and generally there are many more people with good CUDA knowledge who may be able to help you during your development (just compare discussions on this forum with, say, discussions in the OpenCL section of the AMD forum, and you’ll see the difference for yourself). Another related CUDA advantage is that API development seems vibrant so far: as there is only one decision-maker, it is easy to have the API changed to accommodate new hardware features; on the other side, OpenCL is much more design-by-committee, and I’d expect the API to lag in the future (just as was/is the case with the OpenGL API).

So, overall I’d say it boils down to the question of whether you are able to influence, at least to some extent, the hardware purchases of your clients - if you can, then it’s a safe bet to go with CUDA; at the moment I’d go with OpenCL only if you don’t know what hardware your customers will use, and you are really convinced OpenCL will gain strong momentum in the coming period.

From what I have seen, Ct is not in competition with CUDA or OpenCL, but rather is meant to be implemented on top of them - just like TBB is built on pthreads or whatever is available on the platform. So looking at the layers, it will be more like:

    High level programming: Ct

    Intermediate level programming: CUDA Runtime API on NVIDIA GPUs

    Low level programming: OpenCL or CUDA Driver API

Of course I left out a lot of stuff here, like the host side, which OpenCL and Ct will probably both cover properly, and IMHO outdated stuff like Brook+ and CAL. In addition, one may hope that Ct might automatically balance your work between host and device somewhat.

Of course, as long as Intel hasn’t shown its cards, all of this is pure speculation from the little pieces of information publicly available, and I might just as well be completely wrong.

I might have a misconception of OpenCL, but I believe that for OpenCL you have to ship your source code, as the source is compiled just-in-time on whatever platform people use it on, whereas with CUDA you can just ship a binary…

@theMarix: Intel Ct is indeed much higher level than CUDA (and OpenCL), but I still think (if anything actually comes out of this, of course) it is going to be a competitor to CUDA. Namely, for many programmers a higher-level abstraction like Ct is going to be much preferable to CUDA, even if it does not provide optimal performance (and on the other side, I’d expect Ct to draw much from the RapidMind guys’ work, and they have already shown they are able to offer a higher-level solution with support for multiple back-ends and decent performance). There are also lots of Ct-like efforts for some niche, but still very important, markets: for example, I’m aware of several efforts to support GPU programming from Matlab, then there is the PGI Accelerator stuff (from PGI - the same guys that are supporting CUDA Fortran; PGI Accelerator is something like OpenMP, but for GPUs), etc. I think most of these solutions are actually kinds of CUDA competitors, and also that most of them will survive, at least for a while (after all, my impression is that even on this forum, throughout the last couple of months or so, we are witnessing that more and more programmers are approaching GPU programming, but also that many of them simply are not up to par with doing stuff at a level as low as required by CUDA, or OpenCL).

@E.D. Riedijk: In OpenCL, you could compile your code to an intermediate representation (practically, PTX), and you could deliver this instead of the C source of your kernels. I know it is not the same thing as machine code generated by nvcc, but it is actually similar, and if an attacker is determined enough, reverse engineering is in principle possible in both cases. So if this aspect is of critical importance for some code, I’d say in both cases one would have to turn to other known solutions to this problem, like code encryption etc.
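
For illustration, a minimal sketch of the approach described above, using the standard OpenCL program-info queries (single-device case; the function name is arbitrary and error checking is omitted):

    #include <stdlib.h>
    #include <CL/cl.h>

    /* Fetch the "binary" (PTX on NVIDIA) of an already-built program so it
       can be shipped instead of kernel source. */
    unsigned char *get_program_binary(cl_program program, size_t *size)
    {
        clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                         sizeof(*size), size, NULL);
        unsigned char *binary = malloc(*size);
        clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                         sizeof(binary), &binary, NULL);
        /* Later, on the end user's machine, recreate the program with
           clCreateProgramWithBinary() instead of
           clCreateProgramWithSource(). */
        return binary;
    }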

You can only compile to PTX if NVIDIA GPUs are your target. And if that is the case, you are better off with CUDA…

So for any company that does not want to deliver source code (obfuscated or not), OpenCL is actually not an option. It’s OK in the academic world, but not otherwise.

The more I learn about OpenCL, the more I don’t like it (I was very happy about it at first ;))

Well, you could build binary representations for several devices (e.g. PTX for CUDA-enabled GPUs, CAL or whatever they use for AMD GPUs, and a native binary for a CPU implementation) and load them up based on device queries.

It’s not pretty but, when you think of it, if you want true portability, you either need a virtual machine of some sort that interprets bytecode, or you redistribute source code one way or another so that it can be compiled for the client’s hardware. You can’t have true portability with native binaries by definition; it’s not really OpenCL’s fault. Some more in this thread: http://forums.nvidia.com/index.php?showtopic=155965 .

@E.D. Riedijk: To make a small addition to Big_Mac’s clarification above: the AMD OpenCL compiler transforms the code into LLVM virtual-machine assembly code, the new Fixstars FOXC compiler translates it practically into SSE assembly instructions, etc. - so I actually intended to mention PTX above only as an example… In any case, most of the time the “binary” representation of compiled OpenCL code is actually still in some kind of readable form, but then again, reverting CUDA machine code back to PTX should be doable by a determined hacker. So I’d still say this is not something that differentiates existing CUDA tools from the corresponding OpenCL tools that much.

On the other side: I was enthusiastic about OpenCL at the beginning too (always preferring standard over proprietary solutions). But just as you mentioned, after actually spending some time using it, and thus hopefully being able to assess its position on the market more realistically, at the moment I see almost no advantage in using it over CUDA when working at this low level of abstraction.

I also prefer CUDA, but in my opinion there is one thing missing: a supported platform to execute PTX code on the CPU. A Windows version of Ocelot that is supported by NVIDIA is what I want.

You may ask: why do you need that? The simple answer is that not every PC has a CUDA-capable GPU, and developers don’t like doing the work twice to support a second platform - CUDA code plus an OpenMP version of the same algorithm.

For me there are two reasons to use OpenCL. The first is platform independence, and the second, probably more important, point is that you can also execute OpenCL on the CPU (if only as a fallback solution).
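
A minimal sketch of that fallback idea (assuming a single platform; the function name is arbitrary and error handling is omitted): ask for a GPU device first, and fall back to the CPU if none is found.

    #include <CL/cl.h>

    /* Prefer a GPU device, but fall back to the CPU so the same OpenCL
       code still runs on machines without a suitable GPU. */
    cl_device_id pick_device(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU,
                           1, &device, NULL) != CL_SUCCESS)
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU,
                           1, &device, NULL);
        return device;
    }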

In fact, converting from CUDA machine code back to PTX has been possible for several years now:

http://wiki.github.com/laanwj/decuda/

Lets you convert between the .cubin format and human-readable PTX. (Of course, PTX is not very fun to work with, so this is no more help for reverse engineering than disassembling a normal CPU executable.)

Somehow, right from its inception, I have had the feeling that OpenCL will be a dead technology… Strictly my personal opinion.

Oh man, at least for me, OpenCL is very difficult to work with.

First of all, not all OpenCL functions return an error code - the object-creating calls hand the status back through an output parameter instead - which means you can no longer wrap every call in the same error-checking macro.
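
To illustrate the point (a sketch; CUDA_CHECK is just an illustrative macro name): in CUDA virtually every runtime call returns a cudaError_t, while OpenCL’s object-creating calls return the object itself and pass the status back through a pointer parameter.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* CUDA: one macro can wrap (nearly) every runtime call. */
    #define CUDA_CHECK(call) do {                                  \
        cudaError_t e_ = (call);                                   \
        if (e_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(e_));                       \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

    /* Usage: CUDA_CHECK(cudaMalloc(&d_ptr, bytes)); */

    /* OpenCL: clCreateBuffer() returns the cl_mem itself, so the error
       arrives through the last parameter and the macro doesn't fit:

       cl_int err;
       cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                   bytes, NULL, &err);
       if (err != CL_SUCCESS) { ... }                              */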

Second, OpenCL does not support any C++ features AFAIK. In CUDA, I have several (actually, quite a few) headers that are used for both host and device compilation. And in those headers, I am using templates so that I can keep one version of the code for single and double precision. Because CUDA kernels are compiled at build time, it is very easy to just use include directives and the nice C++ features provided by CUDA. In OpenCL, I would have to make substantial modifications to the code and, most annoyingly, keep separate versions of the device and host code.
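
A minimal sketch of the template pattern being described (the kernel and names are illustrative):

    // One templated CUDA kernel serves both precisions; nvcc
    // instantiates it for each type at compile time.
    template <typename Real>
    __global__ void axpy(int n, Real a, const Real *x, Real *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] += a * x[i];
    }

    // Launch whichever precision you need; in OpenCL C (plain C99) the
    // kernel would have to be duplicated or generated per type:
    //   axpy<float> <<<grid, block>>>(n, 2.0f, xf, yf);
    //   axpy<double><<<grid, block>>>(n, 2.0,  xd, yd);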

I do like the fact that OpenCL can run on non-NVIDIA platforms, but I feel it still needs more work. I plan on writing OpenCL code alongside CUDA code, but the CUDA code will always carry the burden of being tested and optimized. OpenCL code for me will be a boring and tedious unrolling of already-optimized CUDA code, at least until OpenCL gains some of the features I adore in CUDA.

So, to give you a suggestion: as a developer, I would say don’t bother with OpenCL beyond having an OpenCL-capable module in your program; and definitely keep your CUDA code.

Alex

The only reasons I can think of to use OpenCL over CUDA are:

  1. Non-NVIDIA hardware, and
  2. OpenCL lets you compile code at runtime

Other than that, OpenCL tends to be slow, verbose, and buggy compared to CUDA.

The CUDA runtime API supports JIT compilation at runtime as well. In fact, if you were being uncharitable, you might refer to OpenCL as the “poor man’s” CUDA runtime API. But only if you were being uncharitable, of course…
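
For the record, the explicit JIT path on the CUDA side lives in the driver API - a minimal sketch (assuming ptx holds PTX text produced earlier by nvcc; the function name is arbitrary and error checking is omitted):

    #include <cuda.h>

    /* Load and JIT-compile a PTX string for the current GPU via the
       driver API. Producing the PTX itself still requires nvcc
       ahead of time. */
    CUfunction load_kernel(const char *ptx, const char *name)
    {
        CUmodule module;
        CUfunction function;
        cuModuleLoadData(&module, ptx);          /* JITs the PTX */
        cuModuleGetFunction(&function, module, name);
        return function;
    }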

JIT compilation only works on PTX assembly code, so I wouldn’t really call it compilation per se. With OpenCL, you can compile from C source code at runtime.
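
By contrast, the OpenCL path compiles plain C source at runtime - a minimal sketch (context and device assumed already set up; names are illustrative and error checking is omitted):

    #include <CL/cl.h>

    /* Build a kernel from C source at runtime; no offline compiler
       is needed on the end user's machine. */
    cl_kernel build_kernel(cl_context ctx, cl_device_id dev)
    {
        const char *src =
            "__kernel void scale(__global float *x, float a) {"
            "    x[get_global_id(0)] *= a;"
            "}";
        cl_int err;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src,
                                                    NULL, &err);
        clBuildProgram(prog, 1, &dev, "", NULL, NULL);
        return clCreateKernel(prog, "scale", &err);
    }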

On the Mac OS X platform, OpenCL support is integrated into the OS itself, and it shows some advantages, such as being able to run OpenCL code flawlessly on the CPU, on nVidia GeForce 8xxx and later GPUs, and even on Radeon 4xxx GPUs (5xxx not currently available on Mac).

But even with this advantage, I think CUDA is more mature at this point if you want to obtain real-world high-end GPU performance; moreover, a good knowledge of CUDA and nVidia architectures is mandatory to exploit OpenCL efficiently too. So I am coding today for CUDA while preparing to use OpenCL when it becomes truly efficient.

For the CPU, I use a different path, and as for the Radeon 4xxx, I prefer not to comment on their real-world OpenCL performance level, which makes them useless!

I am actually seeing a big slowdown running CUDA compared to OpenCL. I can’t work out why (I started this thread on it, if anyone has ideas: http://forums.nvidia.com/index.php?showtopic=161470).

Assuming that is some weird compile-flag issue, I’d say the differences I’ve seen are:

  • No runtime compilation in CUDA (you can assemble PTX at runtime, but to generate the PTX you need nvcc).
  • Less sophisticated synchronization primitives in CUDA (CUDA “streams” versus OpenCL “events”; there is some overlap, but I believe there are things you can do with events that you can’t do with streams).
  • In CUDA you must explicitly “push” and “pop” the context every time it is used across concurrent threads (see the sketch after this list).
  • No PTX equivalent in OpenCL (although if you’re using Nvidia’s OpenCL, you can of course get at the actual PTX data).
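
A sketch of the push/pop dance from the third bullet (driver API; a context can be current to only one host thread at a time, so each thread brackets its GPU work like this - the function name is illustrative):

    #include <cuda.h>

    /* ctx was created once, e.g. with cuCtxCreate(). Each worker thread
       must make it current before touching the GPU, and release it
       afterwards, so that other threads can do the same. */
    void worker(CUcontext ctx)
    {
        cuCtxPushCurrent(ctx);   /* bind ctx to this thread */
        /* ... launch kernels, copy memory ... */
        cuCtxPopCurrent(&ctx);   /* unbind so another thread can push it */
    }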

Check out the VexCL library for easy-to-use templated OpenCL: https://github.com/ddemidov/vexcl .