OpenCL or CUDA?

Hi Guys,

My friend says that CUDA is passé and OpenCL is the new thing, even according to Nvidia.
What do you think? And please also indicate whether you are objective.

Thanks

I’m very positive about OpenCL, and it’s been OK for me on Nvidia cards.

One objection currently is that OpenCL (at least the last time I checked) lacks some functionality I need that is present in CUDA (certain atomic functions, etc.).

My major objection is cross-platform performance: you need to write code specific to either Nvidia or AMD to fully utilize those architectures.

Code suitable for CUDA might only be able to utilize 20% / 25% of the AMD VLIW5 / VLIW4 compute units. Furthermore, AMD seems to be changing their architecture to come closer to Nvidia’s, going from VLIW5 → VLIW4 → GCN, which to my understanding will look a lot like Fermi. So why code for an architecture that is changing so much? With CUDA on Nvidia cards I can expect near-perfect scaling from generation to generation.

Once the AMD architecture is more stable, it will be much more attractive, even if cross-platform performance might still not be perfect.

Lack of C++ features such as templates is what kills OpenCL for me.

Templates? CUDA doesn’t even support double. I mean, it’s such low-level programming that I can’t even consider that a factor. Why do you need templates?

But what about its main feature, that it can utilize the CPU and the GPU at the same time, with the same code?

How does it do that anyway? I got the feeling, with all the occupancy calculations, memory types, byte alignments, and other nonsense, that CUDA code needs to be specialized for a GPU, and that as a consequence the CPU is a different animal altogether. So I can’t even fathom code that generalizes and is efficient for both the CPU and the GPU.

Yes, CUDA supports doubles (on compute capability 1.3 and later).

Templates are great for generic code, and believe me, you can definitely write nice generic code with CUDA.
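For instance, here’s a minimal sketch (a hypothetical axpy kernel; the names are mine) of one kernel source serving both float and double:

    // One templated kernel source, instantiated per element type.
    template <typename T>
    __global__ void axpy(int n, T a, const T* x, T* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host side: the same source serves two precisions.
    // axpy<float> <<<grid, block>>>(n, 2.0f, d_xf, d_yf);
    // axpy<double><<<grid, block>>>(n, 2.0,  d_xd, d_yd);

In OpenCL C you’d have to duplicate the kernel (or generate it by pasting strings together) for every type.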

• Possible for both CUDA and OpenCL (see Ocelot). What about it?

All the “nonsense” also exists on a CPU, but it’s (unfortunately) quite obscured from the programmer. I haven’t concerned myself much with cross-platform performance CPU ↔ GPU, but I remember Gregory Diamos posting very positive results for Ocelot (CUDA on CPU…), which seemed to indicate that this programming model is nicely applicable on the CPU as well.

I will admit I have so far been watching OpenCL from a distance, as I know someday I will be porting my code to OpenCL. The few articles I’ve looked at have indicated that OpenCL is far from “write once, run everywhere (efficiently)”, requiring substantial customization to get reasonable performance on AMD vs. NVIDIA vs. CPU. The main problem is that the AMD architecture emphasizes (small, like 3 or 4 component) vector instructions, whereas the NVIDIA architecture is a scalar architecture with the vectorization happening only at the thread level. And CPUs are such completely different beasts that the tradeoffs between arithmetic and memory throughput change everything.
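To make the scalar-vs-vector point concrete, here’s a sketch of the two styles (CUDA syntax, but the same contrast exists with OpenCL’s float4):

    // Scalar style, what NVIDIA wants: one element per thread;
    // the vectorization happens across the 32 threads of a warp.
    __global__ void scale_scalar(int n, float a, float* x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    // Vector style, what the AMD VLIW units want: each work item
    // handles a float4 so the 4-5 slots per unit have work to do.
    __global__ void scale_vec4(int n4, float a, float4* x)  // n4 = n/4
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 v = x[i];
            v.x *= a; v.y *= a; v.z *= a; v.w *= a;
            x[i] = v;
        }
    }

The scalar version is the natural thing to write, and on VLIW hardware it can leave most of the slots in each compute unit idle.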

If you focus just on NVIDIA, then in most cases OpenCL should be comparable to CUDA, assuming you don’t need the extra CUDA language features. Honestly, the hardest part of either OpenCL or CUDA is figuring out the data parallel programming style they both require. Whichever one you pick to learn is fine.
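If it helps, the mental shift is roughly this (a trivial sketch):

    // Serial habit on the CPU:
    // for (int i = 0; i < n; ++i) y[i] = x[i] + y[i];

    // Data-parallel habit: the loop disappears; each thread owns one i.
    __global__ void vec_add(int n, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = x[i] + y[i];
    }
    // Launched as one thread per element:
    // vec_add<<<(n + 255) / 256, 256>>>(n, d_x, d_y);

Once that clicks, the CUDA and OpenCL versions are nearly line-for-line the same.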

And what does it mean to indicate if I’m objective? Does having a lot of experience with CUDA disqualify my opinion? :)

If you just want to learn something new, it doesn’t matter so much which one you start with. It’s easy to switch from one to the other. There are even auto-converters out there.

If your computer has an AMD card, learn OpenCL. But if you start with CUDA, just don’t forget to play with the driver API.
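The driver API skeleton looks roughly like this (error checking omitted; “kernels.ptx” and “vec_add” are made-up names):

    #include <cuda.h>

    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    // Load a PTX module at runtime and pull a kernel out of it.
    CUmodule mod;   cuModuleLoad(&mod, "kernels.ptx");
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "vec_add");

    void* args[] = { &n, &d_x, &d_y };
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1,   // grid
                       256, 1, 1,               // block
                       0, NULL, args, NULL);    // smem, stream, params, extra

That runtime module loading is basically what OpenCL makes you do from day one, so playing with it makes the eventual switch less jarring.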

OpenCL presently seems more suited to consumer-level software, for which compatibility is really important. I suppose that’s the way you’ll go?

For HPC there currently really isn’t any need to go through all the hell OpenCL has to offer. Just stick with CUDA and MPI.

Don’t expect anything this short to be objective, though :)

Jimmy, I had only heard of Ocelot; now that you’ve reminded me of it, does it mean that I can finally debug my kernel code on the CPU? (Sorry, I don’t have two computers to run Nsight.)
Also, the point was that OpenCL can seamlessly utilize any processing unit on your system at the same time, with the same code, which is really my objective here. I have a CUDA-enabled card, but I thought it would be nice to use the CPU at the same time and make the code run even faster. I’m not sure how Ocelot works, but would it allow me to exploit all the resources in my system this way, transparently?

@hyqneuron, I’ve worked with CUDA a bit. What am I missing in the driver API? I’ve never touched it before.
What do you mean by ‘the hell that OpenCL has to offer’? Ease of use is an important factor. I’m suffering with CUDA as it is: I don’t find it natural enough, I need to architect my app specifically for it, and I tend to use it only on extreme bottlenecks.

BTW, @cbuchnerl, what do you think about Robert’s second comment on the following thread regarding templates:

I’ll elaborate on my main objective. My main concern is making the application run faster. I have a CUDA-enabled card and a strong CPU, and I’m not interested in compatibility issues; I only want to utilize all of my system’s resources (both GPU and CPU) to make my app run faster.
Initially I chose to develop in CUDA, since another friend of mine told me that it’s quite similar to OpenCL, but that since it’s Nvidia-dedicated, you get access to new improvements faster. Also, in general, a dedicated API sounds faster to me.

And thanks guys, you are objective enough :)

The problem with OpenCL is that there are basically no libraries available for it. If you want to do a sort, scan, FFT, sparse matrix, dense matrix, select, pattern match, etc., you’re on your own! And these are things that you don’t want to implement yourself.

This is the best reason to avoid OpenCL. These functions should be defined by Khronos, with every vendor providing their own high-performance implementation, but Khronos has no leadership (as has been obvious with OpenGL) to do this. There is no OpenCL ecosystem of independent developers providing these needed functions. Stick with CUDA. What you want may already be available in an SDK-provided library like CUFFT, CURAND, or Thrust, or a third-party one like CUDPP or MGPU (shameless plug: http://www.moderngpu.com/ ).
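For example, a GPU sort with Thrust is just a few lines:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    // Sort a million floats on the GPU without writing a single kernel.
    thrust::device_vector<float> d_vals(1 << 20);
    // ... fill d_vals ...
    thrust::sort(d_vals.begin(), d_vals.end());

Try finding the OpenCL equivalent of that.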

sean

Actually, I must comment that I totally agree with you, and I find this basic functionality missing in CUDA as well. Take, for example, linear algebra operations. I was very surprised to find that I needed to implement BLAS by myself, let alone an SVD algorithm. I mean, for the CPU I have endless implementations of these, and who has time to go and efficiently implement basic stuff such as LAPACK functionality? Nvidia’s CUBLAS is of course useless here, since it’s an independent package that uses CUDA transparently, and you can’t invoke it from device code. I remember thinking about this odd void and reaching the conclusion that since CUDA is really low-level, which shows in the need for your code to specialize for the number of threads, memory alignment, memory type, and so on, writing a general kernel library must be hard. This is related to my earlier comment that I find it hard to program in CUDA.
If the OpenCL architecture were more flexible on this point, it would be a big advantage. It doesn’t really matter which of CUDA or OpenCL is more popular right now, and thus has more libraries; what matters more is the API’s potential and its future.

I read the top google results on “cuda vs opencl”:

http://www.streamcomputing.eu/blog/2011-06-22/opencl-vs-cuda-misconceptions/
http://www.streamcomputing.eu/blog/2010-04-22/difference-between-cuda-and-opencl/
http://wiki.tiker.net/CudaVsOpenCL

http://siroro.co.uk/2011/08/02/gpgpu-programming-languages-compared-opencl-cuda-directcompute/

and I concluded that OpenCL might have nice potential in the future, as even the CPU goes multi-core. But nowadays it hasn’t got enough popularity, and the support from the vendors isn’t good enough. Programming in CUDA is much more comfortable, since it has a nice variety of tools and libraries (which I can’t use inside a kernel, so what’s the point?) and it supports C++. CUDA is faster than OpenCL, and I’m not sure whether that’s because Nvidia’s drivers for CUDA are newer or because they are more dedicated; still, the fact remains.
Concerning the point that OpenCL can utilize both the CPU and the GPU: my intuition tells me that adding 8 CPU cores that behave like GPU cores, of which I already have more than 200, sounds quite negligible as an addition of workers, and considering the price you pay to synchronize between the two, and the performance you sacrifice for code compatibility, it wouldn’t pay off in the end. I mean, I still need to understand how dedicated CUDA code, tuned to a specific number of threads, byte alignment, and memory type, could fit well on a CPU in a way that would be productive.
So for now I think I’ll stay with CUDA, but keep an eye on OpenCL’s development from a distance.

I thought that one big difference between a CPU with many cores and a GPU is the way the GPU hides latencies, something the CPU does not do at this moment. This is important for the performance of programs which execute single instructions on multiple data (SIMD). I do not think that many-core CPUs behave like GPUs. Maybe Intel’s Larrabee will do that.

The evolution is definitely converging. A CUDA multiprocessor and a CPU core with SIMD extensions are becoming more similar every day. A compute capability 2.0 multiprocessor effectively has a pair of SIMD engines that each process a 32-wide float instruction every 2 clocks. A Sandy Bridge CPU core (if I’m understanding the Anandtech article correctly) can process two AVX instructions per clock, with each AVX register holding 8 floats. (Interestingly, if you compare single precision MAD between the two architectures at their typical clock rates, the throughput should be pretty close, assuming you can actually keep the Intel pipeline full. NVIDIA still wins by having more MPs than Intel has cores.)
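Back of the envelope (the clock rates here are my assumptions):

    Fermi (CC 2.0) MP at ~1.5 GHz shader clock:
        32 lanes x 2 flops (MAD) x 1.5 GHz ≈ 96 GFLOPS per MP
    Sandy Bridge core at ~3.4 GHz, one AVX mul + one AVX add per clock:
        (8 + 8) flops x 3.4 GHz ≈ 54 GFLOPS per core

Same ballpark per unit; the GPU pulls ahead on unit count.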

Intel hides the latency with big caches, hyperthreading, and complex out-of-order execution to keep the pipeline full. NVIDIA preallocates registers for every thread and encourages the use of lots of warps per multiprocessor to achieve the same goal. It does look like Larrabee moves (moved? I guess it is sort of dead and reincarnated as Knights Ferry) further toward the NVIDIA model, expanding the width of the SIMD engine to 16 floats and expanding the hyperthreading to 4 threads per core. The closest concept to hyperthreading for NVIDIA is the number of active warps per MP, which is 48, so Intel is still very different here.

I expect that Intel will continue to expand the SIMD capabilities of the CPU, and NVIDIA will figure out how to more efficiently run less SIMD-oriented code without sacrificing too much throughput. (Execution on a single MP of many kernels to keep occupancy high? Variable sized warps? Put a 64-bit ARM core on the GPU die?! Sign me up!)

Wow. Lots of information. Thanks. Present CPUs are cool. They are powerful and quite effective for running everyday programs, but at the same time they carry lots of junk instructions which are kept for legacy code. For much high-performance computing they are useless, and combined with inefficient compilers it gets even worse. GPU programming is the future for high-performance computing. I made some tests for my own research with FFT, and I found that one Tesla card, which has a TDP of 225 W, was as fast as 15 × 100 W (hexacore AMD Opterons from the server) plus the InfiniBand communication, which is about 10-20 times more power for computing the same thing.
Both CUDA and OpenCL have their own good and bad points, but as a scientist I prefer something I can program relatively easily and get a reasonable speedup from. Usually I run programs on homogeneous clusters, and recently on Tesla cards. I use FFT, and converting my codes and testing took me less than 2 weeks with CUDA. I think CUDA is the best for me.

I concluded that OpenCL might have nice potential in the future, as even the CPU goes multi-core. But nowadays it hasn’t got enough popularity, and the support from the vendors isn’t good enough. Programming in CUDA is much more comfortable, since it has a nice variety of tools and libraries (which I can’t use inside a kernel, so what’s the point?) and it supports C++.

The fact that you can’t call CUDA libraries from kernels isn’t a flaw in them. This is inherent in GPU architecture. You can’t call complex functions from inside a kernel because you’ll need to launch a grid of threads, and that can only be done from the host. You can look at GPU coding as a bunch of map and reduce operations. The map is implemented with a grid launch… So don’t avoid these libraries because they’re not callable from device functions. They’re implemented the way they are because it’s the right way to program the device.
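By “map” I just mean a kernel that applies one function independently to every element, e.g. (a made-up example):

    // The grid launch *is* the map: each thread applies f to one element.
    __global__ void map_square(int n, const float* in, float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * in[i];   // f(x) = x*x
    }
    // map_square<<<(n + 255) / 256, 256>>>(n, d_in, d_out);

A reduce (summing the results, say) is then a second launch, or a call to something like thrust::reduce. Libraries like CUBLAS are built out of launches like these, which is why they are driven from the host.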

(I’m not sure what you meant by “a bunch of map and reduce operations”.)

Okay, they might be useful if they could compete with host libraries, but the fact remains that they are quite useless for a CUDA programmer, and a CUDA programmer shouldn’t have any preference for them over any other host library. For example, let’s say that I’m currently in my device code and I have 3 threads which need to multiply a 3x3 matrix by its transpose. Transferring the matrix to host memory, calling a procedure from CUBLAS, and then transferring the matrix back to shared memory would be the same as using a standard CPU BLAS library, which would probably be even faster.
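What I’d want instead is something done in place, like this sketch (my own made-up helper; A and C are 3x3 row-major arrays in shared memory):

    // Each of 3 threads (tid = 0..2) computes one row of C = A * A^T.
    __device__ void aat_3x3(const float* A, float* C, int tid)
    {
        if (tid < 3) {
            for (int j = 0; j < 3; ++j) {
                float s = 0.0f;
                for (int k = 0; k < 3; ++k)
                    s += A[tid * 3 + k] * A[j * 3 + k]; // row tid dot row j
                C[tid * 3 + j] = s;
            }
        }
    }

No library gives me that; I have to write it myself.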
So again, correct me if I’m wrong: all the CUDA libraries are quite useless for CUDA programming, and they are just CPU libraries which use CUDA, right?

I agree that designing a library for CUDA programming would be a bit more complicated than a host library. The user should supply parameters that describe the available warp threads, and the library should exploit them as best it can. But let’s say a library that addresses common thread configurations would be good enough for me. For example, for the above matrix multiplication, if such a library addressed the situation of 9 available threads and the situation of 1 available thread, it would be good enough, and I wouldn’t have to reinvent the wheel and re-implement common algorithms that have been lying around for 50 years and have countless host libraries implementing them (but without useful source code which I can at least tweak a bit and convert into CUDA code).

What you want requires very dynamic load balancing, which cannot be done efficiently on current NVIDIA hardware. Judging from what was said in a recent lecture by NV’s chief scientist, NV is probably doing something with load balancing… but who knows.

Anyway. If you really want that style of programming, you can check out MIC. MIC is more in line with that paradigm.