Any reason to choose CUDA over OpenCL?

I’m just starting to learn OpenCL and wondering if I should be learning CUDA instead of or in addition to OpenCL.

I understand that OpenCL is more cross-platform and is supported by both AMD and NVIDIA, and it runs on both CPUs and GPUs.

However, are there reasons to prefer CUDA? Does it have any advantages at this point? It seems that even NVIDIA is subtly pushing developers to OpenCL, judging by the language of their programming guides.

For example, if you plan to build your own personal supercomputer based on NVIDIA GPUs, CUDA is the best choice. You can also target the Fermi architecture, which is more advanced than ordinary GPUs. And developers often write a CUDA version for NVIDIA GPUs and an OpenCL version for Radeons, since they would need to adapt the program for Radeons anyway.

Template support. Third-party libraries, such as Thrust, CULA, and others.

Nice and large developer community here in these forums.

And OpenCL’s multi-platform support is overrated. Most experts I’ve talked to say that you need to write separate kernels specifically optimized for NVIDIA GPUs, ATI GPUs, and CPUs to get decent performance out of each architecture.

Have they quantified that at all? One could argue that templating is overrated, since to get optimized memory transactions, you’re going to need a different kernel for uchar as opposed to float.

John Stone talked about his experiences in an OpenCL tutorial: see http://www.ks.uiuc.edu/Research/gpu/files/…ntrowebinar.pdf, page 27/42 and the benchmarks on page 33.

Templating by data type is indeed not always advantageous due to memory issues, but templating with different computational operation functors is immensely useful.
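For instance, a sketch of what I mean (illustrative names, not from any particular library): a single kernel templated on an operation functor, instantiated once per operation with no runtime branching.

// Illustrative functors; __device__ call operators so they can run on the GPU.
struct Add { __device__ float operator()(float a, float b) const { return a + b; } };
struct Mul { __device__ float operator()(float a, float b) const { return a * b; } };

// One kernel template covers every elementwise operation.
template <typename Op>
__global__ void apply_op(const float *x, const float *y, float *out, int n, Op op)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = op(x[i], y[i]);
}

// Each instantiation compiles to a separate, fully inlined kernel:
//   apply_op<<<blocks, threads>>>(x, y, out, n, Add());
//   apply_op<<<blocks, threads>>>(x, y, out, n, Mul());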

I think there’s a pretty convincing argument that it’s easier to get a feature added to CUDA than OpenCL.

Very interesting. It would be great to see this sort of thing for more architectures and problems.

As to the portability of OpenCL’s performance, it seems that in the above slides, the same (scalar) kernel did best on both GPUs tested.

It’s my understanding that there is a fixed cost (100-200 lines of code) to basically call the compiler at runtime in OpenCL. After that, things get pretty equivalent, so for larger applications it shouldn’t matter. Is this correct? (I don’t know CUDA, and I’m roughly past those 200 set-up lines in my OpenCL learning curve.)
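For concreteness, the setup I mean boils down to something like this (a condensed sketch with error checks omitted; read_file is a stand-in for reading the .cl source into a string):

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

const char *src = read_file("kernels.cl"); /* stand-in helper */
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);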

You’re right, although NVIDIA’s “OpenCL best practices” has a textbook feel to it.

Is this only true for CUDA and not OpenCL? The CUDA 3.0 download page mentions, e.g. “Fermi HW profiling support for CUDA C and OpenCL in Visual Profiler”. I didn’t notice anything that’s CUDA C exclusive. What are those features?

That’s basically the plan, except I’m not sure about the architecture.

I think my application will be bottlenecked in dense matrix by sparse matrix multiplications.

The sparse matrix would be approximately square of size 2000x2000 or so with maybe 5-10% non-zeros. The dense matrices would be about 2000x(20-100). Because of the sparseness, the memory access would probably be non-coalesced. As far as I know, the more recent NVIDIA GPUs can handle non-coalesced access, but I don’t know about e.g. Radeons, etc.
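To make the access pattern concrete, the kernel I have in mind is roughly this (a naive CSR sketch, not a tuned implementation; one thread per output element, and the transposed orientation is analogous):

// Naive sketch: C (m x n) = A (sparse, CSR, m x k) * B (dense, k x n, row-major).
// For a fixed element e, consecutive `col` threads read consecutive B entries,
// but which rows of B get touched depends on col_idx, so the pattern is irregular.
__global__ void spmm_csr(int m, int n,
                         const int *row_ptr, const int *col_idx, const float *vals,
                         const float *B, float *C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m || col >= n) return;

    float acc = 0.0f;
    for (int e = row_ptr[row]; e < row_ptr[row + 1]; ++e)
        acc += vals[e] * B[col_idx[e] * n + col];
    C[row * n + col] = acc;
}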

By the way, thanks to everyone who replied so far. Much obliged.

In CUDA, calling a kernel is as easy as

kernel<<<n_blocks, n_thread_per_block>>>(arguments)

I haven’t taken the time to learn OpenCL yet, but don’t you need that 100-200 lines of code for every single kernel? Converting a production app that has 100+ kernels seems a daunting task to me.

I’m referring to the new intrinsics like floating point atomic adds, the new __ballot() warp voting mechanism, and new syncthreads variants (see appendix G.1 of the CUDA programming guide).
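A quick sketch of the kind of thing I mean (compute capability 2.0, CUDA 3.0 or later):

__global__ void demo(float *sum, unsigned *ballots, int *count, int value)
{
    // Hardware floating-point atomic add (new on Fermi).
    atomicAdd(sum, 1.0f);

    // Warp vote: bitmask of lanes for which the predicate is true.
    unsigned mask = __ballot(value > 0);

    // Barrier variant that also counts threads with a true predicate.
    int c = __syncthreads_count(value > 0);

    if (threadIdx.x == 0) { *ballots = mask; *count = c; }
}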

“That’s basically the plan, except I’m not sure about the architecture.”

For now, ATI multi-GPU support is not perfect. They may improve it in the future. Actually, CUDA and OpenCL are very similar, except for some CUDA features mentioned above. OpenCL is needed by those who want to target multiple architectures, for example all consumer 3D cards.
Based on your task description, you may benefit from the Fermi caches, because they help with non-coalesced access. Radeons are more vector-oriented, with a large warp size and strict coalescing rules.

No, e.g. if you put all your kernels into one file, you use this boilerplate code to read the file and compile it. Then, to extract a kernel from this “compiled bunch of kernels”, it’s just one line per kernel. Something like:

cl_kernel my_kernel = extract_kernel(my_program, "my_kernel");

where “extract_kernel” needs to only be defined once. This just calls clCreateKernel and checks the error code. This can be made terser still with a simple macro.
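In full, that one-time helper can be as small as this (a sketch; the error handling is up to you):

cl_kernel extract_kernel(cl_program program, const char *name)
{
    cl_int err;
    cl_kernel k = clCreateKernel(program, name, &err);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clCreateKernel(\"%s\") failed: %d\n", name, (int)err);
        exit(EXIT_FAILURE);
    }
    return k;
}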

Setting up the arguments is another matter, though. On page 11, the OpenCL JumpStart Guide shows the syntax for this in CUDA and OpenCL. Both seem atrocious.

CUDA:

cuParamSeti(cuFunction, 0, pDeviceMemA);

cuParamSeti(cuFunction, 4, pDeviceMemB); 

cuParamSeti(cuFunction, 8, pDeviceMemC); 

cuParamSetSize(cuFunction, 12);

OpenCL:

clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void *)&hDeviceMemA); 

clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void *)&hDeviceMemB); 

clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void *)&hDeviceMemC);

The kernel call itself can be abstracted down to 1-2 lines per call. Perhaps setting up the arguments can be done at the call site in one line with some macrology. What seems troubling though is that the above doesn’t look very type-safe.
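For instance, one possible macro (hypothetical, and it only papers over the verbosity, not the type-safety):

// Infers the size from the variable; still no compile-time type check
// against the kernel's actual signature.
#define SET_ARG(kernel, index, var) \
    clSetKernelArg((kernel), (index), sizeof(var), (void *)&(var))

// Usage:
//   SET_ARG(hKernel, 0, hDeviceMemA);
//   SET_ARG(hKernel, 1, hDeviceMemB);
//   SET_ARG(hKernel, 2, hDeviceMemC);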

All this looks fairly equivalent between CUDA and OpenCL, unless the “<<<…>>>” thing obviates the need to set the kernel arguments and does static type-checking. Does it?

The NVIDIA OpenCL Programming Guide, mentions in Appendix A.4 that “Devices of compute capability 1.1 and higher support the cl_khr_global_int32_base_atomics, cl_khr_global_int32_extended_atomics, cl_khr_local_int32_base_atomics, and cl_khr_local_int32_extended_atomics extensions.” I don’t know if it’s the same as what you are talking about though.
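For reference, using one of those extensions in a kernel looks like this (a sketch; note these are 32-bit integer atomics, not floating-point ones):

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void count_positive(__global const int *x, __global int *count)
{
    int i = get_global_id(0);
    if (x[i] > 0)
        atom_add(count, 1); /* 32-bit integer atomic from the extension */
}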

I’m curious about the specifics. What are their multi-GPU problems?

ATI’s top consumer card actually comes with two GPUs (with 2 GB VRAM and 4.6 TFLOPS peak performance per card).

Is there some documentation that discusses the Radeon coalesce rules in detail? I have their “OpenCL programming guide”, but it hardly mentions this topic.

"Is there some documentation that discusses the Radeon coalesce rules in detail? "
It is hard to obtain.
“ATI’s top consumer card actually comes with two GPUs (with 2 GB VRAM and 4.6 TFLOPSs peak performance per card).”
I am not talking about hardware. You may try to search multi gpu opencl applications for Radeon…

Btw, another reason is that OpenCL is now open source. You need to distribute your source code with your program.

What gets you to that conclusion? OpenCL is a specification, not a program.

OpenCL tends to compile source at the start of the program. It has no intermediate format like PTX.

In the paper below, two “identical” kernels are run under both CUDA and OpenCL. See the results on page 73 and 102 for a TDFIR filter and an interpolation kernel.

In the more compute-bound TDFIR kernel, the OpenCL implementation almost manages to catch up to the CUDA implementation, while it lags significantly behind in the (bandwidth-bound) interpolation kernel. There seem to be clear issues with coalescing for the OpenCL version. This is probably a driver issue (we were using beta drivers at the time) and might be resolved.

This was done on a Quadro 5800 and won’t tell you anything about how well it performs cross-platform.

http://www.hpcsweden.se/files/RadarSignalProcessingwithGraphicsProcessors.pdf

You are comparing the CUDA driver API to the OpenCL API. The CUDA runtime API automatically handles kernel arguments correctly and with type checking at compile time.
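Concretely, with the runtime API a launch looks like an ordinary typed function call (a minimal sketch):

// Parameters are declared with real types in the kernel signature...
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// ...so the launch takes arguments like any C++ call; passing, say,
// an int* where a float* is expected fails at compile time:
//   add<<<n_blocks, n_threads_per_block>>>(dA, dB, dC, n);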