Any reason to choose CUDA over OpenCL?

I’m just starting to learn OpenCL and wondering if I should be learning CUDA instead of or in addition to OpenCL.

I understand that OpenCL is more cross-platform and is supported by both AMD and NVIDIA, and it runs on both CPUs and GPUs.

However, are there reasons to prefer CUDA? Does it have any advantages at this point? It seems that even NVIDIA is subtly pushing developers to OpenCL, judging by the language of their programming guides.

For example, if you plan to build your own personal supercomputer based on NVIDIA GPUs, CUDA is the best choice. You can also target the Fermi architecture, which is more advanced than ordinary GPUs. And developers often write a CUDA version for NVIDIA GPUs and an OpenCL version for Radeons, since they would need to adapt the program for Radeons anyway.

Template support. Third-party libraries, such as Thrust, CULA, and others.

Nice and large developer community here in these forums.

And OpenCL’s multi-platform support is overrated. Most experts I’ve talked to say that you need to write separate kernels specifically optimized for NVIDIA GPUs, ATI GPUs, and CPUs to get decent performance out of each architecture.

Have they quantified that at all? One could argue that templating is overrated, since to get optimized memory transactions, you’re going to need a different kernel for uchar as opposed to float.

John Stone talked about his experiences in an OpenCL tutorial: see http://www.ks.uiuc.edu/Research/gpu/files/…ntrowebinar.pdf, page 27/42 and the benchmarks on page 33.

Templating by data type is indeed not always advantageous due to memory issues, but templating with different computational operation functors is immensely useful.
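For instance, a sketch of what I mean (illustrative names, not from any particular library): a single kernel templated on an operation functor, instantiated once per operation with no runtime branching.

// Illustrative functors; __device__ call operators so they can run on the GPU.
struct Add { __device__ float operator()(float a, float b) const { return a + b; } };
struct Mul { __device__ float operator()(float a, float b) const { return a * b; } };

// One kernel template covers every elementwise operation.
template <typename Op>
__global__ void apply_op(const float *x, const float *y, float *out, int n, Op op)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = op(x[i], y[i]);
}

// Each instantiation compiles to a separate, fully inlined kernel:
//   apply_op<<<blocks, threads>>>(x, y, out, n, Add());
//   apply_op<<<blocks, threads>>>(x, y, out, n, Mul());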

I think there’s a pretty convincing argument that it’s easier to get a feature added to CUDA than OpenCL.

Very interesting. It would be great to see this sort of thing for more architectures and problems.

As to the portability of OpenCL’s performance, it seems that in the above slides, the same (scalar) kernel did best on both GPUs tested.

It’s my understanding that there is a fixed cost (100-200 lines of code) to basically call the compiler at runtime in OpenCL. After that, things get pretty equivalent, so for larger applications it shouldn’t matter. Is this correct? (I don’t know CUDA, and I’m roughly past those 200 set-up lines in my OpenCL learning curve.)
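For concreteness, the setup I mean boils down to something like this (a condensed sketch with error checks omitted; read_file is a stand-in for reading the .cl source into a string):

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

const char *src = read_file("kernels.cl"); /* stand-in helper */
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);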

You’re right, although NVIDIA’s “OpenCL best practices” has a textbook feel to it.

Is this only true for CUDA and not OpenCL? The CUDA 3.0 download page mentions, e.g. “Fermi HW profiling support for CUDA C and OpenCL in Visual Profiler”. I didn’t notice anything that’s CUDA C exclusive. What are those features?

That’s basically the plan, except I’m not sure about the architecture.

I think my application will be bottlenecked in dense matrix by sparse matrix multiplications.

The sparse matrix would be approximately square of size 2000x2000 or so with maybe 5-10% non-zeros. The dense matrices would be about 2000x(20-100). Because of the sparseness, the memory access would probably be non-coalesced. As far as I know, the more recent NVIDIA GPUs can handle non-coalesced access, but I don’t know about e.g. Radeons, etc.
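To make the access pattern concrete, the kernel I have in mind is roughly this (a naive CSR sketch, not a tuned implementation; one thread per output element, and the transposed orientation is analogous):

// Naive sketch: C (m x n) = A (sparse, CSR, m x k) * B (dense, k x n, row-major).
// For a fixed element e, consecutive `col` threads read consecutive B entries,
// but which rows of B get touched depends on col_idx, so the pattern is irregular.
__global__ void spmm_csr(int m, int n,
                         const int *row_ptr, const int *col_idx, const float *vals,
                         const float *B, float *C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m || col >= n) return;

    float acc = 0.0f;
    for (int e = row_ptr[row]; e < row_ptr[row + 1]; ++e)
        acc += vals[e] * B[col_idx[e] * n + col];
    C[row * n + col] = acc;
}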

By the way, thanks to everyone who replied so far. Much obliged.

In CUDA, calling a kernel is as easy as

kernel<<<n_blocks, n_thread_per_block>>>(arguments)

I haven’t taken the time to learn OpenCL yet, but don’t you need that 100-200 lines of code for every single kernel? Converting a production app that has 100+ kernels seems a daunting task to me.

I’m referring to the new intrinsics like floating point atomic adds, the new __ballot() warp voting mechanism, and new syncthreads variants (see appendix G.1 of the CUDA programming guide).
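A quick sketch of the kind of thing I mean (compute capability 2.0, CUDA 3.0 or later):

__global__ void demo(float *sum, unsigned *ballots, int *count, int value)
{
    // Hardware floating-point atomic add (new on Fermi).
    atomicAdd(sum, 1.0f);

    // Warp vote: bitmask of lanes for which the predicate is true.
    unsigned mask = __ballot(value > 0);

    // Barrier variant that also counts threads with a true predicate.
    int c = __syncthreads_count(value > 0);

    if (threadIdx.x == 0) { *ballots = mask; *count = c; }
}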

“That’s basically the plan, except I’m not sure about the architecture.”

For now, ATI multi-GPU support is not perfect. They may improve it in the future. Actually, CUDA and OpenCL are very similar, except for some CUDA features mentioned above. OpenCL is needed by those who want to target multiple architectures, for example all consumer 3D cards.
Based on your task description, you may benefit from the Fermi caches, because they help with non-coalesced access. Radeons are more vector-oriented, with a large warp size and strict coalescing rules.

No, e.g. if you put all your kernels into one file, you use this boilerplate code to read the file and compile it. Then, to extract a kernel from this “compiled bunch of kernels”, it’s just one line per kernel. Something like:

cl_kernel my_kernel = extract_kernel(my_program, "my_kernel");

where “extract_kernel” needs to only be defined once. This just calls clCreateKernel and checks the error code. This can be made terser still with a simple macro.
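In full, that one-time helper can be as small as this (a sketch; the error handling is up to you):

cl_kernel extract_kernel(cl_program program, const char *name)
{
    cl_int err;
    cl_kernel k = clCreateKernel(program, name, &err);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clCreateKernel(\"%s\") failed: %d\n", name, (int)err);
        exit(EXIT_FAILURE);
    }
    return k;
}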

Setting up the arguments is another matter, though. On page 11, the OpenCL JumpStart Guide shows the syntax for this in CUDA and OpenCL. Both seem atrocious.

CUDA:

cuParamSeti(cuFunction, 0, pDeviceMemA);

cuParamSeti(cuFunction, 4, pDeviceMemB); 

cuParamSeti(cuFunction, 8, pDeviceMemC); 

cuParamSetSize(cuFunction, 12);

OpenCL:

clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void *)&hDeviceMemA); 

clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void *)&hDeviceMemB); 

clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void *)&hDeviceMemC);

The kernel call itself can be abstracted down to 1-2 lines per call. Perhaps setting up the arguments can be done at the call site in one line with some macrology. What seems troubling though is that the above doesn’t look very type-safe.
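For instance, one possible macro (hypothetical, and it only papers over the verbosity, not the type-safety):

// Infers the size from the variable; still no compile-time type check
// against the kernel's actual signature.
#define SET_ARG(kernel, index, var) \
    clSetKernelArg((kernel), (index), sizeof(var), (void *)&(var))

// Usage:
//   SET_ARG(hKernel, 0, hDeviceMemA);
//   SET_ARG(hKernel, 1, hDeviceMemB);
//   SET_ARG(hKernel, 2, hDeviceMemC);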

All this looks fairly equivalent between CUDA and OpenCL, unless the “<<<…>>>” thing obviates the need to set the kernel arguments and does static type-checking. Does it?

The NVIDIA OpenCL Programming Guide, mentions in Appendix A.4 that “Devices of compute capability 1.1 and higher support the cl_khr_global_int32_base_atomics, cl_khr_global_int32_extended_atomics, cl_khr_local_int32_base_atomics, and cl_khr_local_int32_extended_atomics extensions.” I don’t know if it’s the same as what you are talking about though.
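For reference, using one of those extensions in a kernel looks like this (a sketch; note these are 32-bit integer atomics, not floating-point ones):

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void count_positive(__global const int *x, __global int *count)
{
    int i = get_global_id(0);
    if (x[i] > 0)
        atom_add(count, 1); /* 32-bit integer atomic from the extension */
}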

I’m curious about the specifics. What are their multi-GPU problems?

ATI’s top consumer card actually comes with two GPUs (with 2 GB VRAM and 4.6 TFLOPS peak performance per card).

Is there some documentation that discusses the Radeon coalesce rules in detail? I have their “OpenCL programming guide”, but it hardly mentions this topic.

"Is there some documentation that discusses the Radeon coalesce rules in detail? "
It is hard to obtain.
“ATI’s top consumer card actually comes with two GPUs (with 2 GB VRAM and 4.6 TFLOPSs peak performance per card).”
I am not talking about hardware. You may try to search multi gpu opencl applications for Radeon…

Btw, another reason is that OpenCL is now open source. You need to distribute your source code with your program.

What gets you to that conclusion? OpenCL is a specification, not a program.

OpenCL tends to compile source at the start of the program. It has no intermediate format like PTX.

In the paper below, two “identical” kernels are run under both CUDA and OpenCL. See the results on page 73 and 102 for a TDFIR filter and an interpolation kernel.

In the more compute-bound TDFIR kernel, the OpenCL implementation almost manages to catch up to the CUDA implementation, while it lags significantly behind in the (bandwidth-bound) interpolation kernel. There seem to be clear issues with coalescing for the OpenCL version. This is probably a driver issue (we were using beta drivers at the time) and might be resolved.

This was done on a Quadro 5800 and won’t tell you anything about how well it performs cross-platform.

http://www.hpcsweden.se/files/RadarSignalProcessingwithGraphicsProcessors.pdf

You are comparing the CUDA driver API to the OpenCL API. The CUDA runtime API automatically handles kernel arguments correctly and with type checking at compile time.
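Concretely, with the runtime API a launch looks like an ordinary typed function call (a minimal sketch):

// Parameters are declared with real types in the kernel signature...
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// ...so the launch takes arguments like any C++ call; passing, say,
// an int* where a float* is expected fails at compile time:
//   add<<<n_blocks, n_threads_per_block>>>(dA, dB, dC, n);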