I’m just starting to learn OpenCL and wondering if I should be learning CUDA instead of or in addition to OpenCL.
I understand that OpenCL is more cross-platform and is supported by both AMD and NVIDIA, and it runs on both CPUs and GPUs.
However, are there reasons to prefer CUDA? Does it have any advantages at this point? It seems that even NVIDIA is subtly pushing developers to OpenCL, judging by the language of their programming guides.
For example, if you plan to build your own personal supercomputer based on NVIDIA GPUs, CUDA is the best choice. You can also target the Fermi architecture, which is more advanced than ordinary GPUs. And developers often write a CUDA version for NVIDIA GPUs and an OpenCL version for Radeons, because they need to adapt the program for Radeons anyway.
CUDA already supports most of Fermi’s new bells and whistles.
And OpenCL’s multi-platform support is overrated. Most experts I’ve talked to say that you need to write separate kernels specifically optimized for NVIDIA GPUs, ATI GPUs, and CPUs to get decent performance out of each architecture.
Have they quantified that at all? One could argue that templating is over-rated, since to get optimised memory transactions, you’re going to need a different kernel for uchar as opposed to float.
Templating by data type is indeed not always advantageous, due to memory issues. But templating with different computational operation functors is immensely useful.
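For instance, something like this, as a minimal CUDA sketch (the Plus/Times functors and the transform kernel are made-up names, not from any library):

    // Illustrative functors; anything with the same call signature plugs in.
    struct Plus  { __device__ float operator()(float a, float b) const { return a + b; } };
    struct Times { __device__ float operator()(float a, float b) const { return a * b; } };

    // One kernel source, instantiated per functor at compile time.
    template <typename Op>
    __global__ void transform(const float *x, const float *y, float *out, int n, Op op)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = op(x[i], y[i]);
    }

    // Usage:
    //   transform<<<blocks, threads>>>(x, y, out, n, Plus());
    //   transform<<<blocks, threads>>>(x, y, out, n, Times());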
It’s my understanding that there is a fixed cost (100-200 lines of code) to basically call the compiler at runtime in OpenCL. After that, things get pretty equivalent, so for larger applications it shouldn’t matter. Is this correct? (I don’t know CUDA, and I’m roughly past those 200 set-up lines in my OpenCL learning curve.)
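Roughly, the runtime-compilation part boils down to calls like these (error handling and context/device setup omitted; the variable names are mine):

    // Assumes context, device, and the kernel source string already exist.
    cl_int err;
    cl_program program = clCreateProgramWithSource(context, 1, &source, &source_len, &err);
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    // ...plus clGetProgramBuildInfo to fetch the build log when the build fails.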
You’re right, although NVIDIA’s “OpenCL best practices” has a textbook feel to it.
Is this only true for CUDA and not OpenCL? The CUDA 3.0 download page mentions, e.g. “Fermi HW profiling support for CUDA C and OpenCL in Visual Profiler”. I didn’t notice anything that’s CUDA C exclusive. What are those features?
That’s basically the plan, except I’m not sure about the architecture.
I think my application will be bottlenecked by dense-matrix-by-sparse-matrix multiplications.
The sparse matrix would be square, roughly 2000x2000, with maybe 5-10% non-zeros. The dense matrices would be about 2000x(20-100). Because of the sparseness, the memory access would probably be non-coalesced. As far as I know, the more recent NVIDIA GPUs can handle non-coalesced access, but I don’t know about e.g. Radeons.
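For concreteness, a rough CUDA sketch of a product with those shapes (CSR storage for the sparse factor; all names are illustrative):

    // C = S * D, with S sparse (CSR, m x k), D dense (k x n, row-major).
    // One thread per element of C. Which row of D gets read is driven by
    // col_idx, so those reads jump around irregularly.
    __global__ void spmm_csr(const int *row_ptr, const int *col_idx,
                             const float *val, const float *D, float *C,
                             int m, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= m || col >= n) return;

        float acc = 0.0f;
        for (int e = row_ptr[row]; e < row_ptr[row + 1]; ++e)
            acc += val[e] * D[col_idx[e] * n + col];
        C[row * n + col] = acc;
    }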
By the way, thanks to everyone who replied so far. Much obliged.
I haven’t taken the time to learn OpenCL yet, but don’t you need those 100-200 lines of code for every single kernel? Converting a production app that has 100+ kernels seems like a daunting task to me.
I’m referring to the new intrinsics like floating point atomic adds, the new __ballot() warp voting mechanism, and new syncthreads variants (see appendix G.1 of the CUDA programming guide).
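A contrived kernel touching all three (requires compute capability 2.0, i.e. Fermi):

    __global__ void fermi_demo(const float *in, float *total, int *n_positive, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;

        // Single-precision atomic add on global memory (new with compute 2.0).
        atomicAdd(total, v);

        // Warp vote: bit k of 'ballot' is set iff lane k's predicate holds.
        unsigned int ballot = __ballot(v > 0.0f);

        // Barrier that also returns how many threads in the block passed
        // the predicate.
        int block_positives = __syncthreads_count(v > 0.0f);

        if (threadIdx.x == 0)
            atomicAdd(n_positive, block_positives);

        (void)ballot;  // vote result unused in this toy example
    }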
“That’s basically the plan, except I’m not sure about the architecture.”
For now, ATI’s multi-GPU support is not perfect; they may improve it in the future. Actually, CUDA and OpenCL are very similar, apart from the CUDA features mentioned above. OpenCL is needed by those who want to target multiple architectures, for example all consumer 3D cards.
Based on your task description, you may benefit from the Fermi caches, because they help with non-coalesced access. Radeons are more vector-oriented, with a large warp size and strict coalescing rules.
No. If you put all your kernels into one file, for example, you use this boilerplate code once to read the file and compile it. Then, extracting a kernel from this “compiled bunch of kernels” is just one line per kernel. Something like:
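    // Hypothetical kernel names; one line each:
    cl_kernel scale_kernel  = extract_kernel(program, "scale");
    cl_kernel reduce_kernel = extract_kernel(program, "reduce");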
where extract_kernel only needs to be defined once. It just calls clCreateKernel and checks the error code. This can be made terser still with a simple macro.
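For completeness, a sketch of what that helper might look like (exiting on error for brevity):

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    cl_kernel extract_kernel(cl_program program, const char *name)
    {
        cl_int err;
        cl_kernel kernel = clCreateKernel(program, name, &err);
        if (err != CL_SUCCESS) {
            fprintf(stderr, "clCreateKernel(\"%s\") failed with error %d\n", name, (int)err);
            exit(EXIT_FAILURE);
        }
        return kernel;
    }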
Setting up the arguments is another matter, though. On page 11, the OpenCL JumpStart Guide shows the syntax for this in CUDA and OpenCL. Both seem atrocious.
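To illustrate the OpenCL side (buffer and kernel names invented):

    // One call per argument: sizes by hand, everything through a void*.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &input_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &output_buf);
    clSetKernelArg(kernel, 2, sizeof(cl_int), &n);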
The kernel call itself can be abstracted down to 1-2 lines per call. Perhaps setting up the arguments can be done at the call site in one line with some macrology. What seems troubling though is that the above doesn’t look very type-safe.
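For the one-line-per-argument idea, one possible bit of macrology; it shortens the calls but, as suspected, adds no type safety:

    // sizeof is inferred from the variable, but a wrong type still compiles.
    #define SET_ARG(k, idx, var) clSetKernelArg((k), (idx), sizeof(var), &(var))

    SET_ARG(kernel, 0, input_buf);
    SET_ARG(kernel, 1, output_buf);
    SET_ARG(kernel, 2, n);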
All this looks fairly equivalent between CUDA and OpenCL, unless the “<<<…>>>” thing obviates the need to set the kernel arguments and does static type-checking. Does it?
The NVIDIA OpenCL Programming Guide, mentions in Appendix A.4 that “Devices of compute capability 1.1 and higher support the cl_khr_global_int32_base_atomics, cl_khr_global_int32_extended_atomics, cl_khr_local_int32_base_atomics, and cl_khr_local_int32_extended_atomics extensions.” I don’t know if it’s the same as what you are talking about though.
I’m curious about the specifics. What are their multi-GPU problems?
ATI’s top consumer card actually comes with two GPUs (with 2 GB VRAM and 4.6 TFLOPS peak performance per card).
Is there some documentation that discusses the Radeon coalesce rules in detail? I have their “OpenCL programming guide”, but it hardly mentions this topic.
"Is there some documentation that discusses the Radeon coalesce rules in detail? "
It is hard to obtain.
“ATI’s top consumer card actually comes with two GPUs (with 2 GB VRAM and 4.6 TFLOPS peak performance per card).”
I am not talking about hardware. You may try searching for multi-GPU OpenCL applications for Radeon…
In the paper below, two “identical” kernels are run under both CUDA and OpenCL. See the results on pages 73 and 102 for a TDFIR filter and an interpolation kernel.
In the more compute-bound TDFIR kernel, the OpenCL implementation almost manages to catch up to the CUDA implementation, while it lags significantly behind in the interpolation kernel (which is bandwidth bound). There seem to be clear issues with coalescing in the OpenCL version. This is probably a driver issue (we were using beta drivers at the time) and might be resolved.
This was done on a Quadro 5800 and won’t tell you anything about how well it performs cross-platform.
You are comparing the CUDA driver API to the OpenCL API. The CUDA runtime API automatically handles kernel arguments correctly and with type checking at compile time.
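Concretely, a trivial sketch (the scale kernel and d_data are made-up names):

    __global__ void scale(float *data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    void run(float *d_data, int n)
    {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        // An ordinary typed call: the compiler checks the argument types;
        // no per-argument marshalling through void pointers.
        scale<<<blocks, threads>>>(d_data, n, 2.0f);
    }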