Developers have been pushing AMD to move ahead of the standard, since it is clear that OpenCL is heading towards C++ kernels in response to developer demand. By implementing it ahead of time, the compilers can become stable by the time the standard is released. This lets us developers code ahead, and lets SDK developers have us beta-test the C++ kernel compiler and shake out most of the bugs. The static kernel language defines the minimal set of functionality that can be supported by older cards (presently on the market, apart from the C2070 and HD7970) with no dynamic black magic. If NV were to adopt the static C++ kernel language in their SDK, it could save us developers heaps of time, plus it would be one more star on NV's plaque.
I know that NV is neglecting OpenCL as much as possible (which is understandable from a certain point of view), but let this be the first feature request in this direction. If you find this a worthwhile extension, feel free to add a comment. Right now I am coding a vector-based mathematical framework, and having to implement every operator for all the type combinations amounts to roughly 1000 kernels. I could reduce this drastically with some macro-hell, but that won't help code readability. Static C++ kernels would solve all my problems automagically, but unfortunately the research group has an NV cluster, so this is no solution.
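Just to illustrate the combinatorics, here is a plain host-side C++ sketch (the name elementwise_add and the operator choice are made up for this example, not part of any SDK): one template stands in for the whole grid of type-combination kernels, which is exactly what a static C++ kernel language would let me do on the device side.

```cpp
// Illustrative only: how templates collapse the operator/type combinatorics.
#include <cstddef>
#include <iostream>
#include <vector>

// One template instead of one hand-written kernel per (T, U) pair.
template <typename T, typename U, typename R = decltype(T{} + U{})>
std::vector<R> elementwise_add(const std::vector<T>& a, const std::vector<U>& b)
{
    std::vector<R> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] + b[i];          // the per-element operator body
    return out;
}

int main()
{
    std::vector<float>  a{1.0f, 2.0f, 3.0f};
    std::vector<double> b{0.5, 0.5, 0.5};
    auto c = elementwise_add(a, b);    // float + double -> double, no extra kernel
    std::cout << c[0] << " " << c[1] << " " << c[2] << "\n";
}
```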
OpenCL is C99 by the spec; if you want C++ you should use CUDA (what AMD defines is essentially CUDA on compute 1.x, where there are no function pointers and thus no virtual functions).
If I wanted to use CUDA, I would do so, but since I do not wish to be tied to a vendor (nor to a platform, for that matter), I don't. OpenGL and OpenCL are the only libraries I wish to use in my programs, because I want to keep my code portable across vendors and platforms. I understand OpenCL is a C99-based language, but there is really big pressure to make C++ the standard language. Now that both big GPU vendors support C++ natively on their HW, I'm sure it will make it into the standard eventually.
I do know that CUDA has its merits, but I would not like to recode stuff when it turns out that one vendor takes a direction I (or the research groups I'm in touch with) would not like to follow. There is no harm in making life easier, and a static C++ kernel language would be neat to use, but nonetheless I won't use it if NV doesn't support it, because that would break code portability.
I'm afraid to tell you that although you can run OpenCL on both NVIDIA and AMD, and on CPUs for that matter, you will have to recode stuff in order to move to another platform.
Once you aim for 90-100% performance you need to take the iron into account, and that is the reason we are in the GPU game in the first place. The portability of OpenCL is just an illusion. I have not run into a single piece of code that ran optimally on all platforms without changes, and I have seen large pieces of code that produced wrong results on different platforms, or didn't run at all on some of them.
There are various advantages of OpenCL over CUDA; portability is not one of them. I did, on the other hand, run into numerous issues with OpenCL, both in the spec and in the implementations; C++ is the least of them.
When you are trying to venture into the region of 90-100% performance, it is true that OpenCL can be cumbersome, and it is definitely not portable from there on. However, there have been numerous papers on this topic (going back to the Fortran age) discussing whether performance or portability should come first when writing code, and the general conclusion is that portability pays off better in the long run.
OpenCL is a fast-evolving language, so I know it is extremely rare to get a piece of code that works just fine on all platforms; some vendors are better at implementing things, and some violate the standard at points… this is because OpenCL is not even 3 years old and already on the 3rd spec of its life, whereas CUDA is easier to keep in hand. (Now that it has become an open compiler, nvcc will surely change too.)
It is possible to write portable code; you don't have to tell me it's impossible. I write programs that compile on Win7/Linux/OSX out of the box, or at least I try to (gcc is the biggest pain, I must admit). Portability will become easier in time; for now I just have to live with its difficulties.
I have also found that GPU-optimized code, thanks to its heavy vectorization and to the fact that OpenCL compilers (especially Intel's) are very adept at (re-)vectorizing code, performs better than CPU-optimized code. (I'm not a guru of CPU optimization, so the statement really is: better than naively optimized CPU code.) I am aware that I most likely sacrifice 5-10% of performance, but it pays off in development time and in the fact that my code is portable.
I know there are issues, but the AMD and Intel runtimes evolve the fastest. It is only natural, then, that NV will need to put a little more effort into developing OpenCL.
Sorry for going a little off topic, but I'd like to know what the problem was, since I might be using OpenCL for a long-term project soon. Was it just bugs in different OpenCL implementations, or different extensions not being supported?
Things that make code not run at all are mainly due to the following:
Bugs and quirks in different implementations
Reliance on warp-level synchronization - anything that needs reductions, prefix sums, etc. to work well. AMD is really bad in that respect, as the warp size (wavefront in AMD terminology) is not constant across the board and there is no extension I know of to find out what it is (either 64 or 32 at the moment); see the query sketch after this list.
Differences in hardware, mainly local memory sizes and textures (which are emulated or non-existent on CPUs)
Missing extensions (texture types, printf, which only exists on some platforms, the existence of atomic instructions, the existence and performance of double precision, intrinsic math functions)
Pinned and mapped memory (and actual memory allocation, which is a gray area in OpenCL) - for example, allocating very large buffers or a large set of buffers whose total size exceeds GPU memory. AMD generally still has very big issues with CPU <-> GPU memory transfers. Even NVIDIA differs for concurrent copy and execute between Fermi Quadro/Tesla and pre-Fermi or GeForce parts (dual vs single copy engines)
Compiler issues - AMD requires ILP for full performance; it needs 5 or 4 (depending on the architecture) independent instructions to feed the GPU, and compute 2.1 needs it as well, though to a different extent. Intel on the CPU (and possibly AMD too) does stranger things with vector instructions, plus tricks to improve instruction cache usage and cache usage in general
Different optimal and maximal runtime topology (grid and workgroup dimensions) - even with NVIDIA, compute 1.x has a 2D grid and compute 2.x has a 3D grid. The optimal workgroup size on the CPU is 1 according to the documentation (I didn't experiment enough on that platform), the maximum on AMD is 256 if memory serves, and NVIDIA's changes with compute capability
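A minimal sketch (assuming only the standard OpenCL host API with CL/cl.h, no vendor extensions) of how some of these limits can be probed at runtime before choosing a code path: local memory size, max work-group size, the extensions string, and the per-kernel preferred work-group size multiple, which on current implementations usually comes back as the warp/wavefront size, though the spec does not promise that.

```cpp
// Sketch: query device limits with the standard OpenCL host API.
#include <CL/cl.h>
#include <cstdio>

int main()
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    cl_ulong local_mem = 0;
    size_t   max_wg    = 0;
    char     exts[4096] = {0};
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(exts), exts, NULL);

    // Build a throwaway kernel so we can ask for its preferred work-group size
    // multiple (OpenCL 1.1+); today this is usually the warp/wavefront size
    // (32 or 64), but the spec does not guarantee that interpretation.
    const char* src =
        "__kernel void noop(__global float* p) { p[get_global_id(0)] += 1.0f; }";
    cl_int err = CL_SUCCESS;
    cl_context ctx  = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, "", NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "noop", &err);

    size_t wg_multiple = 0;
    clGetKernelWorkGroupInfo(k, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(wg_multiple), &wg_multiple, NULL);

    printf("local mem: %llu bytes, max work-group: %zu, wg-size multiple: %zu\n",
           (unsigned long long)local_mem, max_wg, wg_multiple);
    printf("extensions: %s\n", exts);

    clReleaseKernel(k);
    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}
```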
Performance-wise there are issues related to hardware features, cache sizes and types (the CPU doesn't have texture, constant or local memory caches), textures are emulated on the CPU, and L1/L2 cache sizes and even their existence vary. Even the performance of these varies, as AMD uses a different path for reads and writes, and their constant cache is very different from NVIDIA's.
IEEE math compliance differs between architectures and can affect precision (although OpenCL theoretically requires it, and also requires that FMA be disabled by default to stay compliant, because of differences in intermediate bit depth between platforms, or even on the CPU with and without SSE).
Even the VLIW width is different on different machines. NVIDIA is 1-way, AMD is 4- or 5-way (depending on the architecture), and the CPU is 128-bit (I don't think the OpenCL compiler supports 256-bit AVX yet), splitting according to type size: float is 4-way, char is 16-way.
Even integer performance is very different across architectures (compute 1.x likes 24-bit multiplies, Fermi runs integer at half throughput, and I know AMD is even worse, though I'm not sure of the details; I'd be happy for some clarification here).
I write different code for compute 1.x and 2.x (L1 and L2 caches, TPCs vs big multiprocessors, etc.); compute 2.1 requires more work than 2.0 (with 2 instruction schedulers feeding 3x16 cores, it needs ILP to utilize all 48 cores), and 1.2/1.3 can take shortcuts that 1.0/1.1 can't (coalescing buffers).
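To show what I mean by ILP (the AMD VLIW slots and the compute 2.1 dual schedulers), here is the pattern in plain C++ rather than kernel code, with names of my own choosing: one dependent accumulator chain versus four independent partial sums that the hardware can issue in parallel.

```cpp
// Plain C++ illustration (not kernel code) of the ILP pattern: breaking one
// dependent accumulator chain into several independent ones so a VLIW4 /
// dual-scheduler machine has enough independent instructions in flight.
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

// Dependent chain: each add must wait for the previous one.
float sum_serial(const std::vector<float>& v)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];
    return s;
}

// Four independent partial sums: the adds within one iteration do not depend
// on each other, so the hardware (or compiler) can issue them in parallel.
float sum_ilp4(const std::vector<float>& v)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i + 0];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i)          // leftover tail
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}

int main()
{
    std::vector<float> v(1000);
    std::iota(v.begin(), v.end(), 1.0f);
    std::cout << sum_serial(v) << " " << sum_ilp4(v) << "\n"; // both print 500500
}
```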
AMD is going to fully change their hardware in the next iteration as well (AFAIK they are moving to a non-VLIW architecture like NVIDIA's, unlike what current AMD hardware and the CPU do now).
This turned into a big essay, and it's just the tip of the iceberg; I haven't even started to venture into the wonderful realm of APUs.
I think that more or less the only piece of code I can write almost cross-platform, with only two or three versions, is convolution (ignoring memory transfers).
OpenCL reminds me of the joke that a camel is a horse designed by a committee. It has all the right pieces, just in all the wrong places.
Writing programs that compile on Win7/Linux/OSX is relatively easy. Writing cross-platform OpenCL code is a pain.
As for OpenCL evolving fast, it means that some code is not forward or backward compatible, and that there is a lot of deviation from the standard, which is pretty gray in several areas to begin with.
As for OpenCL on Intel, I can show you a few papers and real-life codes that show how well (or not) automatic vectorization works and how hard you have to work to actually make it kick in. What you are seeing, more than automatic vectorization, is automatic multi-threading (you are using all your cores rather than the one you would be using with serial code). I've spent a fair amount of time on how hard you need to work to get the Intel compiler to auto-optimize and how to give it the appropriate hints and flags; we have a guy in our company who actually wrote a whole course on using the Intel compiler properly.
I'm aware of the portability vs. optimization trade-off; both are very hard to achieve and impossible to achieve together. That is why you should use libraries and update them per architecture. From experience, you probably sacrifice closer to 80-90% rather than 5-10%.
If you look, for example, at libgoto for BLAS, there are different versions for the same generation of Xeon processors depending on how many ways associative the cache is.
There are however two very important guidelines for optimization vs portability:
Keep It Simple, Stupid (KISS)
Use a (good) profiler to determine when and where it is worthwhile to break the first rule
If you look at the NVIDIA reduction example, you'll see that there is a factor of 3x or 4x that comes from non-portable optimizations (code that will deadlock or give wrong answers on other platforms).
At the end of the day though, that is why I have a job, for when you need to call in the ninjas to do the surgery at night and leave you with fast code.
And if anyone I know is reading this, I probably gave away my true identity with this post ;-)