It looks like the current NVIDIA OpenCL compiler completely ignores the ‘inline’ keyword. I have recently been investigating a large (~2x) performance drop in my OpenCL code on a Tesla C1060 device. It turned out that calling a function from two places in a .cl file caused the OpenCL compiler not to inline it, even though it was declared with the ‘inline’ keyword. The CUDA version of the code has no such problem. As a temporary workaround, I had to inline the function manually, which is ugly and which I would like to avoid in the future.
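One possible way to make the manual inlining less ugly is a function-like macro, since the preprocessor pastes the body at every use no matter what the compiler decides about ‘inline’. The sketch below is plain C syntax that should carry over to OpenCL C. The names `SCALE_ADD`, `first_use` and `second_use` are hypothetical stand-ins, not taken from my actual code, and I have not measured whether this recovers the lost performance.

```c
/* Hypothetical helper written as a function-like macro: the preprocessor
   expands it textually at each use, so no inlining decision is involved.
   Arguments are parenthesised to keep operator precedence intact; each
   argument should be free of side effects because it is pasted textually. */
#define SCALE_ADD(a, x, y) ((a) * (x) + (y))

/* Two hypothetical call sites, standing in for the two places
   in the .cl file where the helper is used. */
static float first_use(float a, float x, float y)
{
    return SCALE_ADD(a, x, y);
}

static float second_use(float a, float x, float y)
{
    return SCALE_ADD(a, x, y) + SCALE_ADD(a, y, x);
}
```

The usual macro caveats apply: there is no type checking of the arguments, and a longer helper body becomes harder to read and debug than a real function.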
Has anyone had similar experiences with their OpenCL code? Is there some other way to force the compiler to inline selected functions?