OpenCL compiler ignores inline keyword

It looks like the current NVIDIA OpenCL compiler completely ignores the ‘inline’ keyword. While investigating a large (~2x) performance drop in my OpenCL code on a Tesla C1060 device, I found that calling a function from two places in a .cl file caused the OpenCL compiler not to inline it, even though it was declared with the ‘inline’ keyword. The CUDA version of the code shows no such problem. As a temporary workaround, I inlined the function manually [1], which is ugly and which I would like to avoid in the future.

Has anyone had similar experiences with their OpenCL code? Is there some other way to force the compiler to inline selected functions?

[1] http://gitorious.org/sailfish/sailfish/com…a82cdf2b28797a1
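One workaround that does not depend on the compiler's inlining decisions is a function-like macro, since the preprocessor expands it at every call site. Below is a minimal sketch in plain C so that it can be compiled on the host. The accessor, its parameters, and the memory layout are all hypothetical and are not taken from the actual code. In a .cl file the pointer would also carry an address space qualifier such as __global.

```c
/* Hypothetical accessor: reads one value from a flat array laid out
 * as [direction][node]. Not the real function from the project. */
static inline float get_dist_fn(const float *dist, int node, int dir, int n_nodes)
{
    return dist[dir * n_nodes + node];
}

/* Macro equivalent: expanded textually by the preprocessor, so no
 * function call can remain regardless of what the compiler decides.
 * Arguments are parenthesized to keep operator precedence intact. */
#define GET_DIST(dist, node, dir, n_nodes) \
    ((dist)[(dir) * (n_nodes) + (node)])
```

The usual macro caveats apply: arguments with side effects are evaluated as written at each use, and there is no type checking on the parameters.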

I’m pretty sure everything is inlined by the OpenCL compiler (as in CUDA), so I’m surprised this makes such a difference. If you can provide a full repro case we can file a bug.

You can reproduce the problem using the publicly available code of my sailfish project, which is hosted at Gitorious. Here is a sample command line session illustrating the steps (run on a GTX 280):

$ git clone git://gitorious.org/sailfish/sailfish.git
$ ./lbm_ldc.py --benchmark --backend=opencl --lat_w=512 --lat_h=512
Using the "opencl" backend.
# iters mlups_avg mlups_curr
1000 532.18 532.18
2000 529.43 526.68
^C
$ git show cf4be78c01a36d2f6c506974aa82cdf2b28797a1 | patch -R
$ ./lbm_ldc.py --benchmark --backend=opencl --lat_w=512 --lat_h=512
Using the "opencl" backend.
# iters mlups_avg mlups_curr
1000 298.72 298.72
2000 299.75 300.78

The second column (mlups_avg; MLUPS = Million Lattice site Updates Per Second) shows the performance decrease, from about 530 to about 300 MLUPS. The patch being reverted is the one I linked to in my previous post; it manually inlines the getDist() function inside the caller.
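For reference, an MLUPS figure follows from the lattice size, the number of iterations, and the elapsed wall-clock time. A minimal sketch in C. The function names are mine, chosen for illustration, and this is not necessarily how sailfish computes the value internally:

```c
/* MLUPS = (lattice sites * iterations) / (elapsed seconds * 1e6).
 * Hypothetical helper, not part of sailfish. */
double mlups(long lat_w, long lat_h, long iters, double seconds)
{
    return (double)lat_w * (double)lat_h * (double)iters / (seconds * 1.0e6);
}

/* Ratio between two throughput figures; values above 1.0 mean the
 * second configuration is slower. */
double slowdown(double fast_mlups, double slow_mlups)
{
    return fast_mlups / slow_mlups;
}
```

Applied to the averages after 2000 iterations above, slowdown(529.43, 299.75) gives a factor of roughly 1.77.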