Migrating from CUDA to OpenCL - higher register consumption


I’ve just migrated my program from CUDA to OpenCL. It took a bit of work to change all the host code: device initialization, memory allocation, kernel execution, etc.

For the device code (kernels), however, the changes were very small. They boiled down to replacing __syncthreads() with barrier(CLK_LOCAL_MEM_FENCE) and changing the shared memory allocation from:
(CUDA) __shared__ float var_in_shared[100];
to:
(OpenCL - in the kernel function signature) __local float * var_in_shared;
(and in clSetKernelArg for this argument: size = 100*sizeof(float) and value = NULL)

Those were the only changes made.

Here the story of a nice and easy migration ends: the OpenCL version is nearly 2 times slower.
I investigated the problem and found that the kernels in the CUDA version use far fewer registers than the same kernels in the OpenCL version! The higher register count lowers occupancy, i.e. fewer blocks can be active on a multiprocessor at once.
For example:
-cuda: 18 registers
-opencl: 39 registers

-cuda: 14
-opencl: 25
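
To show how these register counts translate into active blocks, here is a rough sketch of the arithmetic in C. The per-multiprocessor limits are the documented ones for compute capability 1.3 (16384 registers, 1024 resident threads, 8 resident blocks). The function name resident_blocks and the block size of 256 threads used in the note below are assumptions for illustration only; the sketch also ignores register allocation granularity and shared memory limits, so real occupancy may be lower.

```c
/* Per-multiprocessor limits for compute capability 1.3 (sm_13). */
enum {
    REGS_PER_SM        = 16384, /* 32-bit registers */
    MAX_THREADS_PER_SM = 1024,  /* resident threads */
    MAX_BLOCKS_PER_SM  = 8      /* resident blocks  */
};

/*
 * Hypothetical helper: estimates how many blocks can be resident on one
 * multiprocessor, limited by registers, threads, and the block cap.
 * Simplification: ignores register allocation granularity and shared
 * memory usage. Both arguments must be positive.
 */
static int resident_blocks(int regs_per_thread, int threads_per_block)
{
    int by_regs    = REGS_PER_SM / (regs_per_thread * threads_per_block);
    int by_threads = MAX_THREADS_PER_SM / threads_per_block;
    int blocks     = by_regs < by_threads ? by_regs : by_threads;

    return blocks < MAX_BLOCKS_PER_SM ? blocks : MAX_BLOCKS_PER_SM;
}
```

With an assumed block size of 256 threads, resident_blocks(18, 256) gives 3 and resident_blocks(39, 256) gives 1 for the first kernel, and resident_blocks(14, 256) gives 4 versus resident_blocks(25, 256) giving 2 for the second. Under that assumption, the OpenCL build keeps at most half as many blocks in flight.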

The kernels were compiled for the sm_13 architecture in both the CUDA and OpenCL builds.
I haven’t used any non-default optimization options (fast math, etc.) in either CUDA or OpenCL.

I took the register counts from -Xptxas -v (CUDA) and from the build log returned by clGetProgramBuildInfo (OpenCL).

What is causing such increased register consumption? Is it just a poor OpenCL compiler?