OpenCL performance difference Linux/Windows

Recently, I compiled a C++ program that uses OpenCL on a machine with a GT560 and an Intel i5. The program source and the hardware are identical on both systems, yet OpenCL performs very poorly on Windows and runs quite fast on Linux. Checking the details with nvvp (the NVIDIA Visual Profiler) showed that the running times of the OpenCL kernels themselves are identical on both systems, as expected since the kernel source is the same. However, for some reason neither the CPU nor the GPU seems to do much to drive the computation forward between kernel launches. The overhead of launching a kernel appears to be much higher on Windows than on Linux. Is this correct? Can anything be done to improve the poor performance on Windows?

Furthermore, even though I create the command queue with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, the NVIDIA OpenCL driver does not dispatch kernels out of order: the next kernel always waits for the previous kernel to finish, even though the kernels are independent. I also tried using more than one command queue, with the same result: kernels execute strictly sequentially, with no overlapping or re-ordering. In particular, the time needed to initiate each kernel stays exactly the same no matter what I do, so the per-call overhead remains high, especially on Windows.

What I find particularly annoying is that, even though the algorithm allows for quite a lot of parallelism and the kernel itself is rather fast, the overall overhead kills any performance benefit. The final application is only a little faster than a purely multi-core CPU implementation on Linux, and on Windows it is even slower.

Any hints or tips to improve the situation?