I’ve implemented various algorithms in both CUDA and OpenCL (the resulting PTX looks pretty much the same), and I noticed that the overhead of a single OpenCL call is larger than that of the equivalent CUDA call. I also noticed that CUDA uses 100% CPU during the whole execution of host and kernel code, whereas OpenCL lets the CPU idle.
I then stumbled upon cudaDeviceScheduleSpin and understood that CUDA uses one of my CPU cores to actively spin while waiting for the result. The big question now is:
How do I make OpenCL spin too?
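To be explicit about what I mean by "spinning" versus "idling", here is a minimal host-only sketch. It uses plain C++ threads rather than the CUDA or OpenCL runtime, and every name in it (`Completion`, `wait_spinning`, `wait_blocking`, `run_and_wait`) is hypothetical; the worker thread merely stands in for a kernel finishing on the device.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for "the device has finished its work".
struct Completion {
    std::atomic<bool> done{false};
    std::mutex m;
    std::condition_variable cv;

    // Called by the worker when the (simulated) kernel is finished.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(m);
            done.store(true, std::memory_order_release);
        }
        cv.notify_all();
    }
};

// Spin: poll the flag in a tight loop. The waiting core stays busy,
// but the waiter reacts as soon as the flag flips.
inline void wait_spinning(const Completion& c) {
    while (!c.done.load(std::memory_order_acquire)) {
        // busy-wait, intentionally empty
    }
}

// Block: sleep until notified. The core is free in the meantime,
// at the cost of a wake-up through the OS scheduler.
inline void wait_blocking(Completion& c) {
    std::unique_lock<std::mutex> lock(c.m);
    c.cv.wait(lock, [&c] { return c.done.load(std::memory_order_acquire); });
}

// Launch a simulated kernel on a worker thread and wait for it
// with the chosen strategy. Returns true once completion was observed.
inline bool run_and_wait(bool spin) {
    Completion c;
    std::thread worker([&c] {
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        c.signal();
    });
    if (spin) {
        wait_spinning(c);
    } else {
        wait_blocking(c);
    }
    worker.join();
    return c.done.load(std::memory_order_acquire);
}
```

Calling `run_and_wait(true)` keeps the calling core busy until the flag flips, while `run_and_wait(false)` lets it idle. That is roughly the difference I seem to observe between CUDA and OpenCL, although I am only assuming that the two runtimes behave like this internally.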