Hi board,
I’ve implemented various algorithms in both CUDA and OpenCL (the generated PTX looks pretty much the same), but I’ve noticed that the overhead of a single OpenCL call is larger than that of the corresponding CUDA call. I also noticed that CUDA uses 100% CPU during the whole execution of host and kernel code, whereas OpenCL lets the CPU idle.
I stumbled upon cudaDeviceScheduleSpin (a flag for cudaSetDeviceFlags) and understood that CUDA is using one of my CPU cores to actively spin while waiting for the result. The big question now is:
How do I make OpenCL spin too?
Documentation for CUDA here: http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/online/group__CUDART__DEVICE_g18074e885b4d89f5a0fe1beab589e0c8.html#g18074e885b4d89f5a0fe1beab589e0c8
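To make clear what I mean by spinning versus idling, here is a rough host-only sketch in plain C. It is only an illustration of the two waiting strategies, not how the CUDA or OpenCL drivers work internally: a pthread stands in for the GPU, and fake_kernel, wait_spinning, wait_blocking and run_wait are all made-up names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the GPU: a worker thread that sets a
 * completion flag when its "kernel" is finished. */
static atomic_int done_flag;
static int done_blocking;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void *fake_kernel(void *arg) {
    (void)arg;
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 1000000UL; i++) {
        sink += i; /* pretend to do some work */
    }
    /* Publish completion for the spinning waiter. */
    atomic_store_explicit(&done_flag, 1, memory_order_release);
    /* Publish completion for the blocking waiter and wake it up. */
    pthread_mutex_lock(&lock);
    done_blocking = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Spin: poll the flag in a tight loop. The host sees completion almost
 * immediately, but one core sits at 100% the whole time. */
static void wait_spinning(void) {
    while (!atomic_load_explicit(&done_flag, memory_order_acquire)) {
        /* busy-wait */
    }
}

/* Block: sleep until signalled. The CPU idles, but waking the thread
 * up again adds latency to every wait. */
static void wait_blocking(void) {
    pthread_mutex_lock(&lock);
    while (!done_blocking) {
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}

/* Launch the fake kernel and wait for it with the chosen strategy.
 * Returns 1 once completion has been observed through both channels. */
int run_wait(int spin) {
    pthread_t worker;
    atomic_store(&done_flag, 0);
    done_blocking = 0;
    int rc = pthread_create(&worker, NULL, fake_kernel, NULL);
    assert(rc == 0);
    (void)rc;
    if (spin) {
        wait_spinning();
    } else {
        wait_blocking();
    }
    pthread_join(worker, NULL);
    return atomic_load(&done_flag) && done_blocking;
}
```

Calling run_wait(1) from a main function (compiled with -pthread) shows the CUDA-like behaviour I am seeing, and run_wait(0) the OpenCL-like one. What I am after is the first behaviour on the OpenCL side.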