Have a look in the CUDA Toolkit Reference Manual at the section about cudaSetDeviceFlags. AFAIK OpenCL doesn’t expose a matching set of flags, so scheduling is probably left on auto, and the heuristics have probably changed:
cudaDeviceScheduleAuto: The default value if the flags parameter is zero, uses a heuristic based on the
number of active CUDA contexts in the process C and the number of logical processors in the system P. If C >
P, then CUDA will yield to other OS threads when waiting for the device, otherwise CUDA will not yield while
waiting for results and actively spin on the processor.
cudaDeviceScheduleSpin: Instruct CUDA to actively spin when waiting for results from the device. This can
decrease latency when waiting for the device, but may lower the performance of CPU threads if they are
performing work in parallel with the CUDA thread.
cudaDeviceScheduleYield: Instruct CUDA to yield its thread when waiting for results from the device. This
can increase latency when waiting for the device, but can increase the performance of CPU threads performing
work in parallel with the device.
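On the CUDA side you can set the policy explicitly instead of relying on the auto heuristic. A minimal sketch (untested here, needs a CUDA-capable build; the flag must be set before the context is created):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    /* Ask CUDA to yield the host thread while waiting on the device,
       instead of busy-spinning. Must be called before the first runtime
       call that implicitly creates the context. */
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleYield);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    /* ... launch kernels; cudaDeviceSynchronize() will now yield
       to other OS threads while waiting for results. */
    return 0;
}
```

Swap in cudaDeviceScheduleSpin if you want the low-latency busy-wait behavior instead.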
AMD probably always yields, while on modern processors (where P is usually large enough that C <= P) it seems that CUDA — and I’m guessing OpenCL behaves the same — will almost always spin rather than yield.
You can try creating more contexts than logical processors and see if that changes the behavior (assuming that OpenCL uses the same heuristic and doesn’t just always spin, or share a single context under the hood).
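A rough sketch of that experiment in OpenCL (hypothetical, error handling omitted; the idea is to push the context count C above the processor count P and then watch host CPU usage during a blocking wait):

```c
#include <CL/cl.h>
#include <stdio.h>

#define NCTX 64  /* more contexts than most machines have logical processors */

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_context ctx[NCTX];

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Create many contexts in this process, then run the normal workload
       in one of them. During a blocking clFinish(), near-100% host CPU on
       the waiting thread suggests spinning; near-0% suggests yielding. */
    for (int i = 0; i < NCTX; ++i)
        ctx[i] = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    /* ... enqueue work, call clFinish(queue), observe host CPU load ... */

    for (int i = 0; i < NCTX; ++i)
        clReleaseContext(ctx[i]);
    return 0;
}
```

If the behavior doesn’t change with NCTX, the implementation is probably ignoring the C > P heuristic entirely.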