I believe that by default the CPU thread waiting on a GPU kernel keeps polling the GPU (spin-waiting) so that it detects the kernel's termination as soon as possible. That polling keeps a CPU core busy. It is reminiscent of how old DOS programs polled the keyboard for a key press instead of waiting for the keyboard interrupt.
I call cudaSetDeviceFlags(cudaDeviceBlockingSync) during GPU initialization and observe about 10% CPU utilization while my kernels execute on the GPU. (cudaDeviceBlockingSync is the older, deprecated name; current toolkits call the same flag cudaDeviceScheduleBlockingSync.) I'm not sure whether the cudaDeviceScheduleYield flag would be more appropriate. I don't know how much longer the CPU thread takes to detect kernel completion in blocking-sync mode; in my case, with a large number of complex kernels, the extra delay seems relatively small. My platform is Linux, but I don't see why Windows would behave any differently in this respect.
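For anyone who wants to try this, here is a minimal sketch of the setup described above. It is not my actual code: the kernel busyKernel, the CHECK macro, and the launch sizes are hypothetical placeholders, and the only part that matters is the cudaSetDeviceFlags call made before any other CUDA work.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical placeholder kernel: does enough arithmetic that the
// host-side wait is long enough to watch in a CPU monitor.
__global__ void busyKernel(float *out, int iters)
{
    float x = static_cast<float>(threadIdx.x);
    for (int i = 0; i < iters; ++i)
        x = x * 1.0000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    const int blocks = 64, threads = 256;

    // Ask the runtime to block the waiting host thread on a
    // synchronization primitive instead of spinning. Call this early,
    // before other runtime calls; older toolkit versions reject the
    // call once the device context is already active.
    CHECK(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync));

    float *d_out = nullptr;
    CHECK(cudaMalloc(&d_out, blocks * threads * sizeof(float)));

    busyKernel<<<blocks, threads>>>(d_out, 1 << 20);
    CHECK(cudaGetLastError());

    // The host thread waits here; with blocking sync it should sleep
    // rather than poll until the kernel finishes.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(d_out));
    return 0;
}
```

To compare the modes, swap the flag for cudaDeviceScheduleSpin or cudaDeviceScheduleYield, rebuild with nvcc, and watch the process in top while the kernel runs. The CPU figures will depend on your kernels and system, so treat my 10% as one data point only.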