Does cudaMemcpy burn CPU while it waits, or does it let other threads run? Same question if I call a kernel synchronously: the current thread is blocked, of course, but is it playing nice and sleeping, or is it in a hard loop burning CPU?
I'm asking because my CPU load seems to go UP when I move an algorithm onto the GPU instead of running it on the CPU alone…
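One way to tell which kind of wait a blocking call performs is to compare the CPU time used by the waiting thread against the wall-clock time that passes. The sketch below is not CUDA code. It assumes a POSIX system (clock_gettime with CLOCK_THREAD_CPUTIME_ID) and uses a hypothetical worker thread that just sleeps as a stand-in for the GPU. It then contrasts a spin loop with a condition-variable wait, and the function name waiting_cpu_fraction is made up for illustration.

```cpp
// Sketch, assuming POSIX clock_gettime(CLOCK_THREAD_CPUTIME_ID).
// A sleeping worker thread is a hypothetical stand-in for the GPU doing its work.
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <time.h>

// CPU time consumed so far by the calling thread, in seconds.
static double thread_cpu_seconds() {
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

// Fraction of elapsed wall-clock time that the calling thread spent on the CPU
// while waiting for a worker that finishes after `work`.
// spin == true: busy-poll an atomic flag; spin == false: sleep on a condition variable.
double waiting_cpu_fraction(bool spin, std::chrono::milliseconds work) {
    std::mutex m;
    std::condition_variable cv;
    std::atomic<bool> done{false};

    std::thread worker([&] {
        std::this_thread::sleep_for(work);  // the "device" is busy but uses no host CPU
        {
            std::lock_guard<std::mutex> lock(m);
            done.store(true);
        }
        cv.notify_one();
    });

    auto wall_start = std::chrono::steady_clock::now();
    double cpu_start = thread_cpu_seconds();

    if (spin) {
        while (!done.load()) { /* hard loop: keeps one core busy */ }
    } else {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return done.load(); });  // thread sleeps until notified
    }

    double cpu = thread_cpu_seconds() - cpu_start;
    double wall = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - wall_start).count();
    worker.join();
    assert(wall > 0.0);
    return cpu / wall;
}
```

A fraction near 1.0 means the thread was spinning, and a fraction near 0 means it slept. Taking thread_cpu_seconds() before and after a real cudaMemcpy or cudaDeviceSynchronize call in the same way would show which kind of wait the runtime actually does in a given setup.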