100% CPU use while waiting for kernel

After I call a kernel, I want the CPU to wait until it’s finished, then copy the results back to the host.

If I do something like:

    myKernel<<<grid, block, block.x*4*sizeof(int)>>>(dCurrentN);
    CUT_CHECK_ERROR("Kernel execution failed");
    CUDA_SAFE_CALL( cudaMemcpy( hCurrentN, dCurrentN, MemSize, cudaMemcpyDeviceToHost) );


The Memcpy will wait for the kernel to finish, as expected.

Everything runs fine, but the CPU spins at 100% while it waits.

I read somewhere that this is done to reduce latency. I can’t find the reference anymore; it may have been on the forum rather than in the docs.

I want the CPU to sleep while waiting, not poll.

I thought the solution was to call cudaThreadSynchronize() after the kernel.

If I add that, my program runs fine, but the CPU is still pegged at 100%.

As an ugly short-term workaround, I can manually put a Windows Sleep() call after the kernel… but that’s not what I want: it’s a hardwired time delay, which is bad for many reasons.

This is on an old G80 board as well as a new GTX 280, both using the latest CUDA 2.0 beta SDK on Windows XP.

How can I avoid the 100% CPU while waiting for a kernel to finish?

Short answer:
cudaMemcpy has an implicit cudaThreadSynchronize in it, and cudaThreadSynchronize spin-waits for the kernel to finish, to give you the absolute minimum latency between the kernel finishing and your CPU code continuing.

With CUDA 1.1 and newer, you can record an event after launching the kernel and then loop on cudaEventQuery until it reports the event as done. Inside the loop, put a nanosleep (I don’t know the equivalent Windows function) or whatever you like. Latency will be greatly increased, but you should no longer see 100% CPU usage.
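A minimal sketch of that pattern, assuming a placeholder kernel and launch configuration (on Windows, substitute Sleep() for the POSIX usleep() used here):

```cuda
#include <cuda_runtime.h>
#include <unistd.h>            // usleep(); on Windows use Sleep() from <windows.h>

__global__ void myKernel(int *data) { /* ... placeholder kernel ... */ }

void launchAndWaitPolitely(int *dCurrentN)
{
    dim3 grid(64), block(256);          // placeholder launch configuration

    cudaEvent_t done;
    cudaEventCreate(&done);

    myKernel<<<grid, block>>>(dCurrentN);
    cudaEventRecord(done, 0);           // enqueue the event right after the kernel

    // Poll the event ourselves instead of letting the driver spin-wait.
    // cudaEventQuery returns cudaErrorNotReady until the event has occurred.
    while (cudaEventQuery(done) == cudaErrorNotReady)
        usleep(100);                    // yield the CPU; tune to the kernel's runtime

    cudaEventDestroy(done);
}
```

The sleep interval trades latency for CPU usage: a sleep that is a small fraction of the expected kernel runtime keeps the added latency modest without pegging a core.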

Edit: The long answer is somewhere on the forums, posted just after CUDA 1.1 was released; cudaEventQuery should be a good keyword to search for.

Thanks for the reply. The polling method with Sleep() works, but on Windows Sleep() has a 1 ms minimum. That’s not great: my 2 ms kernel often gets overslept and I lose throughput. Alternatively, I can use cudaEventSynchronize(), but then the CUDA driver polls at 100%, and my CPU’s heat and power draw are maxed out all day.

Feature request to NVIDIA: WaitForEvent(). Sleep the CPU until the event triggers. It may have more latency than the current Synchronize() calls, but hopefully better resolution than a manual Sleep() polling loop.

What’s the shortest nanosleep available on Linux? Windows’ 1 ms is far too coarse.


In Linux 2.6, it looks like 1 ms is also the limit for most systems. This blog post mentions that shorter intervals can be obtained by using the real-time clock device:


It’s not portable like nanosleep(), since you have to read from /dev/rtc.
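A sketch of that /dev/rtc technique, following the Linux kernel’s RTC documentation (it requires the device to exist and usually elevated permissions; the interrupt frequency must be a power of two, up to 8192 Hz):

```c
#include <fcntl.h>
#include <linux/rtc.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/rtc", O_RDONLY);
    if (fd < 0) { perror("open /dev/rtc"); return 1; }

    /* Periodic interrupts at 2048 Hz, i.e. ~0.5 ms ticks. */
    if (ioctl(fd, RTC_IRQP_SET, 2048) < 0 ||
        ioctl(fd, RTC_PIE_ON, 0)      < 0) {
        perror("ioctl");
        return 1;
    }

    /* Each blocking read() puts the process to sleep until the next tick. */
    unsigned long data;
    for (int i = 0; i < 4; i++)
        read(fd, &data, sizeof data);   /* ~0.5 ms per iteration, CPU asleep */

    ioctl(fd, RTC_PIE_OFF, 0);
    close(fd);
    return 0;
}
```

Used as the sleep step in the cudaEventQuery polling loop, this gives sub-millisecond wakeups while the CPU actually sleeps.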

That is definitely outdated. While I can’t vouch for its accuracy, this one sounds more up to date:

Linux Highres timers

Oh, hey, that’s nice to see. I assume that patch made it into one of the later CUDA-supported Linux distributions.

For 2.0b2 I think so; newer Ubuntu versions should have it. Personally, I am using a newer, self-compiled kernel with SuSE 10.2 and it works fine too, except that you might have to integrate a patch into the driver installer for older driver versions.