cudaStreamSynchronize() causes high CPU load

I have the following task: an async copy from host to device, on-device computation, and an async copy back from device to host. The task uses streams, as recommended by NVIDIA.
It runs in a loop (one iteration starts about every 20 ms). I use cudaStreamSynchronize() to detect when the stream operations have finished, but the CPU load is still high.
The manual says that the CPU thread blocks until the stream has finished, but it seems that the CPU is constantly polling some flag, semaphore, or something similar.
How can I lower the CPU load while still keeping the host and device synchronized? Is there something like WaitForSingleObject() and SetEvent() in CUDA?
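
For reference, here is a minimal sketch of the kind of loop I mean. It is not my real code: the kernel `process`, the helper `run_iterations`, the buffer size, and the doubling computation are all placeholders made up for illustration. Only the pattern (async copy in, kernel, async copy out, then cudaStreamSynchronize(), repeated about every 20 ms) matches what I described.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real on-device computation.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: runs `iterations` passes of copy-in, compute, copy-out
// on one stream. Returns the number of passes whose result checked out,
// or -1 if any CUDA call failed.
int run_iterations(int iterations) {
    const int n = 1 << 20;                 // placeholder buffer size
    const size_t bytes = n * sizeof(float);
    float *h_buf = nullptr, *d_buf = nullptr;
    cudaStream_t stream = nullptr;
    int ok = 0;
    bool failed = false;

    // Pinned host memory is needed for the copies to be truly asynchronous.
    if (cudaMallocHost((void **)&h_buf, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d_buf, bytes) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return -1;
    }

    const int block = 256;
    const int grid = (n + block - 1) / block;

    for (int it = 0; it < iterations && !failed; ++it) {
        for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

        // All three operations are queued on the same stream, so they
        // execute in order on the device.
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
        process<<<grid, block, 0, stream>>>(d_buf, n);
        cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

        // This is the wait that shows high CPU load in my case.
        if (cudaStreamSynchronize(stream) != cudaSuccess) {
            failed = true;
            break;
        }
        if (h_buf[0] == 2.0f && h_buf[n - 1] == 2.0f) ++ok;

        // The next iteration starts roughly 20 ms later.
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
    }

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return failed ? -1 : ok;
}

int main() {
    int ok = run_iterations(5);
    std::printf("verified iterations: %d\n", ok);
    return ok < 0 ? 1 : 0;
}
```

The sketch uses the default device flags; I have not set any scheduling flags myself.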