Putting the host process to sleep during kernel execution

Now that I have switched to a multithreaded host app to support multiple GPUs, the problem is back:
CPU usage goes up from 4% to 25% per thread.
If I do a sleep() in the host thread before calling cudaThreadSynchronize(), it stays at 4%.
If I use only one card with the same multithreaded code, CPU usage always stays at
4% (i.e., without calling sleep()). I set the cudaDeviceBlockingSync flag (via cudaSetDeviceFlags()) before cudaSetDevice(), right after host thread creation.
cudaDeviceScheduleYield is even worse for CPU usage.
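
For reference, here is a minimal sketch of what each of my host threads does (dummyKernel and workerThread are placeholders for the real work, error checking omitted):

#include <pthread.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* Placeholder kernel standing in for the real workload. */
__global__ void dummyKernel(void) {}

void *workerThread(void *arg)
{
    int dev = *(int *)arg;

    /* Request blocking sync BEFORE the context is created:
       cudaSetDeviceFlags() must precede the first call that
       touches the device in this thread. */
    cudaSetDeviceFlags(cudaDeviceBlockingSync);
    cudaSetDevice(dev);

    dummyKernel<<<1, 1>>>();

    /* sleep(1); */  /* workaround: sleeping here keeps CPU at ~4% */

    /* With blocking sync this should put the thread to sleep
       instead of spinning, but with multiple GPUs it burns
       ~25% CPU per thread. */
    cudaThreadSynchronize();
    return NULL;
}

int main(void)
{
    int deviceCount = 0;
    pthread_t threads[8];
    int ids[8];

    cudaGetDeviceCount(&deviceCount);

    for (int i = 0; i < deviceCount && i < 8; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, workerThread, &ids[i]);
    }
    for (int i = 0; i < deviceCount && i < 8; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}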

Is there an issue with blocking synchronization in a multithreaded program? Or with the order of initialization?
Thanks for your help.