Putting the host process to sleep during kernel execution

This does not work on Windows (Vista). I know my kernel takes 20 ms, so on Linux I sleep 20 ms
before calling cudaThreadSynchronize, which then returns after 0-2 ms. When I do this on Windows, I sleep 20 ms and still spend
20 ms in cudaThreadSynchronize (using 100% CPU in its spinlock).
Any ideas?

Either wait on a blocking sync event or pass the blocking sync flag before context creation.
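
A minimal sketch of the first option (blocking sync event), assuming the CUDA runtime API. busyKernel and its sizes are made-up placeholders for the real 20 ms kernel, not code from this thread:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel; the loop only burns GPU time.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 100000; ++k) x = x * 1.0000001f + 1e-7f;
        data[i] = x;
    }
}

int main()
{
    const int n = 1 << 20;  // arbitrary size, chosen for illustration only
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d_data, 0, n * sizeof(float));

    // An event created with cudaEventBlockingSync makes the host thread
    // block in cudaEventSynchronize until the event has completed.
    cudaEvent_t done;
    if (cudaEventCreateWithFlags(&done, cudaEventBlockingSync) != cudaSuccess) return 1;

    busyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(done, 0);  // recorded behind the kernel in stream 0

    cudaError_t err = cudaEventSynchronize(done);  // wait here, not in a sync-all call
    printf("sync: %s\n", cudaGetErrorString(err));

    cudaEventDestroy(done);
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

The flag is set per event, so this variant does not depend on what was passed at context creation.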

cudaSetDeviceFlags before cudaSetDevice, indeed. Thanks a lot.
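
For anyone finding this later, a minimal sketch of that call order, assuming device 0 and an empty placeholder kernel (both are assumptions, not from the posts above). cudaDeviceSynchronize is the current name for the deprecated cudaThreadSynchronize, and cudaDeviceBlockingSync was later renamed cudaDeviceScheduleBlockingSync. How strictly the order is enforced may differ between toolkit versions, so check the documentation for yours:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}  // placeholder for the real kernel

int main()
{
    // Order used in this thread: flags first, then device selection,
    // both before any call that would create the context.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceBlockingSync);
    if (err != cudaSuccess) {
        printf("cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }

    err = cudaSetDevice(0);  // device index 0 is an assumption
    if (err != cudaSuccess) {
        printf("cudaSetDevice: %s\n", cudaGetErrorString(err));
        return 1;
    }

    emptyKernel<<<1, 1>>>();

    // With blocking sync requested, the host thread is expected to block
    // here instead of spinning at 100% CPU.
    err = cudaDeviceSynchronize();
    printf("sync: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```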

Now that I have switched to a multithreaded host app to support multiple GPUs, the problem is back.
CPU usage goes up from 4% to 25% per thread.
If I do a sleep() in the host thread before calling cudaThreadSynchronize(), it stays at 4%.
If I use only one card with the same multithreaded code, CPU usage always stays at
4% (i.e. without calling sleep). I set cudaDeviceBlockingSync before cudaSetDevice() right after host thread creation.
cudaDeviceScheduleYield is even worse on the CPU.

Is there an issue with blocking synchronization in a multithreaded program? Order of initialization?
Thanks for your help.
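
One thing that could be tried to narrow this down is a blocking sync event per host thread, since that flag belongs to the event and not to the context. This is only a sketch and not a confirmed fix for the multi-GPU case. The worker function, the empty kernel and the one-thread-per-device layout are assumptions, and std::thread stands in for whatever threading API the real app uses:

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}  // placeholder for the real kernel

// Hypothetical per-GPU worker: one host thread drives one device.
void worker(int device)
{
    if (cudaSetDevice(device) != cudaSuccess) return;

    // The blocking behaviour is requested on the event itself.
    cudaEvent_t done;
    if (cudaEventCreateWithFlags(&done, cudaEventBlockingSync) != cudaSuccess) return;

    emptyKernel<<<1, 1>>>();
    cudaEventRecord(done, 0);

    // Wait on the event instead of calling a synchronize-everything function.
    cudaError_t err = cudaEventSynchronize(done);
    printf("device %d: %s\n", device, cudaGetErrorString(err));

    cudaEventDestroy(done);
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 1;

    std::vector<std::thread> threads;
    for (int d = 0; d < count; ++d) threads.emplace_back(worker, d);
    for (auto &t : threads) t.join();
    return 0;
}
```

If CPU usage per thread stays low with this variant but not with the device flag, that would point at how the flag is applied per thread or per device.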