cudaDeviceSynchronize blocking effect cudaDeviceScheduleBlockingSync


The documentation for cudaDeviceSynchronize seems to describe different behavior depending on whether the cudaDeviceScheduleBlockingSync flag is set.

What is the difference between the two?

Is “completed all preceding requested tasks” different from “device has finished its work”?

If they are the same, does cudaDeviceScheduleBlockingSync affect what is blocked, given that the second description specifies the “host thread” and the first doesn’t?

And finally, if the second blocks the “host thread”, what is blocked when the cudaDeviceScheduleBlockingSync flag is not set?


Or maybe the difference is that, with the cudaDeviceScheduleBlockingSync flag set, the host thread calling cudaDeviceSynchronize blocks until the device has finished all of its work, including requests made to the device by other host threads, whereas without the flag the host thread blocks only until its own requested calls have completed.

The difference is between “block”, “yield”, or “spin”, that is, in how the host thread waits, not in what it waits for. In the default “spin” setting the host thread enters a tight busy-waiting loop, so it consumes 100% of the CPU cycles of one core while waiting for the device to finish (unless the OS scheduler gives the CPU to a different thread). In the “yield” setting, the busy-waiting loop includes an OS call to actively yield the CPU to other threads, while in the “block” setting the host thread sleeps, using 0% CPU, until the GPU becomes ready again.
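To make the three settings concrete, here is a minimal sketch of how one might select them with cudaSetDeviceFlags before the first kernel launch. The kernel busy_kernel, the helper parse_schedule_flag, and the command-line words "spin", "yield", and "block" are made up for illustration; only the cudaDeviceSchedule* flags and the runtime calls come from the CUDA runtime API. Treat it as a sketch, not a benchmark.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical kernel whose only purpose is to keep the GPU busy for a while.
__global__ void busy_kernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.000001f + 0.5f;
        data[i] = v;
    }
}

// Hypothetical helper: map a command-line word to a scheduling flag.
// Anything unrecognized falls back to cudaDeviceScheduleAuto.
unsigned int parse_schedule_flag(const char *name)
{
    if (std::strcmp(name, "spin") == 0)  return cudaDeviceScheduleSpin;
    if (std::strcmp(name, "yield") == 0) return cudaDeviceScheduleYield;
    if (std::strcmp(name, "block") == 0) return cudaDeviceScheduleBlockingSync;
    return cudaDeviceScheduleAuto;
}

int main(int argc, char **argv)
{
    unsigned int flag = parse_schedule_flag(argc > 1 ? argv[1] : "auto");

    // Set the flag first, before any call that initializes the device.
    cudaError_t err = cudaSetDeviceFlags(flag);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int n = 1 << 20;
    float *d_data = nullptr;
    err = cudaMalloc(&d_data, n * sizeof(float));
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // The launch is asynchronous: control returns to the host immediately.
    busy_kernel<<<(n + 255) / 256, 256>>>(d_data, n, 100000);

    // The host thread waits here in every setting. The flag only changes
    // how it waits (busy loop, busy loop with yield, or sleeping).
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaDeviceSynchronize: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_data);
    return 0;
}
```

Running this once with "spin" and once with "block" while watching the process in a tool such as top should show the difference in host CPU usage during the wait; the point at which cudaDeviceSynchronize returns is the same in both cases.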

Thanks for your answer.

So, according to you, the cudaDeviceSynchronize function should be documented as:

I have an almost identical question:
when should I use cudaDeviceSynchronize?
Let’s say I have 2 devices that split the work of a simple array sum. Should I call it after the kernel launch?
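For the two-device case, a rough sketch might look like the following. It assumes "array sum" means an element-wise c = a + b split in half across device 0 and device 1; add_kernel and add_on_two_devices are hypothetical names, and n is assumed to be at least 2. Kernel launches are asynchronous, but a plain cudaMemcpy back to the host waits for the preceding kernel on that device's default stream, so in this particular pattern no separate cudaDeviceSynchronize is needed before reading the result. An explicit call is typically useful when the host needs to know the kernel has finished without such a blocking copy, for example for timing or when using asynchronous copies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Return 1 from the enclosing function on any CUDA error (sketch-level handling).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s\n", cudaGetErrorString(e_));     \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel: element-wise c[i] = a[i] + b[i].
__global__ void add_kernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper. Returns 0 on success, -1 if fewer than two devices
// are present, 1 on a CUDA error. Assumes n >= 2.
int add_on_two_devices(const float *a, const float *b, float *c, int n)
{
    int device_count = 0;
    CHECK(cudaGetDeviceCount(&device_count));
    if (device_count < 2) return -1;

    const int offset[2] = {0, n / 2};
    const int count[2]  = {n / 2, n - n / 2};
    float *d_a[2], *d_b[2], *d_c[2];

    // Phase 1: give each device its half and launch. The launches return
    // immediately, so both GPUs can be working at the same time.
    for (int dev = 0; dev < 2; ++dev) {
        size_t bytes = count[dev] * sizeof(float);
        CHECK(cudaSetDevice(dev));
        CHECK(cudaMalloc(&d_a[dev], bytes));
        CHECK(cudaMalloc(&d_b[dev], bytes));
        CHECK(cudaMalloc(&d_c[dev], bytes));
        CHECK(cudaMemcpy(d_a[dev], a + offset[dev], bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(d_b[dev], b + offset[dev], bytes, cudaMemcpyHostToDevice));
        add_kernel<<<(count[dev] + 255) / 256, 256>>>(d_a[dev], d_b[dev], d_c[dev], count[dev]);
        CHECK(cudaGetLastError());
    }

    // Phase 2: collect results. Each blocking cudaMemcpy waits for the kernel
    // launched earlier on that device's default stream.
    for (int dev = 0; dev < 2; ++dev) {
        size_t bytes = count[dev] * sizeof(float);
        CHECK(cudaSetDevice(dev));
        CHECK(cudaMemcpy(c + offset[dev], d_c[dev], bytes, cudaMemcpyDeviceToHost));
        CHECK(cudaFree(d_a[dev]));
        CHECK(cudaFree(d_b[dev]));
        CHECK(cudaFree(d_c[dev]));
    }
    return 0;
}

int main()
{
    const int n = 1 << 20;
    float *a = new float[n], *b = new float[n], *c = new float[n];
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f; }

    int rc = add_on_two_devices(a, b, c, n);
    if (rc == 0)
        std::printf("c[0] = %f, c[n-1] = %f\n", c[0], c[n - 1]);
    else if (rc == -1)
        std::printf("This sketch needs at least two CUDA devices.\n");

    delete[] a; delete[] b; delete[] c;
    return rc == 1 ? 1 : 0;
}
```

Note that cudaDeviceSynchronize only applies to the current device, so if you do want an explicit wait in a setup like this, you would call cudaSetDevice(dev) followed by cudaDeviceSynchronize() once per device.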