The documentation for cudaDeviceSynchronize seems to distinguish between the case where the cudaDeviceScheduleBlockingSync flag is set and the case where it is not.
What is the difference between the two?
Is “completed all preceding requested tasks” different from “device has finished its work”?
If they are the same, does cudaDeviceScheduleBlockingSync affect what is blocked, given that the second description specifies the “host thread” and the first does not?
And finally, if the second blocks the “host thread”, what is blocked when the cudaDeviceScheduleBlockingSync flag is not set?
Or maybe the difference is that, with the cudaDeviceScheduleBlockingSync flag set, the host thread calling cudaDeviceSynchronize blocks until the device has finished all its work, including requests from other host threads. And without the flag, the host thread blocks only until its own requested calls are completed.
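
To make the question concrete, here is a minimal sketch of the call sequence I have in mind. The kernel `fillKernel` and the helper `check` are hypothetical names used only for illustration. I am also assuming that the flag is set with cudaSetDeviceFlags before any other runtime call initializes the device.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical kernel: each thread writes its own index.
__global__ void fillKernel(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main() {
    // Assumption: the flag is set before the device is first used.
    check(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync), "cudaSetDeviceFlags");

    const int n = 1 << 20;
    int *d_out = nullptr;
    check(cudaMalloc((void **)&d_out, n * sizeof(int)), "cudaMalloc");

    // Kernel launches are asynchronous with respect to the host thread.
    fillKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    check(cudaGetLastError(), "kernel launch");

    // The call in question: the host thread waits here.
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    check(cudaFree(d_out), "cudaFree");
    return 0;
}
```

Removing the cudaSetDeviceFlags line gives the variant without the flag. What I would like to understand is what changes at the cudaDeviceSynchronize call between these two variants.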