What exactly are the differences, advantages, and disadvantages of using cudaEventRecord + cudaEventQuery versus cudaStreamQuery? From the CUDA documentation it seems that either approach could be used (after launching async memcpy and kernel calls in a stream) to determine whether the stream has finished all of its work. The only thing I see mentioned in the programming guide is this somewhat cryptic statement from the Implicit Synchronization section:
"For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:
- Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;
- Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.
Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:
- All independent operations should be issued before dependent operations,
- Synchronization of any kind should be delayed as long as possible."
That would seem to suggest that using events would be preferred, but it is never explicitly stated that cudaEventQuery performs differently.
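For concreteness, here is a minimal sketch of the two polling approaches being compared. The kernel `scaleKernel` and the helper `runPollingDemo` are hypothetical names made up for illustration, and most error checking is omitted. The one difference that is visible from the API alone is that an event marks a specific point in the stream, while cudaStreamQuery reports on everything queued in the stream at the time of the call.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, present only to give the stream some work to do.
__global__ void scaleKernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Issues copy-in, kernel, copy-out on one stream, then polls for completion
// with both approaches. Returns true if both polls succeed and the data is correct.
bool runPollingDemo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float* host = nullptr;
    float* dev = nullptr;
    cudaStream_t stream = nullptr;
    cudaEvent_t done = nullptr;

    // Pinned host memory so that cudaMemcpyAsync can actually run asynchronously.
    if (cudaMallocHost((void**)&host, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess) { cudaFreeHost(host); return false; }
    cudaStreamCreate(&stream);
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(dev, n, 2.0f);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);

    // Approach B: record an event after the work and poll the event.
    // The event completes once everything issued before it in the stream is done.
    cudaEventRecord(done, stream);
    cudaError_t eventStatus;
    while ((eventStatus = cudaEventQuery(done)) == cudaErrorNotReady) {
        // other CPU work could go here
    }

    // Approach A: poll the stream directly.
    // cudaSuccess means all work issued to the stream so far has finished.
    cudaError_t streamStatus;
    while ((streamStatus = cudaStreamQuery(stream)) == cudaErrorNotReady) {
        // other CPU work could go here
    }

    bool ok = eventStatus == cudaSuccess && streamStatus == cudaSuccess &&
              host[0] == 2.0f && host[n - 1] == 2.0f;

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```

Both loops exit on anything other than cudaErrorNotReady, so a real error ends the polling instead of spinning forever.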
Also, on Windows, is there any advantage to using cudaEventSynchronize (with an event created with cudaEventBlockingSync) instead of calling cudaEventQuery + Sleep(0) in a loop? Likewise, is there any advantage to cudaStreamSynchronize over cudaStreamQuery + Sleep(0) in a loop?
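To make the second question concrete, here is a rough sketch of the two waiting strategies for the event case. `busyKernel`, `waitBlocking`, `waitPolling`, and `runWaitDemo` are hypothetical names, and `std::this_thread::yield()` is used as a portable stand-in for Sleep(0), which is an assumption on my part and not an exact equivalent.

```cuda
#include <cuda_runtime.h>
#include <thread>

// Hypothetical kernel that just burns some GPU time.
__global__ void busyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Option 1: let the runtime wait. If the event was created with
// cudaEventBlockingSync, the host thread blocks until the event completes.
cudaError_t waitBlocking(cudaEvent_t event) {
    return cudaEventSynchronize(event);
}

// Option 2: poll the event and give up the time slice between polls
// (stand-in for the cudaEventQuery + Sleep(0) loop).
cudaError_t waitPolling(cudaEvent_t event) {
    cudaError_t status;
    while ((status = cudaEventQuery(event)) == cudaErrorNotReady) {
        std::this_thread::yield();
    }
    return status;
}

// Launches the kernel twice and waits once with each strategy.
// Returns true if every call succeeded.
bool runWaitDemo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float* dev = nullptr;
    cudaStream_t stream = nullptr;
    cudaEvent_t blockingEvent = nullptr;
    cudaEvent_t pollingEvent = nullptr;

    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess) return false;
    cudaMemset(dev, 0, bytes);
    cudaStreamCreate(&stream);
    cudaEventCreateWithFlags(&blockingEvent, cudaEventBlockingSync | cudaEventDisableTiming);
    cudaEventCreateWithFlags(&pollingEvent, cudaEventDisableTiming);

    busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaEventRecord(blockingEvent, stream);
    cudaError_t first = waitBlocking(blockingEvent);

    busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaEventRecord(pollingEvent, stream);
    cudaError_t second = waitPolling(pollingEvent);

    bool ok = first == cudaSuccess && second == cudaSuccess &&
              cudaGetLastError() == cudaSuccess;

    cudaEventDestroy(blockingEvent);
    cudaEventDestroy(pollingEvent);
    cudaStreamDestroy(stream);
    cudaFree(dev);
    return ok;
}
```

The sketch only shows that both strategies reach the same result. It does not measure CPU usage or wake-up latency, which is the part the question is really about.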