CUDA context flags CU_CTX_SCHED_YIELD vs CU_CTX_BLOCKING_SYNC

Really simple question for anyone here (though I suspect the NV guys can give more info than the slightly lacking doco).

What exactly is the difference between CU_CTX_SCHED_YIELD and CU_CTX_BLOCKING_SYNC?
Or more specifically, what exactly does CU_CTX_BLOCKING_SYNC do?

The general impression I get from the doco is that CU_CTX_SCHED_YIELD is similar to spin, except that it yields between polls instead of constantly polling the device as fast as possible.

But how does this differ from CU_CTX_BLOCKING_SYNC? Its description doesn’t say whether it spins or yields, almost as if it refers to a completely different concept. The doco almost makes it seem like spin/yield scheduling and blocking sync apply to two different types of device wait (waiting for synchronous kernel launches / memory copies, versus waiting for streams or events).

So!

SCHED_YIELD does exactly what you think it does: it spins, and while it spins, it calls yield.

BLOCKING_SYNC does something totally different. It doesn’t spin at all, it just goes to sleep. When the GPU is done with a piece of work, it sends an interrupt that eventually makes its way back to the thread calling cudaThreadSynchronize(). This adds some latency, but if you’re waiting for a long period and don’t mind that latency, there’s zero CPU utilization for waiting threads.
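
To make the difference concrete, here is a rough host-only analogy in C++. It makes no CUDA calls and is not how the driver is implemented; a worker thread stands in for the GPU, and the names WaitStats, wait_spin_yield and wait_blocking are made up for this sketch. The first function waits the SCHED_YIELD way (poll a flag, yield between polls), the second waits the BLOCKING_SYNC way (sleep until notified, with the notification standing in for the interrupt).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical result type for this sketch only.
struct WaitStats {
    std::uint64_t checks;  // how many times the waiter looked at the "done" state
    bool completed;        // true once the simulated work has finished
};

// Analogy for SCHED_YIELD: the waiting thread stays runnable, polls a flag,
// and yields its time slice between polls.
WaitStats wait_spin_yield(std::chrono::milliseconds work_time) {
    std::atomic<bool> done{false};
    std::thread device([&] {
        std::this_thread::sleep_for(work_time);  // simulated GPU work
        done.store(true, std::memory_order_release);
    });
    std::uint64_t checks = 0;
    while (!done.load(std::memory_order_acquire)) {
        ++checks;
        std::this_thread::yield();  // let other threads run, then poll again
    }
    device.join();
    return {checks, true};
}

// Analogy for BLOCKING_SYNC: the waiting thread sleeps on a condition
// variable and is only woken when the "device" signals completion.
WaitStats wait_blocking(std::chrono::milliseconds work_time) {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::thread device([&] {
        std::this_thread::sleep_for(work_time);  // simulated GPU work
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_one();  // stands in for the completion interrupt
    });
    std::uint64_t checks = 0;
    {
        std::unique_lock<std::mutex> lock(m);
        // The predicate runs once before sleeping and again on each wakeup.
        cv.wait(lock, [&] { ++checks; return done; });
        assert(done);
    }
    device.join();
    return {checks, true};
}
```

Calling both with something like std::chrono::milliseconds(20) and comparing the checks field should show the trade-off described above: the yielding waiter typically looks at the flag many times while it waits, while the blocking waiter normally looks only a couple of times and spends the rest of the wait asleep, at the cost of the wakeup latency.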

Great! :)

So I’d be right in assuming CU_CTX_SCHED_* and CU_CTX_BLOCKING_SYNC are all mutually exclusive (despite having slightly different names) - and all relate to synchronization of ‘anything’ (synchronous kernel/memcpy launches / explicit context/stream synchronization / any synchronizations)?
