CUDA Callback function context

What is the context (thread) in which callback functions registered with cudaStreamAddCallback are called? Is it different than the main program thread? Is there a way to wait for any callback to execute without consuming CPU time?

Yes, it's a different thread from the main program thread. The callback runs in a thread created by the CUDA driver.

cuda - What thread runs the callback passed to cudaStreamAddCallback? - Stack Overflow

Since the complete context is not described, this should be treated as an abstract, implementation-defined methodology for handling a callback. Therefore I would be cautious about trying to create direct synchronization between it and your program code.

The CUDA-aware way to wait for a callback is to record an event in the same stream the callback was issued into, right after the callback, and then call cudaEventSynchronize() on that event before the code that has to wait for the callback. In my view, this is the method that fits the programming model.

At that point, whether or not that cudaEventSynchronize() call uses CPU time depends on how you have the synchronization flags set.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g69e73c7dda3fc05306ae7c811a690fac
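To make the event-after-callback pattern concrete, here is a minimal sketch, not a definitive implementation. The names fill, onStreamReached, runDemo and the CHECK macro are made up for illustration. It assumes that creating the event with cudaEventBlockingSync is an acceptable way to get a blocking (non-spinning) wait; the device-wide alternative is cudaSetDeviceFlags() with cudaDeviceScheduleBlockingSync, which is what the link above documents.

```cuda
#include <atomic>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: return 1 from the enclosing function on any CUDA error.
// (A sketch only: resources created earlier are not released on this path.)
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Trivial kernel so the stream has some work queued ahead of the callback.
__global__ void fill(int *data, int value, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Runs on a driver-owned thread, not on the thread that called
// cudaStreamAddCallback. It must not make CUDA API calls.
void CUDART_CB onStreamReached(cudaStream_t, cudaError_t status, void *userData) {
    auto *result = static_cast<std::atomic<int> *>(userData);
    result->store(status == cudaSuccess ? 1 : -1);
}

int runDemo() {
    const int n = 256;
    int *dData = nullptr;
    std::atomic<int> callbackResult{0};
    cudaStream_t stream;
    cudaEvent_t afterCallback;

    CHECK(cudaMalloc(&dData, n * sizeof(int)));
    CHECK(cudaStreamCreate(&stream));
    // cudaEventBlockingSync: the host thread blocks in cudaEventSynchronize()
    // instead of spinning. Timing is disabled because it is not needed here.
    CHECK(cudaEventCreateWithFlags(&afterCallback,
                                   cudaEventBlockingSync | cudaEventDisableTiming));

    // Stream order: kernel -> callback -> event.
    fill<<<1, n, 0, stream>>>(dData, 7, n);
    CHECK(cudaGetLastError());
    CHECK(cudaStreamAddCallback(stream, onStreamReached, &callbackResult, 0));
    CHECK(cudaEventRecord(afterCallback, stream));

    // The event sits behind the callback in the stream, so this call
    // returns only after the callback has finished.
    CHECK(cudaEventSynchronize(afterCallback));
    const int result = callbackResult.load();
    std::printf("callback result: %d\n", result);

    CHECK(cudaEventDestroy(afterCallback));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dData));
    return result == 1 ? 0 : 1;
}

int main() { return runDemo(); }
```

Built with something like nvcc demo.cu -o demo, the main thread never synchronizes directly with the driver's callback thread. It only waits on the event, which keeps everything inside the stream-ordering model. The std::atomic is just a conservative choice for passing the result between the two threads.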

Thanks, very helpful.