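A CUDA kernel launch returns to the host asynchronously; the host only waits for the kernel if you call cudaDeviceSynchronize() (or another synchronizing call) after it.
To be notified when the kernel finishes, you can enqueue a callback in the same stream with cudaLaunchHostFunc() or cudaStreamAddCallback().
The callback is a host (CPU) function that the runtime calls once all preceding operations in the stream have completed.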
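For host functions (callbacks), prefer cudaLaunchHostFunc() over cudaStreamAddCallback(), which the CUDA documentation marks as slated for eventual deprecation.
Both behave almost the same way. The main difference is that a cudaStreamAddCallback() callback also receives the stream and an error status and is called even if an earlier operation failed, whereas a function enqueued with cudaLaunchHostFunc() is not called in that case.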
The commands that are issued in a stream after a host function do not start executing before the function has completed.
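
As a rough illustration of the above, here is a minimal sketch that launches a kernel in a stream and enqueues a host function behind it with cudaLaunchHostFunc(). The kernel addOne, the callback onKernelDone, and the helper run_callback_demo are hypothetical names made up for this example, and error handling is kept to a minimum.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: adds 1 to each element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Host function called by the CUDA runtime once all earlier work in the
// stream has completed. It must not call any CUDA API functions.
void CUDART_CB onKernelDone(void *userData) {
    int *flag = static_cast<int *>(userData);
    *flag = 1;  // record that the callback ran
}

// Returns 0 if the callback ran, 1 otherwise.
int run_callback_demo() {
    const int n = 256;
    int *d_data = nullptr;
    int done = 0;
    cudaStream_t stream;

    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return 1;
    cudaMemsetAsync(d_data, 0, n * sizeof(int), stream);

    // The launch returns to the host immediately.
    addOne<<<(n + 127) / 128, 128, 0, stream>>>(d_data, n);

    // Enqueued behind the kernel in the same stream.
    cudaLaunchHostFunc(stream, onKernelDone, &done);

    // The host is free to do other work here. We synchronize only so that
    // 'done' is safe to read and the resources can be released.
    cudaStreamSynchronize(stream);

    cudaFree(d_data);
    cudaStreamDestroy(stream);
    return done == 1 ? 0 : 1;
}

int main() {
    int rc = run_callback_demo();
    printf("callback %s\n", rc == 0 ? "ran" : "did not run");
    return rc;
}
```

In real code you would usually not synchronize right after enqueueing the callback; the callback itself would signal the rest of the program (for example through a condition variable or an atomic flag).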