cudaStreamAddCallback not supported under Multi-Process Service (MPS)

Running the simpleCallback sample fails with error 71 (cudaErrorNotSupported) when the nvidia-cuda-mps-control daemon is running. When the daemon is not running, the sample runs fine, so this is a compatibility issue with MPS.

Is there a reason this is not supported on MPS? Are there plans to support it?
Is there a good alternative to using cudaStreamAddCallback?

Probably you know this already, but this limitation is documented:

Yes, I saw that.

Do you know of an alternative, MPS-compatible way to run host code once all enqueued items on a stream have completed?

Nothing elegant comes to mind.

You could enqueue a kernel into that stream that sets a zero-copy host variable. Then have a thread spin on that variable and wait for a change. You’d have to decide how often you want to poll it. Once the change is detected, run your particular host code.
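
A minimal sketch of that idea, assuming a single device and mapped (zero-copy) pinned memory. The kernel names work and signal_done, the host function on_stream_done, and the 100 microsecond poll interval are made up for illustration, and I have not verified this under MPS:

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical stand-in for the real work queued on the stream.
__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * i;
}

// Enqueued last: writes the completion flag that lives in mapped host memory.
__global__ void signal_done(volatile int *flag) {
    *flag = 1;
    __threadfence_system();  // make the write visible to the host
}

// Hypothetical host code to run once the stream has drained.
void on_stream_done() { std::printf("stream finished, running host code\n"); }

int main() {
    // Allow mapped pinned allocations (must precede context creation).
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    const int n = 1 << 20;
    float *d_data = nullptr;
    CHECK(cudaMalloc((void **)&d_data, n * sizeof(float)));

    // Zero-copy flag: one host allocation, plus its device-side alias.
    int *h_flag = nullptr;
    int *d_flag = nullptr;
    CHECK(cudaHostAlloc((void **)&h_flag, sizeof(int), cudaHostAllocMapped));
    *h_flag = 0;
    CHECK(cudaHostGetDevicePointer((void **)&d_flag, h_flag, 0));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Polling thread: sleeps between reads instead of busy-spinning.
    volatile int *flag = h_flag;
    std::thread poller([flag] {
        while (*flag == 0)
            std::this_thread::sleep_for(std::chrono::microseconds(100));
        on_stream_done();
    });

    work<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    signal_done<<<1, 1, 0, stream>>>(d_flag);  // runs after work in stream order
    CHECK(cudaGetLastError());

    poller.join();
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFreeHost(h_flag));
    CHECK(cudaFree(d_data));
    return 0;
}
```

The poll interval is the trade-off mentioned above: a shorter sleep reacts faster but burns more CPU, and the flag has to be reset to 0 before it can be reused for the next batch of work.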

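You could also record an event into that stream. Then in another stream, do a cudaStreamWaitEvent on that event, followed by the particular host code you wanted to run. Note that cudaStreamWaitEvent does not block the host thread, so the host still has to synchronize on that second stream (or on the event itself, with cudaEventSynchronize or cudaEventQuery) before running that code.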
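
A rough sketch of the event variant, again with made-up names (work, on_stream_done, main_stream, wait_stream) and not tested under MPS. Since cudaStreamWaitEvent only orders the second stream behind the event and returns immediately on the host, this version has a helper thread block in cudaStreamSynchronize on the second stream before it runs the host code:

```cuda
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical stand-in for the real work queued on the stream.
__global__ void work(int *out) { *out = 42; }

// Hypothetical host code to run once the stream has drained.
void on_stream_done(const int *d_out) {
    int result = 0;
    CHECK(cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost));
    std::printf("stream finished, result = %d\n", result);
}

int main() {
    cudaStream_t main_stream, wait_stream;
    cudaEvent_t done;
    CHECK(cudaStreamCreate(&main_stream));
    CHECK(cudaStreamCreate(&wait_stream));
    CHECK(cudaEventCreateWithFlags(&done, cudaEventDisableTiming));

    int *d_out = nullptr;
    CHECK(cudaMalloc((void **)&d_out, sizeof(int)));

    // Queue the work, then mark "everything so far" with an event.
    work<<<1, 1, 0, main_stream>>>(d_out);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(done, main_stream));

    // Device-side dependency only: wait_stream will not advance past this
    // point until the event completes. The host is not blocked here.
    CHECK(cudaStreamWaitEvent(wait_stream, done, 0));

    // Helper thread blocks until wait_stream (and therefore the event) is
    // complete, then runs the host code.
    std::thread waiter([&] {
        CHECK(cudaStreamSynchronize(wait_stream));
        on_stream_done(d_out);
    });

    // The main thread is free to do other work here.

    waiter.join();

    CHECK(cudaFree(d_out));
    CHECK(cudaEventDestroy(done));
    CHECK(cudaStreamDestroy(wait_stream));
    CHECK(cudaStreamDestroy(main_stream));
    return 0;
}
```

If the second stream is not needed for anything else, calling cudaEventSynchronize(done) in the helper thread should give the same result with less setup, and cudaEventQuery(done) gives a non-blocking check if you would rather poll.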