Wait for all (I mean all) streams to finish on a device

I’m profiling a complex application that I didn’t write and don’t have much control over. I want to issue a call akin to cudaDeviceSynchronize() or cudaStreamSynchronize(0) to wait for ALL streams created by the app that are currently in flight on the GPU, so that I can measure the wall-clock time elapsed between two locations in the code. Is it possible to do this?

The app uses the driver API, and no context is current on the thread at the point where I’d make the call. There is only one GPU. Please don’t tell me to use a profiler.

Thanks for any info.