Interaction between `cuStreamDestroy` and `cuvidMapVideoFrame`

I have a series of frames being output from an instance of NVDEC. Every single frame receives its own CUstream. Each of those streams ends up with several operations enqueued on it, roughly as in the sketch after this list:

  1. cuMemAllocFromPoolAsync – Allocate my own memory for the frame
  2. cuvidMapVideoFrame64 – Map the frame
  3. cuMemcpyAsync – Copy out of the frame into owned memory
  4. cuEventRecord – Mark that the copy has completed and the memory may be used
  5. cuvidUnmapVideoFrame – Unmap the frame
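
In code, the per-frame enqueue looks roughly like this (a simplified sketch with hypothetical names; error checking and the exact NV12 chroma layout are glossed over, and the stream, event, pool, and frame dimensions come from the surrounding decode loop):

```c
#include <cuda.h>
#include "nvcuvid.h"

/* Hypothetical per-frame enqueue matching steps 1-5 above (no error checking). */
static void enqueue_frame(CUvideodecoder decoder, CUmemoryPool pool,
                          int picIdx, unsigned width, unsigned height,
                          CUstream stream, CUevent copyDone,
                          CUdeviceptr *ownedOut)
{
    /* 1. Allocate owned memory for the frame, stream-ordered. */
    size_t frameBytes = (size_t)width * height * 3 / 2;          /* packed NV12 */
    cuMemAllocFromPoolAsync(ownedOut, frameBytes, pool, stream);

    /* 2. Map the decoder output surface; output_stream associates the map with this stream. */
    CUVIDPROCPARAMS vpp = {0};
    vpp.progressive_frame = 1;
    vpp.output_stream = stream;
    unsigned long long src = 0;
    unsigned int srcPitch = 0;
    cuvidMapVideoFrame64(decoder, picIdx, &src, &srcPitch, &vpp);

    /* 3. Copy the pitched surface into the packed owned buffer
     *    (simplification: assumes chroma rows directly follow the luma rows). */
    CUDA_MEMCPY2D m = {0};
    m.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    m.srcDevice     = (CUdeviceptr)src;
    m.srcPitch      = srcPitch;
    m.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    m.dstDevice     = *ownedOut;
    m.dstPitch      = width;
    m.WidthInBytes  = width;
    m.Height        = (size_t)height * 3 / 2;
    cuMemcpy2DAsync(&m, stream);

    /* 4. Mark that the copy has finished and the owned memory may be consumed. */
    cuEventRecord(copyDone, stream);

    /* 5. Unmap the decoder surface. */
    cuvidUnmapVideoFrame64(decoder, src);
}
```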

My question has two parts:

Part one:
In some cases, I realize part of the way through processing a frame that I will not use it. Ideally, I'd be able to avoid wasting decoder time and call cuStreamDestroy on the stream to prevent its remaining work from executing. I have two fears with this approach, though:

Fear 1: cuvidUnmapVideoFrame will never be called. Despite the fact that the docs clearly state:

In case the device is still doing work in the stream hStream when cuStreamDestroy() is called, the function will return immediately and the resources associated with hStream will be released automatically once the device has completed all work in hStream.

I am not confident that an NVDEC output surface will be correctly tracked by the stream, as it comes from a separate API and its lifetime management is explicitly the user's responsibility. Even if I measure that it does get freed, I'd like to be confident that this is guaranteed behavior and that it will not change in future CUDA or NVDEC releases.

Fear 2: How can I manage the lifetime of the memory created by cuMemAllocFromPoolAsync? Even if I added an event between steps 1 and 2 to verify that the pointer has been allocated, I can never free the memory because the stream may still be processing. If I call cuStreamDestroy and then call cuMemFreeAsync, it's possible that the free completes before the destroyed stream's remaining work does, and then the still-running stream performs an illegal memory access.

Are these fears well-founded? Are there any synchronization methods I could employ to mitigate them?

Part two: Priority Inversion?
If a high-priority stream is waiting (via cuStreamWaitEvent) on an event recorded in a low-priority stream, is the priority of the producing stream automatically elevated, or must I manually set the priority on the producing stream? Does the stream scheduler even work on small enough time scales to make priority tweaking worthwhile for my use case?
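
Concretely, the setup I'm asking about looks roughly like this (a sketch with made-up names; error checking omitted):

```c
#include <cuda.h>

CUstream producer, consumer;
CUevent  frameReady;

void setup(void)
{
    int leastPri, greatestPri;   /* numerically smaller value = higher priority */
    cuCtxGetStreamPriorityRange(&leastPri, &greatestPri);
    cuStreamCreateWithPriority(&producer, CU_STREAM_NON_BLOCKING, leastPri);    /* low priority  */
    cuStreamCreateWithPriority(&consumer, CU_STREAM_NON_BLOCKING, greatestPri); /* high priority */
    cuEventCreate(&frameReady, CU_EVENT_DISABLE_TIMING);
}

void per_frame(void)
{
    /* ... producer work enqueued on the low-priority stream ... */
    cuEventRecord(frameReady, producer);

    /* The high-priority stream is gated here until the producer reaches the event. */
    cuStreamWaitEvent(consumer, frameReady, 0 /* flags */);
    /* ... consumer work enqueued on the high-priority stream ... */
}
```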

I can’t speak to the mechanics of cuvid. NVDEC/NVENC questions belong on another forum.

However, the plan of calling cuStreamDestroy to prevent the stream from executing doesn't look right to me. If you have issued work into a stream, cuStreamDestroy does not prevent that work from executing. In fact, I know of no way to prevent that work from executing, barring pathological approaches like a device reset, or something equally tragic like a kernel with an assert or trap in it.

Managing the lifetime of memory allocated from the stream-ordered memory allocator doesn't seem like it should be an issue. Issue a cuMemFreeAsync at the point in the stream processing when you know the memory won't be needed. All previous work needing it (issued in that stream) should complete before it gets freed.
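
Something like this (a rough sketch reusing the names from your per-frame loop):

```c
/* Stream-ordered allocation at the start of the per-frame work. */
CUdeviceptr owned;
cuMemAllocFromPoolAsync(&owned, frameBytes, pool, stream);   /* step 1 */

/* ... map, copy, cuEventRecord, unmap, plus anything else that reads `owned` ... */

/* Issued last in the same stream: the free is stream-ordered, so it cannot
 * take effect until all previously issued work in `stream` has completed. */
cuMemFreeAsync(owned, stream);
```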

No, the priority of the producing stream is not automatically elevated. You also can't adjust the priority of work already issued into a particular stream. But a high-priority stream stuck at a cuStreamWaitEvent call is not issuing work to the GPU or consuming execution resources, so that by itself shouldn't hinder progress of another stream, from what I can see. Stream priority really only affects the block selection order for the CUDA Work Distributor (block scheduler), and that only applies to kernels that have already been launched.