Occasional Deadlock On cuMemcpyHtoDAsync/nvEncMapInputResource With NVENC

I have an app that receives images, processes them with CUDA, and compresses the results into an H.264 video stream that's saved to a file.
This works, but after a variable number of frames (generally between 5 and 1000) the app deadlocks.

The structure of the app is as follows (a rough sketch appears after the list):

  1. Images are asynchronously uploaded to the GPU.
  2. CUDA kernels that process the image are queued.
  3. A callback is registered to fire when the kernels complete.
  4. Once the callback fires, the finished CUDA frame is placed in a thread-safe queue, from which a dedicated thread passes it to NVENC to be synchronously compressed.
  5. If the app receives a new frame before compression of the previous frame has finished, the new frame is uploaded on a different CUDA stream.
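
For concreteness, here's a stripped-down sketch of that flow. All names, the queue implementation, and the kernel step are simplified stand-ins for the real code, and error checking is omitted:

```cpp
#include <cuda.h>
#include <condition_variable>
#include <mutex>
#include <queue>

struct FrameQueue {                       // thread-safe queue between threads
    std::queue<CUdeviceptr> frames;
    std::mutex m;
    std::condition_variable cv;
    void push(CUdeviceptr f) {
        { std::lock_guard<std::mutex> l(m); frames.push(f); }
        cv.notify_one();
    }
    CUdeviceptr pop() {                   // compression thread blocks here
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !frames.empty(); });
        CUdeviceptr f = frames.front(); frames.pop(); return f;
    }
};

struct FrameJob { FrameQueue* queue; CUdeviceptr devFrame; };

// Steps 3/4: fires on the upload/kernel stream once the kernels complete;
// it only hands the frame to the compression thread's queue and returns.
static void CUDA_CB onFrameReady(CUstream, CUresult status, void* userData) {
    FrameJob* job = static_cast<FrameJob*>(userData);
    if (status == CUDA_SUCCESS) job->queue->push(job->devFrame);
    delete job;
}

// Steps 1-3 for one incoming frame, on whichever stream is free (step 5).
void submitFrame(CUstream stream, const void* hostImage, size_t bytes,
                 CUdeviceptr devFrame, FrameQueue* queue) {
    cuMemcpyHtoDAsync(devFrame, hostImage, bytes, stream);  // step 1
    // step 2: processing kernels would be launched on `stream` here
    cuStreamAddCallback(stream, onFrameReady,
                        new FrameJob{queue, devFrame}, 0);   // step 3
}
```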

When the deadlock occurs, the first thread freezes in cuMemcpyHtoDAsync and the compression thread freezes in nvEncMapInputResource. To me, this suggests that some additional synchronization is needed between CUDA and NVENC when CUDA is using more than one stream. The only relevant mention I can find is https://devtalk.nvidia.com/default/topic/791948/gpu-accelerated-libraries/nvenc-and-synchronization/, but despite that thread being four years old there doesn't seem to be any consensus.
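
One form such synchronization could take is event-based: record a CUevent on the producing stream and have the compression thread wait on it before mapping the resource. This is a sketch of the idea, not a confirmed fix; everything other than the driver API calls is a hypothetical name:

```cpp
#include <cuda.h>

struct ReadyFrame { CUdeviceptr devFrame; CUevent done; };

// Producer side: called after the kernels have been queued on `stream`.
ReadyFrame markReady(CUstream stream, CUdeviceptr devFrame) {
    ReadyFrame f{devFrame, nullptr};
    cuEventCreate(&f.done, CU_EVENT_DISABLE_TIMING);
    cuEventRecord(f.done, stream);   // signals once upload + kernels finish
    return f;
}

// Compression thread: block until the frame is really finished before
// handing it to NVENC.
void compress(ReadyFrame& f /*, NVENC session etc. */) {
    cuEventSynchronize(f.done);      // CPU-side wait on the producing stream
    cuEventDestroy(f.done);
    // ... nvEncMapInputResource / nvEncEncodePicture on f.devFrame ...
}
```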

Does anyone have any suggestions for what additional synchronization is needed between CUDA and NVENC?
Does the compression thread need to have the CUDA context pushed before interacting with NVENC?
Any and all advice/tips for this would be helpful, as I’ve been stuck on this problem for a while now.
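
For reference, this is what I mean by pushing the context in the compression thread (a sketch; `g_cuCtx` is a hypothetical handle to the context the streams were created in):

```cpp
#include <cuda.h>

extern CUcontext g_cuCtx;   // shared with the upload/kernel threads

void compressOneFrame(/* frame + NVENC handles */) {
    cuCtxPushCurrent(g_cuCtx);        // make the context current on this thread
    // ... nvEncMapInputResource / nvEncEncodePicture /
    //     nvEncUnmapInputResource on the frame ...
    CUcontext popped = nullptr;
    cuCtxPopCurrent(&popped);         // restore whatever was current before
}
```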

OK, I believe I've figured out the deadlock. My CUDA callback would occasionally block waiting for the frame handler, which in turn would occasionally block waiting for the compression thread, which in turn would occasionally block waiting for the CUDA callback to complete. The result was a three-way standoff (a circular wait) between the threads.
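
Stripped down to the lock ordering, the cycle looked something like this (all names hypothetical; each function runs on its own thread, and each holds one lock while blocking on the next thread's lock, so none can make progress):

```cpp
#include <mutex>

std::mutex callbackLock;   // held while the CUDA callback runs
std::mutex handlerLock;    // held while the frame handler runs
std::mutex compressLock;   // held while a frame is being compressed

void cudaCallbackPath() {          // CUDA callback thread
    std::lock_guard<std::mutex> a(callbackLock);
    std::lock_guard<std::mutex> b(handlerLock);    // waits on frame handler
}
void frameHandlerPath() {          // frame handler thread
    std::lock_guard<std::mutex> a(handlerLock);
    std::lock_guard<std::mutex> b(compressLock);   // waits on compressor
}
void compressPath() {              // compression thread
    std::lock_guard<std::mutex> a(compressLock);
    std::lock_guard<std::mutex> b(callbackLock);   // waits on the callback
}
// Breaking any one edge fixes it. In my case: make the callback just push
// to the thread-safe queue and return, so it never blocks on handlerLock.
```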