The video capture software we’re working on fails about 5-10% of the time when creating the capture session, and we’re at a loss for how to debug it further.
Context
For context, our video capture code uses the newer API with NVFBC v8.0.4 to capture to GL. We then pass the captured frames to NVENC and handle the h264-encoded frames ourselves. Our program is multithreaded, with a main thread and separate threads for video and audio capture. Our NVFBC implementation is almost verbatim from NvFBCToGLEnc.c from the samples included with the latest NVIDIA Capture SDK. We run this on EC2 instances in AWS that are running Linux 5.4 with Ubuntu 20.04.
We’ve tried this on pretty much every valid combination of the following, with identical results:
- Driver version 440/450/460
- Old, deprecated NVFBC-NVENC API/New NVFBC-GL API
- g3s.xlarge/g4dn.xlarge/g4dn.2xlarge EC2 instances
- NVIDIA Capture SDK Headers v7.1.1/v8.0.4
Issue
Under certain conditions (resolution updates, for example), we need the capture session to update to the new state. NVFBC should take care of this automatically, but it often hangs for 10-12 seconds before returning with the error string vkCreateDevice failed: -3 or -4 (that’s VK_ERROR_INITIALIZATION_FAILED or VK_ERROR_DEVICE_LOST, respectively).
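In case it helps anyone reading logs like ours, here is a minimal sketch of how the numeric code can be pulled out of that error string and mapped to its Vulkan name. The helper names (vk_result_name, parse_vk_create_device_error) are made up for illustration and are not part of NVFBC or Vulkan, and the exact message format is assumed to match what we quoted above. Only the handful of VkResult values relevant here are covered.

```c
#include <stdio.h>
#include <string.h>

/* Map a few VkResult values to their names. The numeric values come from
 * the Vulkan specification; this is a partial table, not the full enum. */
const char *vk_result_name(int code) {
    switch (code) {
    case 0:  return "VK_SUCCESS";
    case -1: return "VK_ERROR_OUT_OF_HOST_MEMORY";
    case -2: return "VK_ERROR_OUT_OF_DEVICE_MEMORY";
    case -3: return "VK_ERROR_INITIALIZATION_FAILED";
    case -4: return "VK_ERROR_DEVICE_LOST";
    default: return "UNKNOWN_VK_RESULT";
    }
}

/* Hypothetical helper: extract the code from a message that contains
 * "vkCreateDevice failed: <n>". Returns 1 and stores <n> in *code on
 * success, 0 if the message does not match the assumed format. */
int parse_vk_create_device_error(const char *msg, int *code) {
    static const char prefix[] = "vkCreateDevice failed: ";
    const char *p = strstr(msg, prefix);
    if (p == NULL) {
        return 0;
    }
    return sscanf(p + sizeof(prefix) - 1, "%d", code) == 1;
}
```

Logging the decoded name next to each failure makes it easier to see whether -3 and -4 occur under different conditions.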
We tried circumventing this issue by destroying the capture session, updating the resolution, and then recreating the capture session, but still got this spurious failure: now the session creation call, nvFBCCreateCaptureSession, sometimes hangs for 10-12 seconds before failing with vkCreateDevice failed: -3/-4.
To reduce the problem to its simplest state, we now create a capture session, capture a few frames, and destroy the capture session in a loop, with no resolution resizing. (Here’s a very hacky example modified from NvFBCToGLEnc.c: nvfbc_loop.c - Pastebin.com – build it according to the makefile from the Capture SDK samples.) When we run this code directly, the capture session is repeatedly recreated with no problems at all (even running it at a rapid pace for up to 20 minutes!).
However, when we run this code in the video thread of our program, while leaving the rest of the program (main thread, audio thread, etc.) alive, then we start to see the spurious failures again, sometimes even on the first creation!
This could suggest that there’s something that we’re not understanding about running NVFBC in multithreaded applications, even though the documentation says it should be perfectly safe.
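To compare the standalone run with the in-program run on equal terms, something like the following harness could be called from either context. It is only a sketch: session_fn, loop_stats, and run_session_loop are hypothetical names, and the create/destroy callbacks are placeholders that would wrap the real NVFBC calls. The fake_create and fake_destroy stubs exist only so the sketch runs without a GPU; they do not imitate NVFBC behavior.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns 0 on success, non-zero on failure.
 * In real code these would wrap session creation and destruction. */
typedef int (*session_fn)(void *ctx);

typedef struct {
    int attempts;        /* number of create calls made */
    int failures;        /* number of create calls that failed */
    int first_failure;   /* iteration index of the first failure, or -1 */
    double max_create_s; /* longest create call, in seconds */
} loop_stats;

double monotonic_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Create and destroy a session `iterations` times, timing each create
 * call so that multi-second hangs show up in max_create_s. */
loop_stats run_session_loop(session_fn create, session_fn destroy,
                            void *ctx, int iterations) {
    loop_stats s = {0, 0, -1, 0.0};
    for (int i = 0; i < iterations; i++) {
        double t0 = monotonic_seconds();
        int rc = create(ctx);
        double dt = monotonic_seconds() - t0;
        s.attempts++;
        if (dt > s.max_create_s) {
            s.max_create_s = dt;
        }
        if (rc != 0) {
            s.failures++;
            if (s.first_failure < 0) {
                s.first_failure = i;
            }
            continue; /* creation failed, so there is nothing to destroy */
        }
        destroy(ctx);
    }
    return s;
}

/* Demo stubs only: ctx points to an int call counter, and every fourth
 * create call reports a failure. */
int fake_create(void *ctx) {
    int *calls = (int *)ctx;
    (*calls)++;
    return (*calls % 4 == 0) ? 1 : 0;
}

int fake_destroy(void *ctx) {
    (void)ctx;
    return 0;
}
```

Running the same loop once from a bare main and once from the video thread, then comparing failures and max_create_s, would give a like-for-like measurement of how often and how long the creation call stalls in each setting.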
I’d really appreciate pointers for how to proceed debugging this, as we’re at our wits’ end here. It’s especially annoying because we can test something successfully 5 or 6 times in a row, and then get our hearts broken by the error on the next run </3
Happy to provide any additional information that would be helpful. Thanks!