How to calculate appropriate value for CUDA/CUVID's CUVIDDECODECREATEINFO::ulNumDecodeSurfaces used in cuvidCreateDecoder()

I’ve been working on implementing a CUVID-based decoder, and found that the correctness/quality of decoded frames depends greatly on the value of the CUVIDDECODECREATEINFO::ulNumDecodeSurfaces member passed to cuvidCreateDecoder. However, I have been unable to find any decent documentation describing how the decoder uses that value or how to calculate an appropriate one.

To provide context, the decoder I’m working on decodes a live H264 stream, but that stream is sometimes comprised of live source sampling and at other times carries pre-recorded video content. Think of capturing a live stream from a camera, but occasionally cutting over to a pre-recorded advertisement whose H264 packets are spliced directly into the encoded live stream (rather than being transcoded into it). What I found is that when such content arrives at the decoder, the frames do not always get decoded correctly and there are artifacts in the decoded frames.

When the value is too low (originally it was 2), a lot of pixels are incorrect (as if some P-frames got applied and other P-frames were dropped): jitter in motion, missing area updates, wrong pixel colors, etc. (i.e. not a quality drop as if the resolution were reduced or the quantization parameter increased). When the value is higher (20 and even 25), most artifacts are gone, motion is smooth, and the video is good quality. And yet, in some frames I still briefly notice an occasional small area of incorrectly decoded pixels, or some subsection of the frame with motion inconsistent with the rest of the frame content.

In researching what exactly ulNumDecodeSurfaces stands for, I don’t see much explanation of the implications of its value for the decoding pipeline and the resulting output, beyond what the name suggests - a number of some sort of surfaces. I also found a few posts and code samples on the internet that take different approaches. FFmpeg hardcodes a constant #define MAX_FRAME_COUNT 25 and assigns that value directly to the member. CUVID’s own sample code does something even stranger; snippet reproduced below:

assert(cudaVideoChromaFormat_Monochrome == rVideoFormat.chroma_format ||
           cudaVideoChromaFormat_420        == rVideoFormat.chroma_format ||
           cudaVideoChromaFormat_422        == rVideoFormat.chroma_format ||
           cudaVideoChromaFormat_444        == rVideoFormat.chroma_format);
    // Fill the decoder-create-info struct from the given video-format struct.
    memset(&oVideoDecodeCreateInfo_, 0, sizeof(CUVIDDECODECREATEINFO));
    // Create video decoder
    oVideoDecodeCreateInfo_.CodecType           = rVideoFormat.codec;
    oVideoDecodeCreateInfo_.ulWidth             = rVideoFormat.coded_width;
    oVideoDecodeCreateInfo_.ulHeight            = rVideoFormat.coded_height;
    oVideoDecodeCreateInfo_.ulNumDecodeSurfaces = FrameQueue::cnMaximumSize;
    // Limit decode memory to 24MB (16M pixels at 4:2:0 = 24M bytes)
    // Keep atleast 6 DecodeSurfaces
    while (oVideoDecodeCreateInfo_.ulNumDecodeSurfaces > 6 &&
           oVideoDecodeCreateInfo_.ulNumDecodeSurfaces * rVideoFormat.coded_width * rVideoFormat.coded_height > 16 * 1024 * 1024)
    {
        oVideoDecodeCreateInfo_.ulNumDecodeSurfaces--;
    }
    oVideoDecodeCreateInfo_.ChromaFormat        = rVideoFormat.chroma_format;
    oVideoDecodeCreateInfo_.OutputFormat        = cudaVideoSurfaceFormat_NV12;

Their FrameQueue::cnMaximumSize is set to 20U (which I tried, and it seems insufficient, as artifacts still pop up - even at 25). But looking at their code I find more puzzling questions rather than answers:

  1. The assert check allows 4:2:0, 4:2:2, and 4:4:4 chroma formats, which have different bytes per pixel, yet the comment explicitly assumes 4:2:0 in the memory-size calculation. Seems odd.
  2. What is the rationale behind limiting to 24MB? Why not more or less?
  3. Why at least 6 surfaces as minimum? Why no more than 20 surfaces are considered?
  4. Why the inverse relationship between coded resolution and number of decode surfaces? With the 24MB / 16Mpixel anchoring constant, a lower resolution allows more decode surfaces. But why is the number of decode surfaces a function of coded resolution in the first place? What is the underlying relationship?

I thought the number of decode surfaces had to do with B-frames and the possibility of having to retain multiple reference frames as dictated by B-frames, but then how do I know how many reference frames an arbitrary video would want to retain at any one time? And how would that depend on resolution (i.e. question 4 above)?

CUVID’s decoder documentation is horribly beyond subpar and an iota above useless. The header-file comment at the declaration leaves much unexplained.

Are there any experts in the matter who can shed some light on what exactly that value controls, how it affects decoding, and how to calculate an appropriate value for an arbitrary H264 stream? Is there anything that needs to be checked in the on-sequence callback to reinitialize it to a new value, and if so, again, how should that be calculated?

Great question, I am wondering the exact same thing in regards to ulNumDecodeSurfaces. Did you ever reach a conclusion on this?


Unfortunately no, I have not gotten any response here, nor on a duplicate question on SO. Both are dead in the water :(.

I was really hoping that NVIDIA engineers would see the question and respond to it, since I’m pretty sure they would know from implementation of H.264 in their cards and API.

I did, since, have a thought that ulNumDecodeSurfaces is somehow related to num_ref_frames, and even found another unanswered question in this forum that seems to support my suspicions of the relationship -

But I haven’t gone back to see whether onSequence(…) would expose num_ref_frames, and whether re-initializing the decoder mid-stream to change the number of decode surfaces would work (the way you would reinitialize when detecting resolution or chroma changes).

If you do wind up running the experiment, I’d be curious to know the results. When I come back to the problem, I’ll try to keep this thread updated with my findings.

Keeping fingers crossed that perhaps NVIDIA would wake up and answer it :).

Hey NVIDIA!? What is the purpose of this forum if we don’t receive any help/answers on very important questions from YOU? I don’t see any reason for such a service if stuck engineers must answer each other’s questions without involvement from those who author these APIs. And the NVDEC documentation is so poor I can’t even believe how you release such a complex product without at least providing minimal documentation explaining the most important settings.

I may have found some answers regarding H.264. I believe ulNumDecodeSurfaces is the size (in frames) of the decoded picture buffer (DPB) that the decoder uses to store previously decoded pictures as reference for future pictures.

The H.264 standard specifies limits on the maximum DPB size a bitstream at a given level can require of a decoder.

From E.2.1:
max_dec_frame_buffering specifies the required size of the HRD decoded picture buffer (DPB) in units of frame buffers. It is a requirement of bitstream conformance that the coded video sequence shall not require a decoded picture buffer with size of more than Max( 1, max_dec_frame_buffering ) frame buffers to enable the output of decoded pictures. The value of max_dec_frame_buffering shall be greater than or equal to max_num_ref_frames. An upper bound for the value of max_dec_frame_buffering is specified by the level limits in clauses A.3.1.h, A.3.2.f, G.10.2.1, and H.10.2.

In A.3.1.h and A.3.2.f the maximum value is calculated with the equation:

    max_dec_frame_buffering <= Min( MaxDpbMbs / ( PicWidthInMbs * FrameHeightInMbs ), 16 )

where PicWidthInMbs and FrameHeightInMbs are the width and height of the video in macroblocks (i.e. pixel resolution / 16, using the maximum 16x16 H.264 macroblock size), and MaxDpbMbs is given for each level in Table A-1. Table A-1 and some example calculations are given on wikipedia here:

So to support an arbitrary H.264 bitstream I would think you would be safe setting ulNumDecodeSurfaces to 16.

However, the author of the new NVDEC (8.1.24) sample code assigns the value 20 in NvDecoder.cpp:91:

if (eCodec == cudaVideoCodec_VP9) {
        return 12;
    }

    if (eCodec == cudaVideoCodec_H264 || eCodec == cudaVideoCodec_H264_SVC || eCodec == cudaVideoCodec_H264_MVC) {
        // assume worst-case of 20 decode surfaces for H264
        return 20;
    }

    if (eCodec == cudaVideoCodec_HEVC) {
        // ref HEVC spec: A.4.1 General tier and level limits
        // currently assuming level 6.2, 8Kx4K
        int MaxLumaPS = 35651584;
        int MaxDpbPicBuf = 6;
        int PicSizeInSamplesY = (int)(nWidth * nHeight);
        int MaxDpbSize;
        if (PicSizeInSamplesY <= (MaxLumaPS>>2))
            MaxDpbSize = MaxDpbPicBuf * 4;
        else if (PicSizeInSamplesY <= (MaxLumaPS>>1))
            MaxDpbSize = MaxDpbPicBuf * 2;
        else if (PicSizeInSamplesY <= ((3*MaxLumaPS)>>2))
            MaxDpbSize = (MaxDpbPicBuf * 4) / 3;
        else
            MaxDpbSize = MaxDpbPicBuf;
        return (std::min)(MaxDpbSize, 16) + 4;
    }

For HEVC, the author calculates ulNumDecodeSurfaces using the max DPB size equations from the HEVC spec (section A.4.2) but adds +4 frames to the result.
For VP9, the author uses the value 12, which I believe comes from adding +4 to the constant NUM_REF_FRAMES in VP9 spec (section 3) (defined as “Number of frames that can be stored for future reference” and given value 8).

So, it seems that the author is taking the maximum DPB size allowed by each video format specification and adding 4 frames. My questions now are: what is the function of the extra decode surfaces? Are they a necessity or for performance? Why 4?

(I am referencing the H.264 rec version (04/2017), HEVC rec version (02/2018), and VP9 spec version 0.6.)