How to calculate appropriate value for CUDA/CUVID's CUVIDDECODECREATEINFO::ulNumDecodeSurfaces used in cuvidCreateDecoder()

I’ve been working on implementing a CUVID-based decoder, and found that the correctness/quality of decoded frames depends greatly on the value of the CUVIDDECODECREATEINFO::ulNumDecodeSurfaces member passed to cuvidCreateDecoder. However, I have been unable to find any decent documentation describing how the decoder uses that value or how to calculate an appropriate one.

To provide context, the decoder I’m working on decodes a live H264 stream, but that stream is sometimes comprised of live source sampling and at other times carries pre-recorded video content. Think of capturing a live stream from a camera, but occasionally cutting over to a pre-recorded advertisement whose H264 packets are spliced directly into the encoded live stream (rather than being transcoded into it). What I found is that when such content arrives at the decoder, the frames do not always get decoded correctly and there are artifacts in the decoded frames.

When the value is too low (originally it was 2), a lot of pixels are incorrect (as if some P-frames got applied and other P-frames were dropped): jitter in motion, missing area updates, wrong pixel colors, etc. (i.e. not a quality drop as if the resolution were reduced or the quantization parameter increased). When the value is higher (20 and even 25), most artifacts are gone, motion is smooth, and the video is good quality. And yet, in some frames I still briefly notice an occasional small area of incorrectly decoded pixels, or some subsection of the frame with motion inconsistent with the rest of the frame content.

In researching what exactly ulNumDecodeSurfaces stands for, I don’t see much explanation of the implications of its value for the decoding pipeline and the resulting output, beyond what the name suggests - a number of some sort of surfaces. I also found a few posts and code samples on the internet that take different approaches. FFmpeg hardcodes a constant #define MAX_FRAME_COUNT 25 and assigns that value directly to the member. CUVID’s own sample code does something even stranger; snippet reproduced below:

assert(cudaVideoChromaFormat_Monochrome == rVideoFormat.chroma_format ||
           cudaVideoChromaFormat_420        == rVideoFormat.chroma_format ||
           cudaVideoChromaFormat_422        == rVideoFormat.chroma_format ||
           cudaVideoChromaFormat_444        == rVideoFormat.chroma_format);
    // Fill the decoder-create-info struct from the given video-format struct.
    memset(&oVideoDecodeCreateInfo_, 0, sizeof(CUVIDDECODECREATEINFO));
    // Create video decoder
    oVideoDecodeCreateInfo_.CodecType           = rVideoFormat.codec;
    oVideoDecodeCreateInfo_.ulWidth             = rVideoFormat.coded_width;
    oVideoDecodeCreateInfo_.ulHeight            = rVideoFormat.coded_height;
    oVideoDecodeCreateInfo_.ulNumDecodeSurfaces = FrameQueue::cnMaximumSize;
    // Limit decode memory to 24MB (16M pixels at 4:2:0 = 24M bytes)
    // Keep atleast 6 DecodeSurfaces
    while (oVideoDecodeCreateInfo_.ulNumDecodeSurfaces > 6 &&
           oVideoDecodeCreateInfo_.ulNumDecodeSurfaces * rVideoFormat.coded_width * rVideoFormat.coded_height > 16 * 1024 * 1024)
    {
        oVideoDecodeCreateInfo_.ulNumDecodeSurfaces--;
    }
    oVideoDecodeCreateInfo_.ChromaFormat        = rVideoFormat.chroma_format;
    oVideoDecodeCreateInfo_.OutputFormat        = cudaVideoSurfaceFormat_NV12;

Their FrameQueue::cnMaximumSize is set to 20U (which I tried, and it seems insufficient, as artifacts still pop up - even at 25). But looking at their code I find more puzzling questions rather than answers:

  1. The assert check allows 4:2:0, 4:2:2, and 4:4:4 chroma formats, which have different bytes per pixel, yet the comment explicitly assumes 4:2:0 in the memory-size calculation. Seems odd.
  2. What is the rationale behind limiting to 24MB? Why not more or less?
  3. Why at least 6 surfaces as minimum? Why no more than 20 surfaces are considered?
  4. Why the inverse relationship between coded resolution and number of decode surfaces? With the 24MB / 16Mpixel anchoring constant, a lower resolution allows more decode surfaces. But why is the number of decode surfaces a function of coded resolution in the first place? What is the underlying relationship?

I thought the number of decode surfaces had to do with B-frames and the possibility of having to retain multiple reference frames as dictated by B-frames, but then how do I know how many reference frames an arbitrary video would want to retain at any one time? And how would that depend on resolution (i.e. question 4 above)?

CUVID’s decoder documentation is horribly beyond subpar and an iota above useless. The header-file comment at the declaration leaves much unexplained.

Are there any experts in the matter who can shed some light on what exactly that value controls, how it affects decoding, and how to calculate an appropriate value for an arbitrary H264 stream? Is there anything that needs to be checked in the on-sequence callback to reinitialize it to a new value, and if so, again, how should that be calculated?

Great question, I am wondering the exact same thing in regards to ulNumDecodeSurfaces. Did you ever reach a conclusion on this?


Unfortunately no, I have not gotten any response here, nor on a duplicate question on SO. Both are dead in the water :(.

I was really hoping that NVIDIA engineers would see the question and respond to it, since I’m pretty sure they would know from implementation of H.264 in their cards and API.

I did, since, have a thought that ulNumDecodeSurfaces is somehow related to num_ref_frames, and even found another unanswered question in this forum that seems to support my suspicions of the relationship -

But I haven’t gone back to see whether onSequence(…) would expose num_ref_frames, and whether re-initializing the decoder mid-stream to change the number of decode surfaces would work (the way you would reinitialize when detecting resolution or chroma changes).

If you do wind up running the experiment, I’d be curious to know the results. When I come back to the problem, I’ll try to keep this thread updated with my findings.

Keeping fingers crossed that perhaps NVIDIA would wake up and answer it :).

Hey NVIDIA!? What is the purpose of this forum if we don’t receive any help/answers on very important questions from YOU? I don’t see any reason for such a service if stuck engineers must answer each other’s questions without involvement from those who author these APIs. And the NVDEC documentation is so poor I can’t even believe how you release such a complex product without at least providing minimal documentation explaining the most important settings.

I may have found some answers regarding H.264. I believe ulNumDecodeSurfaces is the size (in frames) of the decoded picture buffer (DPB) that the decoder uses to store previously decoded pictures as reference for future pictures.

The H.264 standard specifies limits on the maximum DPB size a bitstream at a given level can require of a decoder.

From E.2.1:
max_dec_frame_buffering specifies the required size of the HRD decoded picture buffer (DPB) in units of frame buffers. It is a requirement of bitstream conformance that the coded video sequence shall not require a decoded picture buffer with size of more than Max( 1, max_dec_frame_buffering ) frame buffers to enable the output of decoded pictures. The value of max_dec_frame_buffering shall be greater than or equal to max_num_ref_frames. An upper bound for the value of max_dec_frame_buffering is specified by the level limits in clauses A.3.1.h, A.3.2.f, G.10.2.1, and H.10.2.

In A.3.1.h and A.3.2.f the maximum value is calculated with the equation:

    max_dec_frame_buffering <= Min( MaxDpbMbs / ( PicWidthInMbs * FrameHeightInMbs ), 16 )

where PicWidthInMbs and FrameHeightInMbs are the width and height of the video in macroblocks (i.e. pixel resolution / 16, using the maximum 16x16 H.264 macroblock size), and MaxDpbMbs is given for each level in Table A-1. Table A-1 and some example calculations are given on wikipedia here:

So to support an arbitrary H.264 bitstream I would think you would be safe setting ulNumDecodeSurfaces to 16.

However, the author of the new NVDEC (8.1.24) sample code assigns the value 20 in NvDecoder.cpp:91:

if (eCodec == cudaVideoCodec_VP9) {
        return 12;
    }

    if (eCodec == cudaVideoCodec_H264 || eCodec == cudaVideoCodec_H264_SVC || eCodec == cudaVideoCodec_H264_MVC) {
        // assume worst-case of 20 decode surfaces for H264
        return 20;
    }

    if (eCodec == cudaVideoCodec_HEVC) {
        // ref HEVC spec: A.4.1 General tier and level limits
        // currently assuming level 6.2, 8Kx4K
        int MaxLumaPS = 35651584;
        int MaxDpbPicBuf = 6;
        int PicSizeInSamplesY = (int)(nWidth * nHeight);
        int MaxDpbSize;
        if (PicSizeInSamplesY <= (MaxLumaPS>>2))
            MaxDpbSize = MaxDpbPicBuf * 4;
        else if (PicSizeInSamplesY <= (MaxLumaPS>>1))
            MaxDpbSize = MaxDpbPicBuf * 2;
        else if (PicSizeInSamplesY <= ((3*MaxLumaPS)>>2))
            MaxDpbSize = (MaxDpbPicBuf * 4) / 3;
        else
            MaxDpbSize = MaxDpbPicBuf;
        return (std::min)(MaxDpbSize, 16) + 4;
    }

For HEVC, the author calculates ulNumDecodeSurfaces using the max DPB size equations from the HEVC spec (section A.4.2) but adds +4 frames to the result.
For VP9, the author uses the value 12, which I believe comes from adding +4 to the constant NUM_REF_FRAMES in VP9 spec (section 3) (defined as “Number of frames that can be stored for future reference” and given value 8).

So, it seems that the author is taking the maximum DPB size allowed by each video format specification and adding 4 frames. My questions now are: what is the function of the extra decode surfaces? Are they a necessity or for performance? Why 4?

(I am referencing the H.264 rec version (04/2017), HEVC rec version (02/2018), and VP9 spec version 0.6.)