NVDEC: How to multithread decoding for better performance (lower latency)?

elb75 · April 12, 2020, 2:54am

Hello,
I have a C++ application that decodes and displays (via D3D) an h.264 encoded stream of 2048x2048x15fps video.

My top priority is low latency (minimal lag between image acquisition on camera → encoding → network transmission → decoding → image display on PC).

I have the application working with a ~280-320ms latency using ffmpeg parser+decoder (using CPU only, no GPU involved).

I am experimenting with the NVDEC, hoping to reduce the latency further. I have adapted your AppDecD3D into our application’s pipeline, and am successfully using the ffmpeg parser and the NVDEC decoder (via the NvDecoder class in the example)

However, the latency of our display has increased to ~380-450ms (with the exact same video stream, containing only I and P frames, no B frames).

I have set the bLowLatency flag in the NvDecoder constructor, set the CUVID_PKT_ENDOFPICTURE flag in the call to NvDecoder::Decode(), and tried setting CUVIDPARSERPARAMS.ulMaxNumDecodeSurfaces from 1 to 3.

The NVDEC Programming Guide pdf has a section for ‘4.8 Writing an efficient application,’ which suggests separate threads be used for cuvidDecodePicture and cuvidMapVideoFrame. Yet, the NvDecoder class seems to make both HandlePicture*() callbacks on the same thread.

Could you please:

a) Let me know whether latency on par with (or hopefully better than) ffmpeg CPU decode can be expected with additional work on a multithreaded pipeline?

and

b) Point me to some resources on how to implement the multithreaded approach described in the programming guide pdf. I have not found good information on how to call cuvidDecodePicture and cuvidMapVideoFrame from separate threads, and ensure that I am being thread safe, yet still achieving performance gains.

I’d be happy to supply example stream data, or give you any additional information that will help answer this. Time is somewhat of the essence on resolving this.

Thank you in advance,
-Eric

test setup:
Win10, Cuda10.1, NVDEC SDK 9.1.23, Quadro P2000, VS2019

elb75 · April 15, 2020, 2:37pm

Any thoughts? Thanks.

Ext3h · April 21, 2020, 9:03am

No, you can’t really expect any better latency. Just some empirical insight into the inner workings:

CUVID_PKT_ENDOFPICTURE can save you 1 frame of latency, as decoding may start before the next frame has been received. (Otherwise, decoding will only start with 1 frame of delay.)

bLowLatency only appears to disable driver side frame pacing, but otherwise it does not appear to have any effect at all.

CUVIDPARSERPARAMS.ulMaxNumDecodeSurfaces should match what your sequence format requires, plus 3-4 extra surfaces. That only helps throughput though, not latency.

If you care about latency, what you want to do is ignore the PictureDisplay callback altogether and just poll cuvidGetDecodeStatus. That requires that you are absolutely sure that decode order matches render order!

That callback is where a big part of the latency originates. PictureDisplay only fires once the parser is certain that no B-frames can be missing, i.e. frames that would come earlier in display order but much later in decode order than an already fully decoded frame. By that point, cuvidGetDecodeStatus has long since indicated that the frame was decoded successfully, and you could already have safely called cuvidMapVideoFrame.
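A minimal sketch of the polling idea, so the control flow is concrete. It does not call the NVDEC API: `DecodeState`, `pollDecodeStatus`, and the `query` callable are hypothetical stand-ins, and in a real application `query` would wrap cuvidGetDecodeStatus for the given picture index and translate its result. The timeout and poll interval are arbitrary assumptions to be tuned.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for the states a decode-status query can report.
enum class DecodeState { InProgress, Success, Error };

// Polls `query` for picture index `picIdx` until it reports something other
// than InProgress, or until `timeout` expires. Returns the last state seen,
// so a return value of InProgress means the wait timed out.
inline DecodeState pollDecodeStatus(
    const std::function<DecodeState(int)>& query,
    int picIdx,
    std::chrono::milliseconds timeout,
    std::chrono::microseconds interval = std::chrono::microseconds(200))
{
    assert(picIdx >= 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const DecodeState state = query(picIdx);
        if (state != DecodeState::InProgress) return state;
        if (std::chrono::steady_clock::now() >= deadline) return state;
        // Sleep briefly instead of spinning, to avoid burning a CPU core.
        std::this_thread::sleep_for(interval);
    }
}
```

With this shape, the caller would map the frame as soon as the function returns `Success`, instead of waiting for the PictureDisplay callback. That is only valid under the condition stated above: decode order must match display order.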

Unfortunately, the parser is a stupid black box, and you can’t hint to it that your sequence format actually contains no B-frames whatsoever.

If you want to do the mapping in a different thread, beware that you are likely to shoot yourself in the foot.
If you do that, you must synchronize the call to cuvidMapVideoFrame against the next call to cuvidDecodePicture, with regard to CUVIDPICPARAMS.CurrPicIdx (i.e. for the same surface index).

Once cuvidMapVideoFrame has completed, you are safe. It’s internally performing a copy from the decoding surface into a distinct output surface. So you can keep the frame safely mapped for a longer duration.
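One way the synchronization described above could be sketched, assuming the description is accurate: a surface index must not be handed to cuvidDecodePicture again until the previous picture in that surface has been mapped. `SurfaceGate` and its methods are hypothetical names, not part of the NVDEC SDK or the NvDecoder sample class, and the actual cuvid calls are left out so the snippet stays self-contained.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Tracks, per decode surface index, whether a decoded picture is still
// waiting to be mapped by the mapping thread.
class SurfaceGate {
public:
    explicit SurfaceGate(std::size_t numSurfaces) : pending_(numSurfaces, false) {}

    // Decode thread: call before submitting a picture whose CurrPicIdx is
    // `idx`. Blocks while the previous picture in that surface is unmapped,
    // then marks the surface as pending again.
    void acquireForDecode(std::size_t idx) {
        assert(idx < pending_.size());
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return !pending_[idx]; });
        pending_[idx] = true;
    }

    // Non-blocking variant: returns false if the surface is still pending.
    bool tryAcquireForDecode(std::size_t idx) {
        assert(idx < pending_.size());
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_[idx]) return false;
        pending_[idx] = true;
        return true;
    }

    // Map thread: call once the map call for surface `idx` has returned.
    void releaseAfterMap(std::size_t idx) {
        assert(idx < pending_.size());
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_[idx] = false;
        }
        cv_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::vector<bool> pending_;  // guarded by mutex_
};
```

The decode thread would call `acquireForDecode(CurrPicIdx)` right before cuvidDecodePicture, and the mapping thread would call `releaseAfterMap` as soon as cuvidMapVideoFrame returns. Releasing at that point, rather than after unmapping, relies on the statement above that the map call copies into a distinct output surface.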

elb75 · April 21, 2020, 8:20pm

Thank you for the detailed reply. I will try polling cuvidGetDecodeStatus in the next day or two and see what results I get. I may have follow up questions at that time.
-Eric.
