NVDEC: How to multithread decoding for better performance (lower latency)?

I have a C++ application that decodes and displays (via D3D) an H.264-encoded video stream at 2048x2048, 15 fps.

My top priority is low latency (minimal lag between image acquisition on camera -> encoding -> network transmission -> decoding -> image display on PC).

I have the application working with ~280-320 ms latency using the ffmpeg parser and decoder (CPU only, no GPU involved).

I am experimenting with NVDEC, hoping to reduce the latency further. I have adapted your AppDecD3D into our application’s pipeline, and am successfully using the ffmpeg parser and the NVDEC decoder (via the NvDecoder class in the example).

However, the latency of our display has increased to ~380-450 ms (with the exact same video stream, containing only I and P frames, no B frames).

I have set the bLowLatency flag in the NvDecoder constructor and set the CUVID_PKT_ENDOFPICTURE flag in the call to NvDecoder::Decode(). I have also tried setting CUVIDPARSERPARAMS.ulMaxNumDecodeSurfaces to values from 1 to 3.

The NVDEC Programming Guide PDF has a section, ‘4.8 Writing an efficient application,’ which suggests using separate threads for cuvidDecodePicture and cuvidMapVideoFrame. Yet the NvDecoder class seems to make both HandlePicture*() callbacks on the same thread.

Could you please:

a) Let me know whether latency on par with (or hopefully better than) ffmpeg CPU decode can be expected with additional work on a multithreaded pipeline?


b) Point me to some resources on how to implement the multithreaded approach described in the programming guide PDF? I have not found good information on how to call cuvidDecodePicture and cuvidMapVideoFrame from separate threads while staying thread safe and still achieving performance gains.

I’d be happy to supply example stream data, or give you any additional information that will help answer this. Time is somewhat of the essence on resolving this.

Thank you in advance,

Test setup:
Win10, CUDA 10.1, NVDEC SDK 9.1.23, Quadro P2000, VS2019

Any thoughts? Thanks.

No, you can’t really expect any better latency. Just some empirical insight into the inner workings:

CUVID_PKT_ENDOFPICTURE can save you 1 frame of latency, as decoding may start before the next frame has been received. (Otherwise, decoding will only start with a 1-frame delay.)
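
To put "1 frame" into numbers for the 15 fps stream from the question, here is a rough calculation in C++. The helper names are made up for illustration, and comparing the low and high ends of the reported latency ranges pairwise is an assumption:

```cpp
#include <cassert>

// Duration of one frame interval in milliseconds.
constexpr double frame_interval_ms(double fps) { return 1000.0 / fps; }

// Expresses a latency in units of frame intervals.
constexpr double latency_in_frames(double latency_ms, double fps) {
    return latency_ms / frame_interval_ms(fps);
}

// One frame interval at 15 fps is roughly 66.7 ms.
static_assert(frame_interval_ms(15.0) > 66.6 && frame_interval_ms(15.0) < 66.7,
              "one frame at 15 fps is about 66.7 ms");
```

With the figures from the question, the NVDEC path is about 100-130 ms slower than the CPU path (380 - 280 and 450 - 320), which latency_in_frames puts at roughly 1.5 to 2 frame intervals at 15 fps.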

bLowLatency only appears to disable driver-side frame pacing, but otherwise it does not appear to have any effect at all.

CUVIDPARSERPARAMS.ulMaxNumDecodeSurfaces should match what your sequence format requires, plus 3-4 extra surfaces. That’s just for throughput though, not for latency.

If you care about latency, ignore the PictureDisplay callback altogether and just poll cuvidGetDecodeStatus. That requires you to be absolutely sure that decode order matches render order!

A big part of the latency originates in that callback. PictureDisplay only fires once the parser is certain that no B-frames are missing, that is, frames that come earlier in display order but much later in decode order than a frame that is already fully decoded. By that point, cuvidGetDecodeStatus has long since indicated that the frame was decoded successfully, and you could already have called cuvidMapVideoFrame safely.
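
A minimal sketch of such a polling loop, in C++. To stay self-contained it does not include the NVDEC headers: DecodeState and the query callable are hypothetical stand-ins for the status reported by cuvidGetDecodeStatus, and wait_until_decoded is a name invented here. In a real pipeline, query would wrap cuvidGetDecodeStatus for the given picture index.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for the decode status reported by the driver.
enum class DecodeState { InProgress, Success, Error };

// Polls query(picIdx) until the picture is decoded, decoding fails,
// or the timeout expires. Returns true only on a successful decode.
inline bool wait_until_decoded(
    const std::function<DecodeState(int)>& query,
    int picIdx,
    std::chrono::milliseconds timeout,
    std::chrono::microseconds pollInterval = std::chrono::microseconds(200)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const DecodeState state = query(picIdx);
        if (state == DecodeState::Success) return true;   // safe to map now
        if (state == DecodeState::Error) return false;    // drop this picture
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(pollInterval);        // avoid a hot spin
    }
}
```

The poll interval trades CPU usage against added latency; the 200 microsecond default here is an arbitrary choice, not a recommendation from the SDK.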

Unfortunately, the parser is a stupid black box, and there is no way to hint to it that your sequence format actually contains no B-frames whatsoever.

If you want to do the mapping in a different thread, beware that you are likely to shoot yourself in the foot.
If you do that, you must synchronize the call to cuvidMapVideoFrame with the next call to cuvidDecodePicture that uses the same CUVIDPICPARAMS.CurrPicIdx, so that the map of a surface completes before that surface is decoded into again.

Once cuvidMapVideoFrame has completed, you are safe. Internally, it performs a copy from the decoding surface into a distinct output surface, so you can safely keep the frame mapped for a longer duration.
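
One way to express that per-surface synchronization between a decode thread and a mapping thread, sketched in C++ with only the standard library. SurfaceGate and its method names are hypothetical and not part of the NVDEC API; the sketch assumes the decode thread calls acquire_for_decode(CurrPicIdx) right before cuvidDecodePicture, and the mapping thread calls release_after_map(idx) right after cuvidMapVideoFrame returns.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Tracks, per decode surface, whether it holds a picture that still has to be mapped.
class SurfaceGate {
public:
    explicit SurfaceGate(std::size_t surfaceCount) : awaitingMap_(surfaceCount, false) {}

    // Decode thread: blocks while surface idx still holds an unmapped picture,
    // then marks the surface as holding a new picture.
    void acquire_for_decode(std::size_t idx) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return !awaitingMap_.at(idx); });
        awaitingMap_.at(idx) = true;
    }

    // Mapping thread: call once the map of surface idx has completed.
    void release_after_map(std::size_t idx) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            awaitingMap_.at(idx) = false;
        }
        cv_.notify_all();  // wake a decode thread waiting for this surface
    }

    // True if surface idx holds a picture that has not been mapped yet.
    bool awaiting_map(std::size_t idx) const {
        std::lock_guard<std::mutex> lock(mutex_);
        return awaitingMap_.at(idx);
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable cv_;
    std::vector<bool> awaitingMap_;
};
```

The surface is released after the map completes rather than after the unmap, which relies on the statement above that the mapped frame is a copy in a distinct output surface.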


Thank you for the detailed reply. I will try polling cuvidGetDecodeStatus in the next day or two and see what results I get. I may have follow-up questions at that time.