I have a C++ application that decodes and displays (via D3D) an H.264-encoded video stream at 2048x2048 resolution and 15 fps.
My top priority is low latency (minimal lag between image acquisition on camera -> encoding -> network transmission -> decoding -> image display on PC).
I have the application working with a ~280-320ms latency using ffmpeg parser+decoder (using CPU only, no GPU involved).
I am experimenting with NVDEC, hoping to reduce the latency further. I have adapted your AppDecD3D sample into our application’s pipeline, and am successfully using the ffmpeg parser and the NVDEC decoder (via the NvDecoder class in the example).
However, the latency of our display has increased to ~380-450ms (with the exact same video stream, containing only I and P frames, no B frames).
I have set the bLowLatency flag in the NvDecoder constructor, set the flag CUVID_PKT_ENDOFPICTURE in the call to NvDecoder::Decode(), and tried setting CUVIDPARSERPARAMS.ulMaxNumDecodeSurfaces from 1 to 3.
The NVDEC Programming Guide pdf has a section, ‘4.8 Writing an efficient application,’ which suggests separate threads be used for cuvidMapVideoFrame. Yet the NvDecoder class seems to make both HandlePicture*() callbacks on the same thread.
Could you please:
a) Let me know whether latency on par with (or hopefully better than) ffmpeg CPU decode can be expected with additional work on a multithreaded pipeline?
b) Point me to some resources on how to implement the multithreaded approach described in the programming guide pdf? I have not found good information on how to call cuvidDecodePicture and cuvidMapVideoFrame from separate threads in a way that is thread safe yet still achieves performance gains.
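To make question (b) concrete, here is a minimal sketch of the two-thread structure I have in mind, as far as I understand the guide. It uses only the C++ standard library. PictureQueue and RunPipeline are hypothetical names of my own, and the NVDEC calls appear only as comments marking where cuvidDecodePicture and cuvidMapVideoFrame would presumably go. It does not attempt CUDA context handling or decode-surface reuse, which are exactly the parts I am unsure about.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical hand-off queue between a decode thread and a map/display thread.
// It carries only picture indices; the frame data itself would stay on the GPU.
class PictureQueue {
public:
    // Called by the decode thread once a picture has been submitted for decoding.
    void Push(int picIdx) {
        {
            std::lock_guard<std::mutex> lock(m_);
            assert(!closed_ && "Push after Close");
            q_.push(picIdx);
        }
        cv_.notify_one();
    }

    // Signals that no more pictures will arrive.
    void Close() {
        {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }

    // Blocks until an index is available; returns nullopt once closed and drained.
    std::optional<int> Pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        int picIdx = q_.front();
        q_.pop();
        return picIdx;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
    bool closed_ = false;
};

// Runs the two-thread pipeline over nPictures and returns the display order.
std::vector<int> RunPipeline(int nPictures) {
    PictureQueue ready;
    std::vector<int> displayed;  // written only by the display thread

    std::thread decodeThread([&] {
        for (int i = 0; i < nPictures; ++i) {
            // Placeholder: the real code would call cuvidDecodePicture() here.
            ready.Push(i);
        }
        ready.Close();
    });

    std::thread displayThread([&] {
        while (auto picIdx = ready.Pop()) {
            // Placeholder: cuvidMapVideoFrame(), copy to the D3D surface,
            // then cuvidUnmapVideoFrame().
            displayed.push_back(*picIdx);
        }
    });

    decodeThread.join();
    displayThread.join();
    return displayed;
}
```

Calling RunPipeline(n) hands n picture indices from one thread to the other in FIFO order. What I cannot tell from the documentation is whether this kind of hand-off is sufficient for the real cuvid calls, or whether additional locking around the decoder is required.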
I’d be happy to supply example stream data, or give you any additional information that will help answer this. Time is somewhat of the essence on resolving this.
Thank you in advance,
Win10, CUDA 10.1, NVDEC SDK 9.1.23, Quadro P2000, VS2019