NVENC HEVC ultra low latency with FFmpeg libraries, what should be my expectations?

I’m developing an application that captures raw video frames from a live source and then uses the FFmpeg libraries (av_* calls etc.) to filter and encode the frames, with the encoding done by NVENC. I’m using the latest FFmpeg code together with NVENC SDK 10.

I am trying to reduce latency to the absolute minimum, so I’m using the “ultra low latency” tuning info, which I understand pre-configures the more detailed settings to favour low latency.
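For reference, the configuration I’m describing looks roughly like this on the FFmpeg side (a sketch, not my exact code; enc_ctx is assumed to be an AVCodecContext allocated for the hevc_nvenc encoder, and the option names are the hevc_nvenc private options):

```c
#include <libavutil/opt.h>

/* enc_ctx: AVCodecContext allocated via avcodec_alloc_context3()
 * for the "hevc_nvenc" encoder, before avcodec_open2(). */
av_opt_set(enc_ctx->priv_data, "preset", "p1", 0);  /* fastest SDK 10 preset */
av_opt_set(enc_ctx->priv_data, "tune",   "ull", 0); /* ultra-low-latency tuning info */
enc_ctx->max_b_frames = 0;                          /* I+P only, no B-frames */
```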

As the trace snippet below shows, I find that I need to submit three uncompressed frames to NVENC before getting back the first encoded frame. Thereafter, the output is always two frames behind the input, which in a 50 fps stream is a sizeable 40 ms delay (2 × 20 ms). Is this expected, or is there a way to reduce the latency further? Ideally I’d submit one frame and be able to read back an encoded one right away, i.e. the compression algorithm would only be “looking back” rather than “looking forward” (and thus having to wait for more frames).

Perhaps it’s an issue with FFmpeg, but before I go down that road I wanted to ask the developers here what should be possible with NVENC in this respect: what is the minimum latency achievable with HEVC and the appropriate settings?

24/07/20 15:21:30.6209305 : MWDEVICE_VIDEO_SERVER_CAPTURE: Captured video frame index 1000 with pts 33865544733947
24/07/20 15:21:30.6210664 : VIDEO_SERVER_TRANSCODER_STREAM_SESSION: Decoded video frame index 1000 with pts 33865544733947
24/07/20 15:21:30.6226942 : VIDEO_SERVER_TRANSCODER_STREAM_SESSION: Filtered video frame index 1000 with pts 169327723
24/07/20 15:21:30.6418622 : MWDEVICE_VIDEO_SERVER_CAPTURE: Captured video frame index 1001 with pts 33865544732114
24/07/20 15:21:30.6419119 : VIDEO_SERVER_TRANSCODER_STREAM_SESSION: Decoded video frame index 1001 with pts 33865544933947
24/07/20 15:21:30.6435439 : VIDEO_SERVER_TRANSCODER_STREAM_SESSION: Filtered video frame index 1001 with pts 169327724
24/07/20 15:21:30.6608526 : MWDEVICE_VIDEO_SERVER_CAPTURE: Captured video frame index 1002 with pts 33865544932106
24/07/20 15:21:30.6609842 : VIDEO_SERVER_TRANSCODER_STREAM_SESSION: Decoded video frame index 1002 with pts 33865545133947
24/07/20 15:21:30.6626590 : VIDEO_SERVER_TRANSCODER_STREAM_SESSION: Filtered video frame index 1002 with pts 169327725
24/07/20 15:21:30.6631337 : VIDEO_SERVER_TRANSCODER_STREAM_SESSION: Encoded video frame index 1000 with pts 304789901400 (was 169327723)

That is a strange result if there is no other performance bottleneck (for example, another 100 streams being encoded in parallel, B-frames in use, the card in a “powersaving” state (>P0), or a huge encoded image such as 8K…). I have been using the NVENC SDK directly for many years with the “low-latency” preset and I+P frames only, and the usual per-frame encoding delay is about 1 ms for 1080p.
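When driving the SDK directly, the low-latency setup amounts to a few fields in the initialisation structures, roughly as below (a fragment only, under SDK 10’s preset/tuning-info model; params and config are assumed to be an NV_ENC_INITIALIZE_PARAMS and NV_ENC_CONFIG filled in elsewhere):

```c
/* Assumed context: params and config are otherwise fully initialised
 * before nvEncInitializeEncoder() is called. */
params.tuningInfo = NV_ENC_TUNING_INFO_ULTRA_LOW_LATENCY; /* or ..._LOW_LATENCY */
params.presetGUID = NV_ENC_PRESET_P1_GUID;                /* fastest preset */
config.frameIntervalP = 1;                                /* IPPP GOP, no B-frames */
```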

I am going to investigate further inside the FFmpeg code. It’s the only encode taking place, but the delay seems to be caused by the encoder lagging two frames behind, i.e. it only emits the first encoded output frame after the third input frame is provided, even with B-frames disabled.

I am going to add some logging to FFmpeg and try to figure out why it doesn’t emit a frame sooner.

Without B frames I assume it should provide one encoded frame for each input frame?

One encoded frame is produced for every input frame (I, P, or B), but when B-frames are used they come out in “delayed” (decode) order.

Right, so in I-P-P mode I should get back the first encoded frame after submitting the first raw frame, the second encoded frame after submitting the second raw frame, and so on, with the only delay being the hardware’s execution time to encode a frame?

If this is correct, and assuming it works fine when using the NVENC SDK directly, then maybe it’s an issue with FFmpeg either incorrectly setting some configuration option or doing its own buffering.

OK, for the benefit of any future Googlers: the issue is caused by the “delay” parameter (internally referred to as async_depth) passed to FFmpeg for nvenc. The code in FFmpeg’s nvenc.c buffers this many extra frames before emitting any; it is intended to support parallel encoding.

The problem is that its default value is INT_MAX, which is later clamped to the number of initialised NVENC surfaces minus one; the surface count defaults to 4 in this scenario, so the encoder buffers frames before producing output.
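So the fix in library code is to override the default before opening the encoder (a sketch; enc_ctx is again assumed to be the AVCodecContext for hevc_nvenc):

```c
#include <libavutil/opt.h>

/* Before avcodec_open2(): override the INT_MAX default on the nvenc
 * "delay" private option (async_depth) so nvenc.c does not hold back
 * extra frames before emitting the first encoded packet. */
av_opt_set_int(enc_ctx->priv_data, "delay", 0, 0);
```

On the ffmpeg command line the equivalent appears to be passing -delay 0 to the hevc_nvenc encoder.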

It seems a bit strange that the default does not favour the (surely?) more common case of non-parallel encoding, but there you go.