Debugging slow NVDEC h264 decoding using FFMPEG -- time is spent in avcodec_send_packet()

Hi all,

I am using NVDEC with FFMPEG to decode videos, and I am seeing decoding times slower than multicore CPU decoding.

It appears that most of the time is spent in avcodec_send_packet. I captured a trace using nsys; attached is a screenshot demonstrating the issue.

The code looks as follows:

void decodeOneFrame() {
  nvtx3::scoped_range function_range{"decodeOneFrame"};
  while (true) {
    int status = avcodec_receive_frame(...);
    if (status == 0) break;  // Got a decoded frame.
    // Otherwise the decoder needs more input (AVERROR(EAGAIN)).
    ... = av_read_frame(...);  // Read the next packet from the file.
    {
      nvtx3::scoped_range send_range{"avcodec_send_packet"};
      avcodec_send_packet(...);
    }
  }
}
// The color conversion from NV12 to RGB is done on the device itself using NPP.

The profile looks like the following:

For some reason avcodec_send_packet takes a long time, and only at the very end of the function do I see calls to cuvidMapVideoFrame64, ConvertNV12BLtoNV12, cuMemcpy2DAsync twice (once for Y and once for UV, I assume), and ForEachPixelNaiveBy4 (the color conversion, I assume).

All of these functions take a fraction of the time compared to avcodec_send_packet. Why is that the case? I also don't see any interesting memory copy or kernel on the GPU timeline during 90% of the time of avcodec_send_packet(). Only the last 10% of avcodec_send_packet() actually does device memcopies and runs device kernels. How do I make this faster?

Note that I am profiling with:

nsys profile --trace nvvideo,cuda,nvtx

Am I missing something? How do I make this video decoder go faster? What is the cuvid h264 codec doing in the first 90% of avcodec_send_packet()?

EDIT: It appears I can't see any H2D memcpys, even though I can see D2D memcpys and memsets. Maybe the "whitespace" in avcodec_send_packet is due to H2D transfers that are not captured by the profiler?
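One way to check whether the gap is CPU-side blocking (rather than missed GPU events) is to also trace OS runtime calls and CPU context switches. A sketch, assuming a recent nsys release with these options; `./decoder` is a placeholder for the actual application binary:

```shell
# osrt traces blocking OS runtime library calls (mmap, futex, read, ...),
# and cpuctxsw records when threads are scheduled off the CPU -- both show
# up as rows on the CPU timeline, which should explain the "whitespace".
nsys profile --trace=nvvideo,cuda,nvtx,osrt \
             --sample=process-tree --cpuctxsw=process-tree \
             ./decoder
```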

Are you joking? It is parsing the SEI, SPS, and PPS, and transforming the stream to Annex B, as the legacy (well, still not deprecated) h264_cuvid requires Annex B.

Try this: ffmpeg -i file.mp4 -c copy file.h264
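To make that suggestion explicit: for MP4 input, the Annex B conversion can be done once up front with the h264_mp4toannexb bitstream filter, so the decoder gets Annex B input directly. A sketch with placeholder file names; the first command just generates a short test clip (assuming an ffmpeg build with libx264):

```shell
# Generate a short H.264 clip in an MP4 container.
ffmpeg -y -loglevel error -f lavfi -i testsrc2=duration=1:size=320x240:rate=25 \
       -c:v libx264 file.mp4
# Remux once to an Annex B elementary stream -- no re-encode, so the
# per-packet mp4-to-Annex-B work is paid a single time up front.
ffmpeg -y -loglevel error -i file.mp4 -c:v copy -bsf:v h264_mp4toannexb file.h264
```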

Thank you for the reply @val.zapod.vz

My question was actually specific to the large amount of idle time that I see inside avcodec_send_packet where seemingly no work is done on the CPU or GPU (including memcopies). This could be an artifact of profiling itself, or perhaps I am doing something wrong here.

Here is a diagram to clarify my question. Please look at the text in red: “What is happening in this window?”

The main CPU thread is only running for a short amount of time. It is actually descheduled for most of the time as seen in the figure.

I also collected some stack traces, and in many of the sampled traces the thread appears to be servicing page faults.


pt_main_thread 3945124 2835920.743929:     541947 cycles: 
        ffffffff813657f7 clear_page_erms+0x7 (/boot/vmlinux-5.19.0-)
        ffffffff81019719 __folio_alloc+0xd99 (/boot/vmlinux-5.19.0-)
        ffffffff8108a8c7 vma_alloc_folio+0x217 (/boot/vmlinux-5.19.0-)
        ffffffff81087769 handle_mm_fault+0x559 (/boot/vmlinux-5.19.0-)
        ffffffff8107e31a do_user_addr_fault+0x1ba (/boot/vmlinux-5.19.0-)
        ffffffff81da9bdf exc_page_fault+0x5f (/boot/vmlinux-5.19.0-)
        ffffffff81e00b62 asm_exc_page_fault+0x22 (/boot/vmlinux-5.19.0-)
            7f5d2e56b47a __memset_evex_unaligned_erms+0x7a (/usr/lib64/libc.so.6)
            7f5c31afa2a9 ff_h264_alloc_tables+0x79 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31af043e h264_slice_header_init+0x2ee (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31af1d38 ff_h264_update_thread_context+0x2d8 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31cf097e update_context_from_thread+0x27e (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31cf0f44 ff_thread_decode_frame+0x224 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31a0bf82 decode_receive_frame_internal+0x4a2 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31a0c3a3 avcodec_send_packet+0xf3 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)

I am going to try to figure out what the last stack trace is before the thread gets descheduled; perhaps that will shed some light on this issue.

Let me know if you have any insights here.

I figured out one issue here: I was using FFMPEG threads as well as CUDA to decode the video, and that was causing gaps in the timeline (because I was only looking at the main thread).

The following is the timeline with a single FFMPEG thread:

There are a few interesting points here:

  1. There are two NVTX bars in the profile. The one above has events with GPU and Context columns, while the one below does not.
  2. The events in the top NVTX bar (the one with GPU and Context columns) are quite fast, measured in microseconds, while the events in the bottom bar are very long, measured in milliseconds. So the GPU time for these blocks is actually quite low compared to the CPU time, though the overall walltime is large. I am guessing it's doing some unnecessary synchronization, but I am not sure where or how to remove it.
  3. I am still not able to figure out why avcodec_send_packet is so slow.
  4. GPU utilization is quite low (under 5%). I would like to get better utilization. I am guessing most of the time is spent in the transfer from CPU to GPU, and it's not pipelined for some reason, so we are paying the latency cost for every frame. How can we speed this up?
  5. I still don't see any H2D transfers in the nsys profile. I am guessing the transfers happen inside cuvidDecodePicture and cuvidMapVideoFrame64, but they don't show up separately.
  6. Decoding the first frame is slow, which is perhaps expected. Another interesting pattern is that some frame decodes are very fast (400 us) while others are quite slow (17 ms), and they seem to be interleaved: fast, slow, fast, slow (see diagram below).
  7. If I run multiple processes of the decoder, the overall GPU utilization goes up. That tells me there is plenty of capacity on the GPU, but we are not pipelining things optimally per decode.
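Point 7 can be sketched as follows. The clip generation and software decode make it runnable anywhere (file names and process count are arbitrary); on an NVDEC machine, adding `-c:v h264_cuvid` before each `-i` would exercise the hardware decoder instead:

```shell
# Generate a short test clip (assumes an ffmpeg build with libx264).
ffmpeg -y -loglevel error -f lavfi -i testsrc2=duration=2:size=320x240:rate=25 \
       -c:v libx264 clip.mp4
# Run several independent decoders in parallel; each process hides the
# others' per-frame latency, which is why aggregate utilization rises.
for i in 1 2 3 4; do
  ffmpeg -loglevel error -i clip.mp4 -f null - &
done
wait
```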

Fast and slow frames zoomed in:

Are you decoding in real time? Because if you were to actually decode using -f null in ffmpeg, it would show 100% utilisation in the Video Decode tab of the Windows process manager, on complex enough files.

Try this: download the Annex B file from here: DXVA/D3D11 & H264 smeared/corrupted frame [$100 awarded] · Issue #923 · wang-bin/QtAV · GitHub

And then use ffmpeg.exe -c:v h264_cuvid -i error_4942.264 -f md5 -

It will be 100% utilised.

Or try
ffmpeg.exe -c:v h264_cuvid -i error_4942.264 -f null NUL