Debugging slow NVDEC h264 decoding using FFmpeg -- time is spent in avcodec_send_packet()

Hi all,

I am using NVDEC with FFmpeg to decode videos, and I am seeing decode times that are slower than multi-core CPU decoding.

It appears that most of the time is spent in avcodec_send_packet. I captured a trace using nsys and attached is a screenshot demonstrating the issue.

The code looks as follows:

void decodeOneFrame() {
  nvtx3::scoped_range function_range{"decodeOneFrame"};
  while (true) {
    int status = avcodec_receive_frame(...);
    if (status == 0) break;  // 0 means a frame was returned; we are done.
    ... = av_read_frame(...);  // Read the next packet from the file
    {
      nvtx3::scoped_range send_packet_range{"avcodec_send_packet"};
      avcodec_send_packet(...);
    }
  }
}
// Do the color conversion from NV12 to RGB on the device itself using NPP.
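For reference, the NV12 -> RGB step boils down to a single NPP call, roughly like the sketch below (simplified, with assumed variable names; not the exact code in the repo):

#include <nppi.h>

// Simplified sketch: convert a decoded NV12 surface (Y plane + interleaved UV plane)
// to packed RGB on the device. yPlane/uvPlane/rgbOut are device pointers;
// yStride/rgbStride are row pitches in bytes.
NppStatus nv12ToRgb(const Npp8u* yPlane, const Npp8u* uvPlane, int yStride,
                    Npp8u* rgbOut, int rgbStride, int width, int height) {
  const Npp8u* src[2] = {yPlane, uvPlane};
  NppiSize roi = {width, height};
  return nppiNV12ToRGB_8u_P2C3R(src, yStride, rgbOut, rgbStride, roi);
}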

The profile looks like the following:

For some reason avcodec_send_packet takes a long time, and only at the very end of the function do I see calls to cuvidMapVideoFrame64, ConvertNV12BLtoNV12, cuMemcpy2DAsync twice (once for Y and once for UV, I assume), and ForEachPixelNaiveBy4 (the color conversion, I assume).

All of these functions take a fraction of the time compared to avcodec_send_packet. Why is that the case? I also don’t see any interesting memory copies or kernels on the GPU timeline during 90% of the time of avcodec_send_packet(); only the last 10% of avcodec_send_packet() actually does device memcopies and runs device kernels. How do I make this faster?

Note that I am profiling with:

nsys profile --trace nvvideo,cuda,nvtx

Am I missing something? How do I make this video decoder go faster? What is the cuvid h264 codec doing in the first 90% of avcodec_send_packet()?

EDIT: It appears I can’t see any H2D memcpys even though I can see D2D memcpys and memsets. Maybe the “whitespace” in avcodec_send_packet is due to H2D transfers that are not captured by the profiler?

Are you joking? It is parsing the SEI, SPS, and PPS and transforming the stream to Annex B, as the legacy (well, still not deprecated) h264_cuvid requires Annex B.

Try this: ffmpeg -i file.mp4 -c copy file.h264
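For reference, that remux is the same transform the h264_mp4toannexb bitstream filter performs. If you are calling the libraries directly, it looks roughly like this (sketch with assumed variable names, error handling omitted):

// Feed MP4/AVCC packets through h264_mp4toannexb before the decoder.
const AVBitStreamFilter* filter = av_bsf_get_by_name("h264_mp4toannexb");
AVBSFContext* bsfCtx = nullptr;
av_bsf_alloc(filter, &bsfCtx);
avcodec_parameters_copy(bsfCtx->par_in, videoStream->codecpar);  // videoStream: your AVStream*
av_bsf_init(bsfCtx);

// Per packet: convert, then hand the Annex B packet to the decoder.
av_bsf_send_packet(bsfCtx, pkt);
while (av_bsf_receive_packet(bsfCtx, pkt) == 0) {
  avcodec_send_packet(codecCtx, pkt);
  // ... drain frames with avcodec_receive_frame ...
}

As far as I know h264_cuvid applies this filter internally, which is part of the work you see inside avcodec_send_packet.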

Thank you for the reply @val.zapod.vz

My question was actually specific to the large amount of idle time that I see inside avcodec_send_packet where seemingly no work is done on the CPU or GPU (including memcopies). This could be an artifact of profiling itself, or perhaps I am doing something wrong here.

Here is a diagram to clarify my question. Please look at the text in red: “What is happening in this window?”

The main CPU thread is only running for a short amount of time. It is actually descheduled for most of the time as seen in the figure.

I also collected some stack traces, and in many of the sampled stacks the thread appears to be servicing page faults.


pt_main_thread 3945124 2835920.743929:     541947 cycles: 
        ffffffff813657f7 clear_page_erms+0x7 (/boot/vmlinux-5.19.0-)
        ffffffff81019719 __folio_alloc+0xd99 (/boot/vmlinux-5.19.0-)
        ffffffff8108a8c7 vma_alloc_folio+0x217 (/boot/vmlinux-5.19.0-)
        ffffffff81087769 handle_mm_fault+0x559 (/boot/vmlinux-5.19.0-)
        ffffffff8107e31a do_user_addr_fault+0x1ba (/boot/vmlinux-5.19.0-)
        ffffffff81da9bdf exc_page_fault+0x5f (/boot/vmlinux-5.19.0-)
        ffffffff81e00b62 asm_exc_page_fault+0x22 (/boot/vmlinux-5.19.0-)
            7f5d2e56b47a __memset_evex_unaligned_erms+0x7a (/usr/lib64/libc.so.6)
            7f5c31afa2a9 ff_h264_alloc_tables+0x79 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31af043e h264_slice_header_init+0x2ee (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31af1d38 ff_h264_update_thread_context+0x2d8 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31cf097e update_context_from_thread+0x27e (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31cf0f44 ff_thread_decode_frame+0x224 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31a0bf82 decode_receive_frame_internal+0x4a2 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31a0c3a3 avcodec_send_packet+0xf3 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)

I am going to try to figure out what the last stack trace is before the thread gets descheduled; perhaps that will shed some light on this issue.

Let me know if you have any insights here.

I figured out one issue here – I was using FFmpeg threads as well as CUDA to decode the video, and that was causing gaps in the timeline (because I was only looking at the main thread).
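For anyone trying the same thing, limiting libavcodec to a single worker thread is just a field on the codec context before avcodec_open2. A minimal sketch (assumed variable names, not the exact torchcodec code):

// Force single-threaded libavcodec so NVDEC is the only concurrency in play.
// codec: e.g. the AVCodec* returned by avcodec_find_decoder_by_name("h264_cuvid").
AVCodecContext* codecCtx = avcodec_alloc_context3(codec);
avcodec_parameters_to_context(codecCtx, videoStream->codecpar);
codecCtx->thread_count = 1;  // no frame/slice worker threads
avcodec_open2(codecCtx, codec, nullptr);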

Following is the timeline with a single FFmpeg thread:

There are a few interesting points here:

  1. There are 2 NVTX bars in the profile. The one above has events with GPU and Context columns while the one below does not.
  2. The NVTX bar at the top has GPU and Context columns, and its events are quite fast (measured in microseconds). The NVTX bar at the bottom has very long events (measured in milliseconds). So the GPU time for these blocks is actually quite low compared to the CPU time; the overall wall time is large, though. I am guessing there is some unnecessary synchronization, but I am not sure where it is or how to remove it.
  3. I am still not able to figure out why avcodec_send_packet is so slow.
  4. GPU utilization is quite low – under 5%. I would like to get better utilization. I am guessing most of the time is just spent in the transfer from CPU to GPU and it’s not pipelined for some reason – we are paying the latency cost for every frame. How can we speed this up?
  5. I still don’t see any H2D transfers in the nsys profile. I am guessing this is part of cuvidDecodePicture and cuvidMapVideoFrame64, but it doesn’t show up separately.
  6. Decoding the first frame is slow. That is perhaps expected. Another interesting pattern is that some frame decodes are very fast (400us) while others are quite slow (17ms). They seem to be interleaved – fast, slow, fast, slow (see diagram below).
  7. If I run multiple processes of the decoder the overall GPU utilization goes up. That tells me there is plenty of capacity in the GPU but we are not pipelining things optimally per decode.

Fast and slow frames zoomed in:

Are you decoding in real time?? Because if you actually decode using -f null in ffmpeg, it will show 100% utilisation in the Video Decode tab in the Windows process manager, on complex enough files.

Try this: download the Annex B file from here: DXVA/D3D11 & H264 smeared/corrupted frame [$100 awarded] · Issue #923 · wang-bin/QtAV · GitHub

And then use ffmpeg.exe -c:v h264_cuvid -i error_4942.264 -f md5 -

It will be 100% utilised.

Or try
ffmpeg.exe -c:v h264_cuvid -i error_4942.264 -f null NUL

I made a typo; the correct command is:

ffmpeg.exe -c:v h264_cuvid -i example.h264 -f null NUL

Then it is 100% utilised, and maximum energy is used when this is done. nvidia-smi reports P2 and 67 W used.

Thank you for the replies @val.zapod.vz

A few things to note:

  1. I am not using ffmpeg as a binary – I am using ffmpeg as a library.
  2. I am using Linux, not Windows, so D3D is not available.

I am actually using this library:

The repro instructions are as follows:

conda create --name repro
conda activate repro

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia --yes

# 2. Install prereqs for FFmpeg GPU decoding:
# Reference: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.0/ffmpeg-with-nvidia-gpu/index.html
git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
cd nv-codec-headers
make PREFIX=$CONDA_PREFIX install

# 3. Install FFmpeg:
git clone https://github.com/FFmpeg/FFmpeg.git
cd FFmpeg
git checkout origin/release/6.1
./configure --enable-nonfree --enable-cuda-nvcc --enable-libnpp --prefix=$CONDA_PREFIX
make -j install

# Install build dependencies of torchcodec
conda install cmake pkg-config --yes

# Build and install torchcodec from source with CUDA enabled
git clone https://github.com/pytorch/torchcodec.git

cd torchcodec
CMAKE_BUILD_PARALLEL_LEVEL=8 CXXFLAGS="" CMAKE_BUILD_TYPE=Release ENABLE_CUDA=1 ENABLE_NVTX=1 pip install -e . --no-build-isolation -vv --debug

# Run the benchmark to test it out
python ./benchmarks/decoders/gpu_benchmark.py --video=<path to mp4> --devices=cuda:0 --resize_devices=none

# Run the benchmark under nsys
nsys profile --trace nvvideo,cuda,nvtx $(which python) ./benchmarks/decoders/gpu_benchmark.py --video=<path to mp4> --devices=cuda:0 --resize_devices=none

nsys will now produce the trace that you see in the screenshot above.

I suspect that the code is not pipelining something here. We are sending one packet at a time to the hardware. Potentially we could speed things up by pipelining and having more packets in flight so the CPU->GPU latency is hidden.
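Something like the sketch below, for example (a sketch only, assuming formatCtx, codecCtx, pkt, and frame are already set up; I have not verified how many packets h264_cuvid will actually queue before returning EAGAIN):

// Keep the decoder fed: push packets until it pushes back with EAGAIN, then drain
// whatever frames are ready, instead of strictly alternating one send with one receive.
// (End-of-stream flush with avcodec_send_packet(codecCtx, nullptr) omitted.)
bool done = false;
bool havePendingPkt = false;
while (!done) {
  // Fill the decoder's input queue.
  while (true) {
    if (!havePendingPkt) {
      if (av_read_frame(formatCtx, pkt) < 0) { done = true; break; }
      havePendingPkt = true;
    }
    int ret = avcodec_send_packet(codecCtx, pkt);
    if (ret == AVERROR(EAGAIN)) break;  // decoder is full; keep pkt for the next round
    av_packet_unref(pkt);
    havePendingPkt = false;
  }
  // Drain every frame that is already decoded.
  while (avcodec_receive_frame(codecCtx, frame) == 0) {
    // ... cuvidMapVideoFrame64 / NV12 -> RGB conversion happens downstream ...
  }
}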

The strange thing is that the CPU->GPU transfer is not even shown on nsys – it seems like cuvidMapVideoFrame64 includes that transfer latency and cannot break it down further, so visually it’s hard to tell what is going on.

If someone from Nvidia can audit the code here, that would be amazing:

Setting up the CUDA decoding happens here:

avcodec_send_packet happens here:

We are doing single-threaded demuxing, but I suspect that’s not the actual issue. I strongly suspect having more packets in flight to the decoder could speed things up here, but I don’t see a way to do that in ffmpeg if it creates the device itself. Perhaps we need to create the hw device outside ffmpeg and give it to ffmpeg?
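For the record, the standard way to do that (as far as I understand it) is av_hwdevice_ctx_create plus hw_device_ctx on the codec context. A sketch (assumed variable names, error handling omitted):

// Create the CUDA device context ourselves and hand it to libavcodec,
// instead of letting the decoder create its own. Must be set before avcodec_open2.
AVBufferRef* hwDeviceCtx = nullptr;
av_hwdevice_ctx_create(&hwDeviceCtx, AV_HWDEVICE_TYPE_CUDA, "0" /* GPU index */, nullptr, 0);
codecCtx->hw_device_ctx = av_buffer_ref(hwDeviceCtx);
avcodec_open2(codecCtx, codec, nullptr);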

Any insights would be appreciated.

“I am using Linux, not Windows, so D3D is not available.”

First of all, it is; second, what does that have to do with anything? My examples use cuvid without d3d11va.

“I am not using ffmpeg as a binary – I am using ffmpeg as a library.”

Ffmpeg.exe, the application, is also using ffmpeg as a library.

Since it is kind of long-winded to reproduce this, I wrote a small script to help with reproducibility/debugging:

# 1. Create a conda env
conda create --name tcgpu1 --yes
conda activate tcgpu1

# 2. Install general pre-reqs
conda install conda libmamba --yes
$CONDA_PREFIX/bin/conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia --yes
$CONDA_PREFIX/bin/conda install nasm --yes
$CONDA_PREFIX/bin/conda install conda-forge::compilers --yes
$CONDA_PREFIX/bin/conda install pkg-config --yes
$CONDA_PREFIX/bin/conda install nvidia/label/cuda-12.4.0::cuda-toolkit --yes
$CONDA_PREFIX/bin/conda install nvidia/label/cuda-12.4.0::cuda-cudart --yes

# 3. Install nvidia video pre-reqs
TMPPATH=$(mktemp -d)
cd $TMPPATH
git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
cd nv-codec-headers
make PREFIX=$CONDA_PREFIX install


cd $TMPPATH
git clone https://github.com/FFmpeg/FFmpeg.git
cd FFmpeg

./configure --prefix=$CONDA_PREFIX --enable-nonfree --enable-cuda-nvcc --enable-libnpp --disable-static --enable-shared --optflags=-fno-omit-frame-pointer --disable-stripping
make -j install

# 4. Build and install torchcodec
cd $TMPPATH
git clone https://github.com/pytorch/torchcodec.git
cd torchcodec
git checkout origin/cuda_support
find . -name \*.so | xargs rm -rf; CMAKE_BUILD_PARALLEL_LEVEL=8 CXXFLAGS="" CMAKE_BUILD_TYPE=Release ENABLE_CUDA=1 ENABLE_NVTX=1 pip install -e . --no-build-isolation -vv --debug

# 5. Now you can run the GPU benchmark.
# Just for testing, run it on an existing small video that's part of the repo:
python benchmarks/decoders/gpu_benchmark.py --resize_devices=none

# 6. Now run it with the nsys profiler on a HD video.
nsys profile -t cuda,nvtx,osrt,nvvideo --gpu-video-device=all --event-sample system-wide --sampling-period=281250 -s system-wide python benchmarks/decoders/gpu_benchmark.py --resize_devices=none --devices=cuda:0 --video 853.mp4

The last command above produces an nsys profile (renamed as a .log file) that I am attaching to this post.

What’s notable is that cuvidMapVideoFrame64 takes 11ms or so periodically here:

The high cuvidMapVideoFrame64 time is what I am trying to reduce here.

I am attaching the profile in case NVDEC experts want to look at it. Note that this forum doesn’t accept nsys-rep files as attachments so I renamed it to .nsys-rep.log.

All the code is in the torchcodec repo in the cuda_support branch. Let me know if there is more data I can provide here.

report3.nsys-rep.log (1.3 MB)