Debugging slow NVDEC h264 decoding using FFmpeg -- time is spent in avcodec_send_packet()

Hi all,

I am using NVDEC with FFmpeg to decode videos, and I am seeing decode times that are slower than multi-core CPU decoding.

It appears that most of the time is spent in avcodec_send_packet. I captured a trace using nsys and attached is a screenshot demonstrating the issue.

The code looks as follows:

void decodeOneFrame() {
  nvtx3::scoped_range function_range{"decodeOneFrame"};
  while (true) {
    int status = avcodec_receive_frame(...);
    if (status == 0) break;  // 0 means a frame was returned; we are done.
    ... = av_read_frame(...);  // Read the next packet from the file
    {
      nvtx3::scoped_range send_packet_range{"avcodec_send_packet"};
      avcodec_send_packet(...);
    }
  }
}
// Do the color conversion from NV12 to RGB on the device itself using NPP.
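For reference, the NV12 -> RGB step boils down to a single NPP call, roughly like the sketch below (simplified, with assumed variable names; not the exact code in the repo):

#include <nppi.h>

// Simplified sketch: convert a decoded NV12 surface (Y plane + interleaved UV plane)
// to packed RGB on the device. yPlane/uvPlane/rgbOut are device pointers;
// yStride/rgbStride are row pitches in bytes.
NppStatus nv12ToRgb(const Npp8u* yPlane, const Npp8u* uvPlane, int yStride,
                    Npp8u* rgbOut, int rgbStride, int width, int height) {
  const Npp8u* src[2] = {yPlane, uvPlane};
  NppiSize roi = {width, height};
  return nppiNV12ToRGB_8u_P2C3R(src, yStride, rgbOut, rgbStride, roi);
}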

The profile looks like the following:

For some reason avcodec_send_packet takes a long time, and only at the very end of the function do I see calls to cuvidMapVideoFrame64, ConvertNV12BLtoNV12, cuMemcpy2DAsync twice (once for Y and once for UV, I assume), and ForEachPixelNaiveBy4 (the color conversion, I assume).

All of these functions take a fraction of the time compared to avcodec_send_packet. Why is that the case? I also don’t see any interesting memory copies or kernels on the GPU timeline during 90% of the time of avcodec_send_packet(); only the last 10% of avcodec_send_packet() actually does device memcopies and runs device kernels. How do I make this faster?

Note that I am profiling with:

nsys profile --trace nvvideo,cuda,nvtx

Am I missing something? How do I make this video decoder go faster? What is the cuvid h264 codec doing in the first 90% of avcodec_send_packet()?

EDIT: It appears I can’t see any H2D memcpys even though I can see D2D memcpys and memsets. Maybe the “whitespace” in avcodec_send_packet is due to H2D transfers that are not captured by the profiler?

Are you joking? It is parsing the SEI, SPS, and PPS and transforming the stream to Annex B, as the legacy (well, still not deprecated) h264_cuvid requires Annex B.

Try this: ffmpeg -i file.mp4 -c copy file.h264
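For reference, that remux is the same transform the h264_mp4toannexb bitstream filter performs. If you are calling the libraries directly, it looks roughly like this (sketch with assumed variable names, error handling omitted):

// Feed MP4/AVCC packets through h264_mp4toannexb before the decoder.
const AVBitStreamFilter* filter = av_bsf_get_by_name("h264_mp4toannexb");
AVBSFContext* bsfCtx = nullptr;
av_bsf_alloc(filter, &bsfCtx);
avcodec_parameters_copy(bsfCtx->par_in, videoStream->codecpar);  // videoStream: your AVStream*
av_bsf_init(bsfCtx);

// Per packet: convert, then hand the Annex B packet to the decoder.
av_bsf_send_packet(bsfCtx, pkt);
while (av_bsf_receive_packet(bsfCtx, pkt) == 0) {
  avcodec_send_packet(codecCtx, pkt);
  // ... drain frames with avcodec_receive_frame ...
}

As far as I know h264_cuvid applies this filter internally, which is part of the work you see inside avcodec_send_packet.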

Thank you for the reply @val.zapod.vz

My question was actually specific to the large amount of idle time that I see inside avcodec_send_packet where seemingly no work is done on the CPU or GPU (including memcopies). This could be an artifact of profiling itself, or perhaps I am doing something wrong here.

Here is a diagram to clarify my question. Please look at the text in red: “What is happening in this window?”

The main CPU thread is only running for a short amount of time. It is actually descheduled for most of the time as seen in the figure.

I also collected some stack traces, and in many of the sampled stacks the thread appears to be servicing page faults.


pt_main_thread 3945124 2835920.743929:     541947 cycles: 
        ffffffff813657f7 clear_page_erms+0x7 (/boot/vmlinux-5.19.0-)
        ffffffff81019719 __folio_alloc+0xd99 (/boot/vmlinux-5.19.0-)
        ffffffff8108a8c7 vma_alloc_folio+0x217 (/boot/vmlinux-5.19.0-)
        ffffffff81087769 handle_mm_fault+0x559 (/boot/vmlinux-5.19.0-)
        ffffffff8107e31a do_user_addr_fault+0x1ba (/boot/vmlinux-5.19.0-)
        ffffffff81da9bdf exc_page_fault+0x5f (/boot/vmlinux-5.19.0-)
        ffffffff81e00b62 asm_exc_page_fault+0x22 (/boot/vmlinux-5.19.0-)
            7f5d2e56b47a __memset_evex_unaligned_erms+0x7a (/usr/lib64/libc.so.6)
            7f5c31afa2a9 ff_h264_alloc_tables+0x79 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31af043e h264_slice_header_init+0x2ee (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31af1d38 ff_h264_update_thread_context+0x2d8 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31cf097e update_context_from_thread+0x27e (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31cf0f44 ff_thread_decode_frame+0x224 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31a0bf82 decode_receive_frame_internal+0x4a2 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)
            7f5c31a0c3a3 avcodec_send_packet+0xf3 (/home/user/.conda/envs/tc/lib/libavcodec.so.60.31.102)

I am going to try to figure out what the last stack trace is before the thread gets descheduled; perhaps that will shed some light on this issue.

Let me know if you have any insights here.

I figured out one issue here – I was using FFmpeg threads as well as CUDA to decode the video, and that was causing gaps in the timeline (because I was only looking at the main thread).
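For anyone trying the same thing, limiting libavcodec to a single worker thread is just a field on the codec context before avcodec_open2. A minimal sketch (assumed variable names, not the exact torchcodec code):

// Force single-threaded libavcodec so NVDEC is the only concurrency in play.
// codec: e.g. the AVCodec* returned by avcodec_find_decoder_by_name("h264_cuvid").
AVCodecContext* codecCtx = avcodec_alloc_context3(codec);
avcodec_parameters_to_context(codecCtx, videoStream->codecpar);
codecCtx->thread_count = 1;  // no frame/slice worker threads
avcodec_open2(codecCtx, codec, nullptr);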

Following is the timeline with a single FFmpeg thread:

There are a few interesting points here:

  1. There are 2 NVTX bars in the profile. The one above has events with GPU and Context columns while the one below does not.
  2. The NVTX bar at the top has GPU and Context columns, and its events are quite fast (measured in microseconds). The NVTX bar at the bottom has very long events (measured in milliseconds). So the GPU time for these blocks is actually quite low compared to the CPU time; the overall wall time is large, though. I am guessing there is some unnecessary synchronization, but I am not sure where it is or how to remove it.
  3. I am still not able to figure out why avcodec_send_packet is so slow.
  4. GPU utilization is quite low – under 5%. I would like to get better utilization. I am guessing most of the time is just spent in the transfer from CPU to GPU and it’s not pipelined for some reason – we are paying the latency cost for every frame. How can we speed this up?
  5. I still don’t see any H2D transfers in the nsys profile. I am guessing this is part of cuvidDecodePicture and cuvidMapVideoFrame64, but it doesn’t show up separately.
  6. Decoding the first frame is slow. That is perhaps expected. Another interesting pattern is that some frame decodes are very fast (400us) while others are quite slow (17ms). They seem to be interleaved – fast, slow, fast, slow (see diagram below).
  7. If I run multiple processes of the decoder the overall GPU utilization goes up. That tells me there is plenty of capacity in the GPU but we are not pipelining things optimally per decode.

Fast and slow frames zoomed in:

Are you decoding in real time?? Because if you actually decode using -f null in ffmpeg, it will show 100% utilisation in the Video Decode tab in the Windows process manager, on complex enough files.

Try this: download the Annex B file from here: DXVA/D3D11 & H264 smeared/corrupted frame [$100 awarded] · Issue #923 · wang-bin/QtAV · GitHub

And then use ffmpeg.exe -c:v h264_cuvid -i error_4942.264 -f md5 -

It will be 100% utilised.

Or try
ffmpeg.exe -c:v h264_cuvid -i error_4942.264 -f null NUL

I made a typo; the correct command is:

ffmpeg.exe -c:v h264_cuvid -i example.h264 -f null NUL

Then it is 100% utilised, and maximum energy is used when this is done. nvidia-smi reports P2 and 67 W used.

Thank you for the replies @val.zapod.vz

A few things to note:

  1. I am not using ffmpeg as a binary – I am using ffmpeg as a library.
  2. I am using Linux, not Windows, so D3D is not available.

I am actually using this library:

The repro instructions are as follows:

conda create --name repro
conda activate repro

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia --yes

# 2. Install prereqs for FFmpeg GPU decoding:
# Reference: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.0/ffmpeg-with-nvidia-gpu/index.html
git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
cd nv-codec-headers
make PREFIX=$CONDA_PREFIX install

# 3. Install FFmpeg:
git clone https://github.com/FFmpeg/FFmpeg.git
cd FFmpeg
git checkout origin/release/6.1
./configure --enable-nonfree --enable-cuda-nvcc --enable-libnpp --prefix=$CONDA_PREFIX
make -j install

# Install build dependencies of torchcodec
conda install cmake pkg-config --yes

# Build and install torchcodec from source with CUDA enabled
git clone https://github.com/pytorch/torchcodec.git

cd torchcodec
CMAKE_BUILD_PARALLEL_LEVEL=8 CXXFLAGS="" CMAKE_BUILD_TYPE=Release ENABLE_CUDA=1 ENABLE_NVTX=1 pip install -e . --no-build-isolation -vv --debug

# Run the benchmark to test it out
python ./benchmarks/decoders/gpu_benchmark.py --video=<path to mp4> --devices=cuda:0 --resize_devices=none

# Run the benchmark under nsys
nsys profile --trace nvvideo,cuda,nvtx $(which python) ./benchmarks/decoders/gpu_benchmark.py --video=<path to mp4> --devices=cuda:0 --resize_devices=none

nsys will now produce the trace that you see in the screenshot above.

I suspect that the code is not pipelining something here. We are sending one packet at a time to the hardware. Potentially we could speed things up by pipelining and having more packets in flight so the CPU->GPU latency is hidden.
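Something like the sketch below, for example (a sketch only, assuming formatCtx, codecCtx, pkt, and frame are already set up; I have not verified how many packets h264_cuvid will actually queue before returning EAGAIN):

// Keep the decoder fed: push packets until it pushes back with EAGAIN, then drain
// whatever frames are ready, instead of strictly alternating one send with one receive.
// (End-of-stream flush with avcodec_send_packet(codecCtx, nullptr) omitted.)
bool done = false;
bool havePendingPkt = false;
while (!done) {
  // Fill the decoder's input queue.
  while (true) {
    if (!havePendingPkt) {
      if (av_read_frame(formatCtx, pkt) < 0) { done = true; break; }
      havePendingPkt = true;
    }
    int ret = avcodec_send_packet(codecCtx, pkt);
    if (ret == AVERROR(EAGAIN)) break;  // decoder is full; keep pkt for the next round
    av_packet_unref(pkt);
    havePendingPkt = false;
  }
  // Drain every frame that is already decoded.
  while (avcodec_receive_frame(codecCtx, frame) == 0) {
    // ... cuvidMapVideoFrame64 / NV12 -> RGB conversion happens downstream ...
  }
}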

The strange thing is that the CPU->GPU transfer is not even shown on nsys – it seems like cuvidMapVideoFrame64 includes that transfer latency and cannot break it down further, so visually it’s hard to tell what is going on.

If someone from Nvidia can audit the code here, that would be amazing:

Setting up the CUDA decoding happens here:

avcodec_send_packet happens here:

We are doing single-threaded demuxing, but I suspect that’s not the actual issue. I strongly suspect having more packets in flight to the decoder could speed things up here, but I don’t see a way to do that in ffmpeg if it creates the device itself. Perhaps we need to create the hw device outside ffmpeg and give it to ffmpeg?
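For the record, the standard way to do that (as far as I understand it) is av_hwdevice_ctx_create plus hw_device_ctx on the codec context. A sketch (assumed variable names, error handling omitted):

// Create the CUDA device context ourselves and hand it to libavcodec,
// instead of letting the decoder create its own. Must be set before avcodec_open2.
AVBufferRef* hwDeviceCtx = nullptr;
av_hwdevice_ctx_create(&hwDeviceCtx, AV_HWDEVICE_TYPE_CUDA, "0" /* GPU index */, nullptr, 0);
codecCtx->hw_device_ctx = av_buffer_ref(hwDeviceCtx);
avcodec_open2(codecCtx, codec, nullptr);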

Any insights would be appreciated.

“I am using Linux, not Windows, so D3D is not available.”

First of all, it is; second, what does that have to do with anything? My examples use cuvid without d3d11va.

“I am not using ffmpeg as a binary – I am using ffmpeg as a library.”

Ffmpeg.exe, the application, is also using ffmpeg as a library.

Since it is kind of long-winded to reproduce this, I wrote a small script to help with reproducibility/debugging:

# 1. Create a conda env
conda create --name tcgpu1 --yes
conda activate tcgpu1

# 2. Install general pre-reqs
conda install conda libmamba --yes
$CONDA_PREFIX/bin/conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia --yes
$CONDA_PREFIX/bin/conda install nasm --yes
$CONDA_PREFIX/bin/conda install conda-forge::compilers --yes
$CONDA_PREFIX/bin/conda install pkg-config --yes
$CONDA_PREFIX/bin/conda install nvidia/label/cuda-12.4.0::cuda-toolkit --yes
$CONDA_PREFIX/bin/conda install nvidia/label/cuda-12.4.0::cuda-cudart --yes

# 3. Install nvidia video pre-reqs
TMPPATH=$(mktemp -d)
cd $TMPPATH
git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
cd nv-codec-headers
make PREFIX=$CONDA_PREFIX install


cd $TMPPATH
git clone https://github.com/FFmpeg/FFmpeg.git
cd FFmpeg

./configure --prefix=$CONDA_PREFIX --enable-nonfree --enable-cuda-nvcc --enable-libnpp --disable-static --enable-shared --optflags=-fno-omit-frame-pointer --disable-stripping
make -j install

# 4. Build and install torchcodec
cd $TMPPATH
git clone https://github.com/pytorch/torchcodec.git
cd torchcodec
git checkout origin/cuda_support
find . -name \*.so | xargs rm -rf; CMAKE_BUILD_PARALLEL_LEVEL=8 CXXFLAGS="" CMAKE_BUILD_TYPE=Release ENABLE_CUDA=1 ENABLE_NVTX=1 pip install -e . --no-build-isolation -vv --debug

# 5. Now you can run the GPU benchmark.
# Just for testing, run it on an existing small video that's part of the repo:
python benchmarks/decoders/gpu_benchmark.py --resize_devices=none

# 6. Now run it with the nsys profiler on a HD video.
nsys profile -t cuda,nvtx,osrt,nvvideo --gpu-video-device=all --event-sample system-wide --sampling-period=281250 -s system-wide python benchmarks/decoders/gpu_benchmark.py --resize_devices=none --devices=cuda:0 --video 853.mp4

The last command above produces an nsys profile (renamed as a .log file) that I am attaching to this post.

What’s notable is that cuvidMapVideoFrame64 takes 11ms or so periodically here:

The high cuvidMapVideoFrame64 time is what I am trying to reduce here.

I am attaching the profile in case NVDEC experts want to look at it. Note that this forum doesn’t accept nsys-rep files as attachments so I renamed it to .nsys-rep.log.

All the code is in the torchcodec repo in the cuda_support branch. Let me know if there is more data I can provide here.

report3.nsys-rep.log (1.3 MB)