I am using NVDEC with FFMPEG to decode videos and I am seeing decoding times that are slower than multicore CPU decoding.
It appears that most of the time is spent in avcodec_send_packet. I captured a trace using nsys and attached is a screenshot demonstrating the issue.
The code looks roughly as follows:
void decodeOneFrame() {
  nvtx3::scoped_range function_range{"decodeOneFrame"};
  while (true) {
    int status = avcodec_receive_frame(...);
    if (status == 0) break;      // A decoded frame was produced; we are done.
    ... = av_read_frame(...);    // Read the next packet from the file.
    {
      nvtx3::scoped_range packet_range{"avcodec_send_packet"};
      avcodec_send_packet(...);
    }
  }
}
// The color conversion from NV12 to RGB is then done on the device itself using NPP.
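For completeness, here is a fleshed-out version of the pseudocode above, following the standard FFmpeg send/receive pattern with error handling. This is a minimal sketch, not the exact torchcodec code; formatContext, codecContext, streamIndex, and frame are placeholders for the real objects.

extern "C" {
#include <libavcodec/avcodec.h>
#include <libavformat/avformat.h>
}

// Sketch only: decode a single frame using the standard send/receive loop.
int decodeOneFrame(AVFormatContext* formatContext, AVCodecContext* codecContext,
                   int streamIndex, AVFrame* frame) {
  AVPacket* packet = av_packet_alloc();
  int result = -1;
  while (true) {
    // Try to pull a decoded frame first; AVERROR(EAGAIN) means "feed me more input".
    int status = avcodec_receive_frame(codecContext, frame);
    if (status == 0) {
      result = 0;  // Got a frame.
      break;
    }
    if (status != AVERROR(EAGAIN)) break;  // EOF or a real error.

    // Read the next packet from the file, skipping packets from other streams.
    if (av_read_frame(formatContext, packet) < 0) break;  // (A full version would flush with a NULL packet here.)
    if (packet->stream_index != streamIndex) {
      av_packet_unref(packet);
      continue;
    }
    status = avcodec_send_packet(codecContext, packet);
    av_packet_unref(packet);
    if (status < 0) break;
  }
  av_packet_free(&packet);
  return result;
}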
For some reason avcodec_send_packet takes a long time, and only at the very end of the function do I see calls to cuvidMapVideoFrame64, ConvertNV12BLtoNV12, cuMemcpy2DAsync twice (one for Y and one for UV, I am assuming), and ForEachPixelNaiveBy4 (the color conversion, I am assuming).
All of these functions take a fraction of the time compared to avcodec_send_packet. Why is that the case? I also don’t see any interesting memory copies or kernels on the GPU timeline during the first 90% of avcodec_send_packet(); only the last 10% of avcodec_send_packet() actually does device memcopies and runs device kernels. How do I make this faster?
Note that I am profiling with:
nsys profile --trace nvvideo,cuda,nvtx
Am I missing something? How do I make this video decoder go faster? What is the cuvid h264 codec doing in the first 90% of avcodec_send_packet()?
EDIT: It appears I can’t see any H2D memcpys, even though I can see D2D memcpys and memsets. Maybe the “whitespace” in avcodec_send_packet is due to H2D transfers that are not captured by the profiler?
Are you joking? It is parsing the SEI, SPS, and PPS, and transforming the stream to Annex B, as the legacy (well, still not deprecated) h264_cuvid requires Annex B.
My question was actually specific to the large amount of idle time that I see inside avcodec_send_packet where seemingly no work is done on the CPU or GPU (including memcopies). This could be an artifact of profiling itself, or perhaps I am doing something wrong here.
Here is a diagram to clarify my question. Please look at the text in red: “What is happening in this window?”
I figured out one issue here – I was using FFMPEG threads as well as CUDA to decode the video, and that was causing gaps in the timeline (because I was only looking at the main thread).
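For reference, this is roughly what I changed to force a single decode thread, in case anyone else hits the same confusion; it is a sketch with placeholder names, not the exact torchcodec code:

// Sketch: force FFmpeg to use a single decode thread so all the decode work
// (and the NVTX ranges) land on the main thread in the nsys timeline.
// codecContext, codec, and options stand in for the real objects.
codecContext->thread_count = 1;  // Single decode thread.
codecContext->thread_type = 0;   // Disable frame and slice threading.
int status = avcodec_open2(codecContext, codec, &options);
if (status < 0) {
  // handle the error
}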
Following is the timeline with a single FFMPEG thread:
There are two NVTX bars in the profile. The one at the top has GPU and Context columns and its events are quite fast, measured in microseconds. The one at the bottom has no GPU or Context columns and its events are very long, measured in milliseconds. So the GPU time for these blocks is actually quite low compared to the CPU time; the overall wall time is large, though. I am guessing it is doing some unnecessary synchronization, but I am not sure where, or how to remove it.
I am still not able to figure out why avcodec_send_packet is so slow.
GPU utilization is quite low – under 5%. I would like to get better utilization. I am guessing most of the time is just spent in the transfer from CPU to GPU and it’s not pipelined for some reason – we are paying the latency cost for every frame. How can we speed this up?
I still don’t see any H2D transfers in the nsys profile. I am guessing this is part of cuvidDecodePicture and cuvidMapVideoFrame64, but it doesn’t show up separately.
Decoding the first frame is slow. That is perhaps expected. Another interesting pattern is that some frame decodes are very fast (400us) while others are quite slow (17ms). They seem to be interleaved – fast, slow, fast, slow (see diagram below).
If I run multiple processes of the decoder the overall GPU utilization goes up. That tells me there is plenty of capacity in the GPU but we are not pipelining things optimally per decode.
Are you decoding in real time? Because if you actually decode using -f null in ffmpeg, it will show 100% utilisation in the Video Decode tab of the Windows process manager, on complex enough files.
I am not using ffmpeg as a binary – I am using ffmpeg as a library.
I am using Linux, not Windows, so D3D is not available.
I am actually using this library:
The repro instructions are as follows:
# 1. Create a conda environment and install PyTorch nightly with CUDA:
conda create --name repro
conda activate repro
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia --yes
# 2. Install prereqs for FFMPEG GPU decoding:
# Reference: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.0/ffmpeg-with-nvidia-gpu/index.html
git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
cd nv-codec-headers
make PREFIX=$CONDA_PREFIX install
# 3. install FFMPEG:
git clone https://github.com/FFmpeg/FFmpeg.git
cd FFmpeg
git checkout origin/release/6.1
./configure --enable-nonfree --enable-cuda-nvcc --enable-libnpp --prefix=$CONDA_PREFIX
make -j install
# Install build dependencies of torchcodec
conda install cmake pkg-config --yes
# Build and install torchcodec from source with CUDA enabled
git clone https://github.com/pytorch/torchcodec.git
cd torchcodec
CMAKE_BUILD_PARALLEL_LEVEL=8 CXXFLAGS="" CMAKE_BUILD_TYPE=Release ENABLE_CUDA=1 ENABLE_NVTX=1 pip install -e . --no-build-isolation -vv --debug
# Run the benchmark to test it out
python ./benchmarks/decoders/gpu_benchmark.py --video=<path to mp4> --devices=cuda:0 --resize_devices=none
# Run the benchmark under nsys
nsys profile --trace nvvideo,cuda,nvtx $(which python) ./benchmarks/decoders/gpu_benchmark.py --video=<path to mp4> --devices=cuda:0 --resize_devices=none
nsys will now produce the trace that you see in the screenshot above.
I suspect that the code is not pipelining something here. We are sending one packet at a time to the hardware. Potentially we could speed things up by pipelining and having more packets in flight so the CPU->GPU latency is hidden.
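To make the idea concrete, here is a sketch of what I mean by keeping more packets in flight: keep calling avcodec_send_packet until it returns AVERROR(EAGAIN) (its input queue is full), and only then drain decoded frames, instead of strictly alternating one packet and one frame. The names are placeholders and I have not verified how many packets the cuvid wrapper will actually buffer this way:

// Sketch: feed the decoder until it refuses input, then drain output.
void decodeAll(AVFormatContext* formatContext, AVCodecContext* codecContext,
               int streamIndex, AVPacket* packet, AVFrame* frame) {
  bool pendingPacket = false;  // A packet was read but not yet accepted by the decoder.
  bool eof = false;
  while (true) {
    // Phase 1: send packets until the decoder's input queue is full.
    while (!eof) {
      if (!pendingPacket) {
        if (av_read_frame(formatContext, packet) < 0) {
          avcodec_send_packet(codecContext, nullptr);  // Signal EOF and start flushing.
          eof = true;
          break;
        }
        if (packet->stream_index != streamIndex) {
          av_packet_unref(packet);
          continue;
        }
        pendingPacket = true;
      }
      int status = avcodec_send_packet(codecContext, packet);
      if (status == AVERROR(EAGAIN)) break;  // Queue full; drain frames, then retry this packet.
      av_packet_unref(packet);
      pendingPacket = false;
      if (status < 0) return;
    }
    // Phase 2: drain whatever frames are ready.
    while (true) {
      int status = avcodec_receive_frame(codecContext, frame);
      if (status == AVERROR(EAGAIN)) break;  // Decoder needs more input.
      if (status < 0) return;                // AVERROR_EOF or a real error: done.
      // ... hand `frame` to the NPP NV12 -> RGB conversion here ...
      av_frame_unref(frame);
    }
    if (eof) return;
  }
}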
The strange thing is that the CPU->GPU transfer is not even shown on nsys – it seems like cuvidMapVideoFrame64 includes that transfer latency and cannot break it down further, so visually it’s hard to tell what is going on.
If someone from Nvidia can audit the code here, that would be amazing:
Setting up of the CUDA decoding happens here:
avcodec_send_packet happens here:
We are doing single-threaded demuxing, but I suspect that’s not the actual issue. I strongly suspect having more packets in flight to the decoder could speed things up here, but I don’t see a way to do that in ffmpeg if it creates the device itself. Perhaps we need to create the hw device outside ffmpeg and give it to ffmpeg?
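If creating the device outside FFmpeg turns out to be the way to go, my understanding is that it can be done with av_hwdevice_ctx_create and then attaching the resulting buffer to the codec context before avcodec_open2. A rough sketch of that, with placeholder names and the error handling trimmed:

extern "C" {
#include <libavcodec/avcodec.h>
#include <libavutil/hwcontext.h>
}

// Sketch: create the CUDA hardware device context ourselves and hand it to
// FFmpeg instead of letting FFmpeg create one internally.
AVBufferRef* attachCudaDevice(AVCodecContext* codecContext, const char* device /* e.g. "0" */) {
  AVBufferRef* hwDeviceCtx = nullptr;
  if (av_hwdevice_ctx_create(&hwDeviceCtx, AV_HWDEVICE_TYPE_CUDA, device,
                             /*opts=*/nullptr, /*flags=*/0) < 0) {
    return nullptr;
  }
  // The codec context holds its own reference; we keep ours so the same device
  // context can be reused across multiple decoder instances.
  codecContext->hw_device_ctx = av_buffer_ref(hwDeviceCtx);
  return hwDeviceCtx;
}

This would also let several decoders share one CUDA device context, which might help with the pipelining experiments.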
The high cuvidMapVideoFrame64 time is what I am trying to reduce here.
I am attaching the profile in case NVDEC experts want to look at it. Note that this forum doesn’t accept nsys-rep files as attachments so I renamed it to .nsys-rep.log.
All the code is in the torchcodec repo in the cuda_support branch. Let me know if there is more data I can provide here.