Decoder performance: gstreamer (nvv4l2decoder) vs ffmpeg (h264_nvv4l2dec)


What magic tricks or settings allow gstreamer’s nvv4l2decoder to outperform ffmpeg’s h264_nvv4l2dec by more than 2x in h264 1080p decoding?

The tests:

  • gst-launch-1.0 filesrc location=jellyfish-5-mbps-hd-h264.mkv ! matroskademux ! h264parse ! nvv4l2decoder enable-max-performance=1 ! fpsdisplaysink text-overlay=0 video-sink=fakesink sync=0 -v
    • 260+ fps
  • ffmpeg -c:v h264_nvv4l2dec -i jellyfish-5-mbps-hd-h264.mkv -c:v rawvideo -f null -
    • 120+ fps
  • ffmpeg -c:v h264 -threads 4 -i jellyfish-5-mbps-hd-h264.mkv -c:v rawvideo -f null -
    • 110+ fps (software decoder!)

And some observations:

  • four Cortex-A57 cores are able to reach almost the same decoding speed as the NVDEC hardware unit with an ffmpeg implementation
  • two ffmpeg decoding processes can be started simultaneously without the fps dropping: each still decodes at 120+ fps
    • this means plenty of hardware resources remain available when only one ffmpeg process is running
  • changing the output pixel format from YUV420 to NV12 improves ffmpeg’s h264_nvv4l2dec performance a lot, but it is still far from gstreamer’s:
    • nvv4l2dec_create_decoder(avctx, nv_codec_type, V4L2_PIX_FMT_NV12M /*V4L2_PIX_FMT_YUV420M*/);
    • after this change ffmpeg’s decoding framerate jumps to 160+ fps
  • enabling V4L2_CID_MPEG_VIDEO_MAX_PERFORMANCE does nothing
    • ret = set_ext_controls(ctx->fd, V4L2_CID_MPEG_VIDEO_MAX_PERFORMANCE, 1);
    • framerate is the same
    • disabling the enable-max-performance option in gstreamer likewise doesn’t change its performance
  • there is GitHub - jocover/jetson-ffmpeg: ffmpeg support on jetson nano, an implementation of the codec on top of the Jetson Multimedia API
    • it uses NvVideoDecoder class from Video Decoder API
    • its performance is slightly lower than h264_nvv4l2dec’s, but close
  • maximizing Jetson Nano performance with ‘nvpmodel’ and ‘jetson_clocks’ increases the framerate for all implementations, but ffmpeg’s h264_nvv4l2dec still remains two times slower than gstreamer’s nvv4l2decoder

Right now I think the main culprit is NvBufferTransform. If it is removed, the framerate rises to roughly the same level as gstreamer’s. Naturally, merely disabling NvBufferTransform discards the decoder’s output entirely, so by itself that is useless. But I have tried to change the code in nvv4l2_dec.c so that it extracts data from the source buffers instead of the destination buffer, like this:

NvBuffer2Raw(decoded_buffer->planes[0].fd, 0, parm.width[0], parm.height[0], ...

Strangely, after that the framerate dropped dramatically, to about 60 fps. That I don’t understand.

Actual questions:

  • Which nuances of the gstreamer implementation allow it to achieve 260 fps in h264 decoding?
    • What can be done to the ffmpeg h264_nvv4l2dec implementation to achieve performance similar to gstreamer’s?
  • Is it possible to drop NvBufferTransform in favor of retrieving frame data directly from the source buffers?
    • Why, in your opinion, did my attempt to retrieve data directly from the source buffer instead of via NvBufferTransform drop performance?

Thank you in advance.

The hardware-decoded frame data is in an NvBuffer, and to work with the ffmpeg framework the frame data has to be copied from the NvBuffer to a CPU buffer. For an optimal solution, we would suggest using gstreamer or jetson_multimedia_api. You can access the buffers directly through the NvBuffer APIs and keep the data in NvBuffer from head to tail, eliminating the memory copy.

By ‘use jetson_multimedia_api’ did you mean an NvEGLImageFromFd call with glEGLImageTargetTexture2DOES? If so, please clarify:

  • Do I have to call NvBufferTransform before the NvEGLImageFromFd call, or is it possible to bind the v4l2 device’s capture buffer directly?
    • If no call to NvBufferTransform is needed, would it really be much faster than copying the frame to a CPU buffer? I mean, samplerExternalOES in a fragment shader would probably (implicitly) perform the same transformations as NvBufferTransform. Considering that NvBufferTransform is one of the most time-consuming operations here, isn’t it just going to increase the draw-call time correspondingly?

For video decoding, please refer to this sample:


All samples are in the folder. More information is in the document:
Jetson Linux API Reference: Main Page

NvBufferTransform() is done via the hardware converter, so it does not consume CPU and does not add much latency.

Thanks for the answers!

Meanwhile I’ve built the 00_video_decode sample with Easy Profiler (GitHub - yse/easy_profiler: Lightweight profiler library for c++) and have some pictures to share:

No conclusions, just a couple of tests in the hope it might be useful for someone.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.