How to achieve 4K (3,840x2,160)/30fps H.264 encoding performance with the OpenMAX IL API / L4T R24.1

I want to make a 4k/30fps streaming server application on the Jetson TX1.

  • L4T: R24.1 (32bit)
  • Input: USB 3.0, YUV(I420), 3,840x2,160/30fps (via HDMItoUSB3 converter)
  • Output: Ethernet, RTP/UDP/IP
  • Encode Buffer: allocated by cudaHostAlloc()

I have coded a capture module with the V4L2 interface in userptr mode as a C++ program
(referring to the v4l2 sample code).

But it takes over 40 msec from enqueue to dequeue of the OpenMAX encoder,
so it seems that I cannot achieve 30 fps encoding.

In accordance with the release notes for R24.1, I’ve tried to maximize the CPU/GPU performance.
Unfortunately, the result is almost the same as before running the scripts.

What is the maximum frequency of a CPU core?
–> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
… 1734000
(Is the max 1.734 GHz?)

(The GPU bus can be set to about 1GHz.)

Can I get 30 fps performance with the TX1 and the OpenMAX IL API (without GStreamer elements)?
Would anyone give me advice on how to achieve this performance?

Hi,

USB 3.0 bandwidth is not enough to do YUV (I420) 3,840x2,160/30fps.
You can do the multiplication; I think 9 fps would be the maximum.

Hi, malikcis,

Thanks for your comment.

I’ve confirmed that the bandwidth of USB 3.0 is enough,
and the interval of capturing a 4K image via USB 3.0 is about 33 msec (29-35 msec) on a Jetson TX1 development board.

The size of a 4K image:

  • 3,840 x 2,160 = 8,294,400 pixels
  • 1 YUV frame: 12 bits x 8,294,400 pixels = 99,532,800 bits
  • bit rate: 99,532,800 bits x 30 fps = 2.985984 Gbps

The bandwidth of USB 3.0 is 5 Gbps, so it is enough to transfer 4K/30fps YUV I420.

I tried to,
(a) Capture a 4K image via USB 3.0 — (1)
(b)-1 Enqueue 1 image to the encoder — (2)
(b)-2 Discard 1 image alternately
(c) Dequeue H.264 code from the encoder — (3)

The interval of (1) is about 33 msec (29-35 msec).
The difference between (2) and (3) is over 40 msec.

I suspect that the OpenMAX IL API is not well optimized for maximum performance.
I’m still waiting for advice and suggestions.

Hi,

Here is the right computation:
4K in YUV 4:2:0 at 12 bit:
12 bits x 3840 x 2160 x 1.5 x 30 fps = 4.48 Gbits/sec
(1.5 because of I420 color-space decimation)
Knowing that the actual USB 3.0 bandwidth is lower than 5 Gbits/sec (the USB bus is shared with other clients, and implementation efficiency is less than the maximum), you may end up with 50% of the 5 Gbits/sec, which is 2.5 Gbits/sec.
From this you see that even reaching 15 fps is tough.

However, if the camera has built-in compression, it may be a good idea to use that to transfer all images and do transcoding on the GPU afterwards.

I suggest you use a camera with HDMI/DVI output and this:

Hi malikcis,

I appreciate your advising me twice.

Sorry for my poor explanation.
“12 bit” means Y: 8 bit + U: 8 bit x 1/4 + V: 8 bit x 1/4.

I have already succeeded in capturing 4K/30fps.
My C++ code uses the V4L2 interface in userptr mode, and waits for the finish of capturing an image with a select() call.
The interval of these finish logs was 29-35 msec.

Thanks for your suggestion about the ZHAW HDMI/CSI board, too.
I know it, and have asked about its price.
But I’m using an HDMI/USB converter:
https://inogeni.com/4k-usb3-0/

This converter supports the I420 format and the UVC interface.
It can be configured via V4L2 ioctl().

Ok I see.
Did you have a look at the new encoding API in L4T R24.2? It was released 2 weeks ago. NVIDIA claims to have improved the encoding API. There are several encoding examples.

P.S.

I tried to use gst-launch-1.0 to compare the encode frame rate,
but unfortunately some errors occurred.

  • video-frame.c:136:gst_video_frame_map_id: failed to map video frame plane 0
  • gstomxvideoenc.c:1798: Invalid input buffer size
  • gstomxvideoenc.c:2139: Failed to write input into the OpenMAX buffer
  • gstelement.c:1835: Could not write to resource
  1. CLIENT_IP=
  2. gst-launch-1.0 -v v4l2src device="/dev/video1" ! 'video/x-raw, width=(int)3840, height=(int)2160, format=(string)I420' ! omxh264enc control-mode=2 bitrate=15000000 ! 'video/x-h264, stream-format=(string)byte-stream' ! h264parse ! rtph264pay mtu=1400 ! udpsink host=$CLIENT_IP port=5000 sync=false async=false

Sorry to duplicate.

Hi ShaneCCC@nVIDIA,

I’m glad to receive your reply.
https://devtalk.nvidia.com/default/topic/968216/jetson-tx1/nvidia-tx1-gstreamer-omxh265enc-fps/

Since a USB camera can’t use an NVMM buffer, a memory copy is necessary, and it really hurts the performance.

We have selected the Jetson TX1 platform to make a small 4K/30fps streaming server, because “Video encode: 4K 30Hz” is written in its specification.
http://www.nvidia.com/object/jetson-tx1-module.html

We want to make a prototype to confirm that we obtain 4k/30fps streaming performance with Jetson TX1 development kit.
Unfortunately, the camera on the devkit has 5MP resolution, less than 4K.
And other cameras using the MIPI/CSI interface cannot use the ISP on the TX1 to convert RAW (Bayer RGGB) to YUV.
So we are using the USB 3.0 interface to input 4K I420 video.

We have written C++ code for V4L2 capture, H.264 encoding with OpenMAX, and sending RTP packets.
We have already measured the time to copy between buffers, to encode a picture, and to assemble and send packets.

When cudaHostAlloc() is used to get the V4L2 userptr capture buffer, memcpy() spends 50-60 msec per 4K picture; we think that CUDA has to be used instead of memcpy().

When malloc() (memalign) is used, memcpy() spends 10 msec to copy from the V4L2 buffer to the OpenMAX buffer (allocated with cudaHostAlloc()).

It takes over 40 msec from enqueue (OpenMAX FillThisBuffer) to dequeue (OpenMAX FillBufferDone with the H.264 output).
In each case, H.264 encoding took over 40 msec from enqueue to dequeue, excluding the memory copies.

I want to know the programming conditions to get 4K/30fps encoding:

  • Allocation method of the buffers
  • CPU and GPU clock settings
    etc.

Can I use NVMM in my C++ code ?

Regards,

Hi mynaemi

  1. Other sensors can use the ISP on TX1 if you don’t care about the image tuning stuff. But the sensor bring-up may be an effort.
  2. You can migrate to R24.2 and use the Multimedia API, with the APIs below, for NV buffers.
int NvBufferCreate (int *dmabuf_fd, int width, int height, NvBufferLayout layout, NvBufferColorFormat colorFormat);
int NvBufferGetParams (int dmabuf_fd, NvBufferParams *params);
int NvBufferDestroy (int dmabuf_fd);

Hi ShaneCCC@nVIDIA,

I appreciate your support.

I’ve found “NvBufferDestroy()” and “createNvBuffer()” in the sample source code (R24.2).
But unfortunately I cannot find “NvBufferCreate()” or “NvBufferGetParams()” in
(1) NVIDIA Tegra Multimedia API Framework Documentation (L4T R24.2)
(2) include/NvBuffer.h
(3) samples/common/classes/NvBuffer.cpp
(4) include/EGLStream/NV/ImageNativeBuffer.h
(5) gstomx1_src.tbz2

Would you please tell me which documents or source code I should refer to?

Best Regards,

You can download the R24.2 documentation from the link below.
http://developer.nvidia.com/embedded/dlc/l4t-documentation-24-2


Hi ShaneCCC@nVIDIA,

Thanks very very very much!
I could read the API specification in the document you pointed to.
“Tegra Linux Driver Package Developer Guide” --> “Multimedia API Reference Documentation”

(1) Method NvBufferCreate() and NvBufferGetParams()

int NvBufferCreate (int *dmabuf_fd,
                    int width,
                    int height,
                    NvBufferLayout layout,
                    NvBufferColorFormat colorFormat);

Parameters
    [out] dmabuf_fd    Returns dmabuf_fd of hardware buffer.
    [in]  width        Specifies the hardware buffer width, in bytes.
    [in]  height       Specifies the hardware buffer height, in bytes.
    [in]  layout       Specifies the layout of buffer.
    [in]  colorFormat  Specifies the colorFormat of buffer.
Returns
    0 for success, -1 for failure.

int NvBufferGetParams (int dmabuf_fd,
                       NvBufferParams *params);

Parameters
    [in]  dmabuf_fd    DMABUF FD of buffer.
    [out] params       A pointer to the structure to fill with parameters.
Returns
    0 for success, -1 for failure.

(2) Data Structure _NvBufferParams (defined in nvbuf_utils.h)

uint32_t  dmabuf_fd
void *    nv_buffer
uint32_t  nv_buffer_size
uint32_t  pixel_format
uint32_t  num_planes
uint32_t  width[MAX_NUM_PLANES]
uint32_t  height[MAX_NUM_PLANES]
uint32_t  pitch[MAX_NUM_PLANES]
uint32_t  offset[MAX_NUM_PLANES]

I’ll try it again on R24.2.

Best Regards,

Here is some sample code for GStreamer; I hope it is helpful.

/* appsrc "need-data" callback (width, height, fcount and loop are
 * globals defined elsewhere in the program) */
static void
cb_need_data (GstAppSrc *appsrc,
        guint       unused_size,
        gpointer    user_data)
{
    g_print("In %s : frame %d \n", __func__, ++fcount);
    static gboolean white = FALSE;
    static GstClockTime timestamp = 0;
    static int fd_init = 0 ;
    static int dmabuf_fd = 0;
    static NvU8 *data_mem;
    static NvBufferParams params;
    static NvBufferLayout layout = NvBufferLayout_Pitch;
    static NvBufferColorFormat colorFormat = NvBufferColorFormat_UYVY;

    GstBuffer *buffer=NULL;
    GstFlowReturn ret;
    GstMemory *inmem = NULL;

    /*  UYVY: 16 bits per pixel is 2 bytes per pixel , so x2 */
    int fsize = width * height * 2;

    // Just create nvmm buffer once
    if ( !fd_init )
    {
      ret = NvBufferCreate(&dmabuf_fd, width, height, layout, colorFormat);

      if (dmabuf_fd)
      {
        g_print ("NvBuffer Create SUCCESSFUL\n");
        fd_init = dmabuf_fd;
        /* to get hMem fd memory address */
        data_mem = mmap(0, fsize, PROT_READ | PROT_WRITE, MAP_SHARED, dmabuf_fd, 0);
        if ( data_mem == MAP_FAILED ) {
          g_print("Error : MAP_FAILED\n");
          g_main_loop_quit (loop);
        }
      } else
      {
         g_print ("Error : NvBuffer Create FAILED\n");
         g_main_loop_quit (loop);
      }

      ret = NvBufferGetParams (dmabuf_fd, &params);
      if (ret == 0)
        g_print ("Get NvBuffer Params SUCCESSFUL\n");
      else
      {
        g_print ("Error : Get NvBuffer params FAILED\n");
        g_main_loop_quit (loop);
        NvBufferDestroy(dmabuf_fd);
      }
    }
    /* this makes the image black/white */
    white = !white;
    memset ( data_mem , white ? 0xff : 0x00 , fsize);
    /* Allocate a new buffer that wraps the given memory */
    buffer = gst_buffer_new_wrapped_full(0 , (guint8 *) params.nv_buffer,
                 params.nv_buffer_size, 0, params.nv_buffer_size, NULL, NULL);
    /* notify nvvidconv plugin this is nvmmbuf format */
    inmem = gst_buffer_peek_memory(buffer, 0);
    inmem->allocator->mem_type = "nvcam";

    GST_BUFFER_PTS (buffer) = timestamp;
    GST_BUFFER_DURATION (buffer) = gst_util_uint64_scale_int (1, GST_SECOND, 4);
    timestamp += GST_BUFFER_DURATION (buffer);

    g_signal_emit_by_name (appsrc, "push-buffer", buffer, &ret);

    if (ret != GST_FLOW_OK) {
      gst_buffer_unref( buffer );
      NvBufferDestroy(dmabuf_fd);
      /* something wrong, stop pushing */
      g_main_loop_quit (loop);
    }
}

Hi ShaneCCC@nVIDIA,

I appreciate your support.

I’ve modified my C++ code to use NvBufferCreate(), after adding the include path for “nvbuf_utils.h”.
Compilation succeeded, but linking failed:

undefined reference to 'NvBufferCreate'
undefined reference to 'NvBufferGetParams'

Then I added “-lnvbuf_utils” to the Makefile.

/usr/bin/ld: cannot find -lnvbuf_utils

What is lacking in my environment configuration?

You have to point to the location of libnvbuf_utils.so.1.0 with the -L parameter to gcc.

http://www.yolinux.com/TUTORIALS/LibraryArchives-StaticAndDynamic.html
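For example (the library path below is an assumption; it varies by L4T release and architecture, so check where libnvbuf_utils.so.1.0 actually lives on your system):

```shell
# Hypothetical object files and path -- adjust to your L4T install.
g++ capture.o encode.o -o streamer \
    -L/usr/lib/aarch64-linux-gnu/tegra -lnvbuf_utils

# If only libnvbuf_utils.so.1.0 is present, the linker also needs an
# unversioned name so that -lnvbuf_utils can resolve:
# sudo ln -s libnvbuf_utils.so.1.0 /usr/lib/aarch64-linux-gnu/tegra/libnvbuf_utils.so
```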

Hi All,

I appreciate your supports.

“-L” and the library path have been added to the ld command.
Then I built my prototype C++ code.

The encode time has come down to around 33 msec, but still a little less than 30 fps.

Best Regards,

Mynaemi,

Could you please share more info on your setup?

I am having issues capturing 4K at 30 fps using the same Inogeni HDMI to USB 3.0 contraption. I am using a TK1 for this instead of a TX1, but results are the same for me on both. I am capturing 3840x2160@30Hz frames from the HDMI output of a TX1, encoding on the TK1 using ffmpeg/avconv/gst-launch-1.0/gst-launch-0.10 directly from /dev/videoX, but only getting 8-9 fps instead of close to 30 fps. 1080p capture with the same setup cranks along at >25 fps.

Did you perform any kernel/driver modifications?

Thank you,

Octavio

Hi Octavio,

Unfortunately, I have no good information.
I’m using stock L4T R24.2 on our Jetson TX1s.
I have not modified or configured the kernel or the USB3/V4L2 drivers.

My configuration of the INOGENI via ioctl (V4L2 driver) is:

  • 3840x2160@30fps
  • YUV 4:2:0 (I420)

Are you using a TK1?
JetPack 2.3 includes:

  • TX1: L4T R24.2 64bit
  • TK1: L4T R21.5

I don’t know the difference between TK1 R21.5 and TX1 R24.2.

I cannot understand this part:

ffmpeg/avconv/gst-launch-1.0/gst-launch-0.10

I think that in L4T R24.2, there is only GStreamer 1.0 (without GStreamer 0.10).

Could you please test the performance without the ffmpeg elements?
Video Source --(HDMI)--> INOGENI --(USB3)--> Jetson TK1 --> /dev/videoN --> v4l2src --> some overlaysink
Does it manage to capture at 30 fps?

And would you try the NVIDIA System Profiler?
I saw our own capture thread run every 33 msec (30 fps, 29-39 msec).
I have not yet tuned the OpenMAX H.264 encoder down to 33 msec (still about 28-29 fps).

Before using NvBuffer for the input buffer of the H.264 encoder, I saw 15-20fps@3840x2160.

Best Regards,

mynaemi,

Thank you for answering. I am also using the same configuration as you for the inogeni: 3840x2160@30fps. I have updated the TK1 to the latest software/driver releases.

On the ffmpeg/avconv/gst-launch-1.0/gst-launch-0.10 line: I am sorry about the confusion; by that I meant that I get similar results using ffmpeg, avconv, gst-launch-1.0 or gst-launch-0.10. The main difference is basically in CPU load, with gst-launch-0.10 with nv_omx_h264enc being less CPU demanding.

The setup I have is:
Video Source --(HDMI)--> INOGENI --(USB3)--> Jetson TK1 --> /dev/videoN --> v4l2src --> some overlaysink

No matter what overlaysink I use (/dev/null, file, RTMP server), I get 8-9 fps with any of the command-line utilities. However, I wrote a C program to grab the frames (3840x2160) using the V4L2 API, and I can see an improvement to >15 fps (just grabbing), but still well under the theoretical 30 fps. This is using mmap; I will try pinned memory next to see if there is an improvement.

Next I will investigate the encoding API options. I will need to publish the captured frames to a remote RTMP server. I can see you are using the OpenMAX encoder API; is there any reason for this?

Thank you again,

Octavio