How to achieve H.264 encoding performance of 4K (3,840x2,160)/30fps with the OpenMAX IL API / L4T R24.1

Hi mynaemi,

I’m trying 4K30 capture via an INOGENI device plus HEVC encoding and streaming, and I’m facing a similar issue.

My system is as below:

Video Source --(HDMI)--> INOGENI --(USB3)--> Jetson TX1 --> /dev/videoN --> v4l2src --> OpenMAX H.265 encode --> network stream

The problem is that although capture happens in real time, the encode is sub-real-time, i.e. 4K30 HEVC encoding takes around 44 ms per frame, which results in 22-24 fps.
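
For reference, the GStreamer form of this chain would be roughly as follows (an illustrative sketch only: the caps have to match what the INOGENI actually delivers, and the device path, bitrate and the final file-write stage are placeholders for the real network sink):

gst-launch-1.0 -e v4l2src device=/dev/video0 ! \
    'video/x-raw, width=3840, height=2160, framerate=30/1, format=NV12' ! \
    nvvidconv ! 'video/x-raw(memory:NVMM), format=I420' ! \
    omxh265enc bitrate=20000000 ! filesink location=test.h265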

As suggested in this thread, I’m trying to use the NVMM buffer, but I’m facing some issues with that.

I’m creating the NvBuffer as below:

NvBufferCreate(&buffer_fd,
               3840,
               2160,
               NvBufferLayout_Pitch,
               NvBufferColorFormat_NV12);

if (buffer_fd)
{
    printf("NvBuffer Create SUCCESSFUL\n");

    pv_buffer = mmap(0, (3840 * 2160 * 3) >> 2,
                     PROT_READ | PROT_WRITE, MAP_SHARED, buffer_fd, 0);

    if (pv_buffer == MAP_FAILED)
    {
        printf("Error : MAP_FAILED\n");
    }
    else
    {
        printf("pvbuf - %p\n", pv_buffer);
    }
}

When pv_buffer is passed to the OpenMAX encoder, it returns this error:

VENC: NvMMLiteVideoEncDoWork: 2572: BlockSide error 0x2
Event_BlockError from 0BlockHevcEnc : Error code - 2
Sending error event from 0BlockHevcEnc

I’m stuck with this and not able to proceed further. Please help.

Hi zeitgeist,

I’ve already switched from the “OpenMAX IL API” to the “nVidia Multimedia API”, and I’m trying H.264 encoding, so I’m sorry, I cannot offer any comment.

Would you wait for advice from an nVidia engineer?

Hi zeitgeist,
Please refer to the link below for the reference code.

https://devtalk.nvidia.com/default/topic/984850/jetson-tx1/how-to-convert-yuv-to-jpg-using-jpeg-encoder-hardware-/post/5048479/#5048479

Hi ShaneCCC,

I tried the link you pointed to. There is a new nvbuf_utils library and also some sample code.

But NvBuffer creation is still an issue for me. Below is the code snippet:

for (i = 0; i < 10; i++)
{
    int width  = 3840;
    int height = 2160;
    int size   = (3840 * 2160 * 3) >> 2;

    printf("--------- Loop : %d --------\n", i);

    if (-1 == NvBufferCreate(&fd,
                             width,
                             height,
                             NvBufferLayout_Pitch,
                             NvBufferColorFormat_NV12))
    {
        printf("Failed to create NV buffer\n");
    }

    if (fd)
    {
        printf("NvBuffer Create SUCCESSFUL\n");

        /* to get hMem fd memory address */
        pv_buffer = mmap(0,
                         size,
                         PROT_READ | PROT_WRITE,
                         MAP_SHARED,
                         fd,
                         0);

        if (pv_buffer == MAP_FAILED)
        {
            printf("Error : MAP_FAILED\n");
        }
        else
        {
            printf("pvbuf - %p\n", pv_buffer);
        }
    }
    else
    {
        printf("Error : NvBuffer Create FAILED\n");
    }

    NvBufferParams params;

    ret = NvBufferGetParams(fd, &params);
    if (ret == 0)
    {
        printf("Get NvBuffer Params SUCCESSFUL\n");

        printf("FD        0x%x\n", params.dmabuf_fd);
        printf("NVBuffer  %p\n", params.nv_buffer);
        printf("NVB Size  %u\n", params.nv_buffer_size);
        printf("Format    %d\n", params.pixel_format);
        printf("NumPlanes %d\n", params.num_planes);
        printf("width[0]  %d\n", params.width[0]);
        printf("height[0] %d\n", params.height[0]);
        printf("pitch[0]  %d\n", params.pitch[0]);
        printf("offset[0] %d\n", params.offset[0]);
        printf("width[1]  %d\n", params.width[1]);
        printf("height[1] %d\n", params.height[1]);
        printf("pitch[1]  %d\n", params.pitch[1]);
        printf("offset[1] %d\n", params.offset[1]);
    }
    else
    {
        printf("Error : Get NvBuffer params FAILED\n");
        NvBufferDestroy(fd);
    }
}

My output is -

--------- Loop : 0 --------
NvBuffer Create SUCCESSFUL

pvbuf - 0x7f7de11000

Get NvBuffer Params SUCCESSFUL
FD        0x428
NVBuffer  0x7f780008c0
NVB Size  776
Format    2
NumPlanes 2
width[0]  3840
height[0] 2160
pitch[0]  3840
offset[0] 0
width[1]  1920
height[1] 1080
pitch[1]  3840
offset[1] 8388608
                        
--------- Loop : 1 --------
NvBuffer Create SUCCESSFUL

pvbuf - 0x7f7d822000

Get NvBuffer Params SUCCESSFUL
FD        0x429
NVBuffer  0x7f78000c10
NVB Size  776
Format    2
NumPlanes 2
width[0]  3840
height[0] 2160
pitch[0]  3840
offset[0] 0
width[1]  1920
height[1] 1080
pitch[1]  3840
offset[1] 8388608

--------- Loop : 2 --------
NvBuffer Create SUCCESSFUL

pvbuf - 0x7f7d233000

Get NvBuffer Params SUCCESSFUL
FD        0x42a
NVBuffer  0x7f78000f60
NVB Size  776
Format    2
NumPlanes 2
width[0]  3840
height[0] 2160
pitch[0]  3840
offset[0] 0
width[1]  1920
height[1] 1080
pitch[1]  3840
offset[1] 8388608

.....

I’m not able to understand how the gap between two mmap pointers can be less than the buffer size, i.e.

pvbuf 1 - (0x7f7de11000, size - 12441600) and pvbuf 2 - (0x7f7d822000, size - 12441600), and

(pvbuf 1 - pvbuf 2) < size

(0x7f7de11000 - 0x7f7d822000) < 12441600

6221824 < 12441600
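
A quick arithmetic check on my own snippet (just an observation; the rounding assumes the usual 4 KiB pages and that the kernel places the mappings back to back):

/* sizes implied by the mmap() call above */
#include <stdio.h>

int main(void)
{
    long mapped = (3840L * 2160 * 3) >> 2;          /* length passed to mmap()   =  6220800 */
    long nv12   = (3840L * 2160 * 3) >> 1;          /* full 4K NV12 frame size   = 12441600 */
    long vma    = ((mapped + 4095) / 4096) * 4096;  /* page-rounded mapping size =  6221824 */

    printf("mmap length        %ld\n", mapped);
    printf("full NV12 size     %ld\n", nv12);
    printf("page-rounded VMA   %ld\n", vma);        /* equals the 0x5EF000 gap seen above */
    return 0;
}

If that rounding assumption holds, the gap simply reflects the length I passed to mmap(), which is smaller than a full NV12 frame.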

The real buffer is the data pointer:

class NvBuffer
{
public:
    typedef struct
    {
        uint32_t width;
        uint32_t height;
        uint32_t bytesperpixel;
        uint32_t stride;
        uint32_t sizeimage;
    } NvBufferPlaneFormat;

    typedef struct
    {
        NvBufferPlaneFormat fmt;
        unsigned char *data;    /* <-- the real buffer */
        uint32_t bytesused;
        int fd;
        uint32_t mem_offset;
        uint32_t length;
    } NvBufferPlane;

You seem to be referring to NvBuffer.h, which has the NvBuffer class. I was referring to nvbuf_utils.h, which has functions to allocate HW buffers. I’m facing a problem in allocating and using the hardware buffers, as described above.
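
For completeness, the per-plane access pattern I am aiming at with nvbuf_utils is roughly the following (untested sketch; it assumes this release already exposes NvBufferMemMap/NvBufferMemSyncForCpu/NvBufferMemUnMap the way later sample code does):

#include <stdio.h>
#include "nvbuf_utils.h"

/* Sketch: create a 4K NV12 hardware buffer and touch each plane from the CPU. */
static int fill_nvbuffer(void)
{
    int fd = -1;

    if (NvBufferCreate(&fd, 3840, 2160,
                       NvBufferLayout_Pitch, NvBufferColorFormat_NV12) != 0)
        return -1;

    NvBufferParams params;
    if (NvBufferGetParams(fd, &params) != 0)
    {
        NvBufferDestroy(fd);
        return -1;
    }

    for (unsigned int p = 0; p < params.num_planes; p++)
    {
        void *ptr = NULL;

        if (NvBufferMemMap(fd, p, NvBufferMem_Write, &ptr) != 0)
            break;
        NvBufferMemSyncForCpu(fd, p, &ptr);

        for (unsigned int row = 0; row < params.height[p]; row++)
        {
            unsigned char *dst = (unsigned char *)ptr + row * params.pitch[p];
            /* fill one row of this plane here; rows are params.pitch[p] bytes apart */
            (void)dst;
        }

        NvBufferMemUnMap(fd, p, &ptr);
    }

    /* hand 'fd' to the encoder as a dmabuf; destroy it once encoding is done */
    NvBufferDestroy(fd);
    return 0;
}

The idea is to let nvbuf_utils handle the per-plane pitch and offsets instead of doing one flat mmap() of the fd.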

Hi zeitgeist,
You should be able to do H.265 encoding and CUDA processing by referring to the sample code:
[url]https://devtalk.nvidia.com/default/topic/984850/jetson-tx1/how-to-convert-yuv-to-jpg-using-jpeg-encoder-hardware-/post/5048479/#5048479[/url]
tegra_multimedia_api/samples/01_video_encode
tegra_multimedia_api/samples/03_video_cuda_enc

Have you seen issues when integrating the samples into your use case?

Hi DaneLLL,

I’m not able to get the required throughput of 30 fps, so I’m trying out multiple options (that’s why the NVMM/HW buffer). I tried to isolate the issue by running the 01_video_encode sample application, but no luck.
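
For reference, the isolation test was along these lines (file names are placeholders; the argument order is taken from the sample’s usage text and may differ between releases):

./video_encode in.yuv 3840 2160 H265 out.h265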

Since it is deviating from the topic of this thread, I have started another thread -

[url]https://devtalk.nvidia.com/default/topic/994281/v4l2-video-encoder-performance-/#5086342[/url]

I’m able to get 4K encode at 30 fps for H.264, but not for H.265 (it is around 22-24 fps).

Let’s continue on https://devtalk.nvidia.com/default/topic/994281/

Hi zeitgeist,

You wrote:
I’m able to get 4K encode at 30 fps for H.264

How did you get 4K encoding at 30 fps for H.264?
With the OpenMAX IL API, or with the MMAPI?

Using the MMAPI, the ioctl() in qBuffer() blocks until the encoding is done.
The encoding time is less than 33 msec, but the system performance will not reach 30 fps.
I’m waiting for the next release from nVIDIA.
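
A small worked example of the arithmetic behind this (illustrative numbers only, not measurements):

#include <stdio.h>

int main(void)
{
    const double frame_period = 1000.0 / 30.0; /* 33.3 ms budget per frame at 30 fps        */
    const double t_encode     = 28.0;          /* example: the encode itself is fast enough */
    const double t_other      = 10.0;          /* example: capture/copy/queueing per frame  */

    /* If qBuffer() blocks until the encode is done, the stages are serialized. */
    printf("budget     : %.1f ms/frame\n", frame_period);
    printf("serialized : %.1f ms/frame -> %.1f fps\n",
           t_encode + t_other, 1000.0 / (t_encode + t_other));
    return 0;
}

So even with an encode time below 33 ms, any per-frame work that has to wait behind the blocking ioctl() pushes the total past the 30 fps budget.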

https://devtalk.nvidia.com/default/topic/987024/
DaneLLL wrote:
We can observe the issue and are checking the plan of making improvement on future releases.

Please refer to https://devtalk.nvidia.com/default/topic/994281/jetson-tx1/v4l2-video-encoder-performance-/post/5090266/#5090266

Hi DaneLLL,

I appreciate your suggestion.

I’m trying the MM_API_ADD_ENCODER_HW_PRESET.zip patch with L4T R24.2.1.

In the GStreamer document,
http://developer.download.nvidia.com/embedded/L4T/r24_Release_v2.1/Docs/Accelerated_GStreamer_User_Guide_Release_24.2.1.pdf?autho=1488276147_7986022c09a2221ecce45fcc0ee3582a&file=Accelerated_GStreamer_User_Guide_Release_24.2.1.pdf

  • Fast: Only Integer Pixel (integer-pel) block motion is estimated.
  • Medium: Supports up to Half Pixel (half-pel) block motion estimation.
  • Slow: Supports up to Quarter Pixel (Qpel) block motion estimation.

** The UltraFastPreset mode has no description.

I think that the fast/medium modes do not give enough image quality for H.264/H.265 encoding.
Should we select the “UltraFast” mode for 4K/30fps?
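
For context, this is roughly how I expect to apply the preset with the patched MMAPI encoder (a sketch only; I am assuming the patch adds a setHWPresetType() call and a V4L2_ENC_HW_PRESET_ULTRAFAST enum, as in later MMAPI releases, and the other values are placeholders):

#include "NvVideoEncoder.h"

// Sketch only: set up a 4K H.264 encoder and request the UltraFast HW preset.
static NvVideoEncoder *create_4k_encoder()
{
    NvVideoEncoder *enc = NvVideoEncoder::createVideoEncoder("enc0");
    if (!enc)
        return NULL;

    // Capture plane carries the encoded bitstream, output plane the raw frames.
    enc->setCapturePlaneFormat(V4L2_PIX_FMT_H264, 3840, 2160, 2 * 1024 * 1024);
    enc->setOutputPlaneFormat(V4L2_PIX_FMT_YUV420M, 3840, 2160);
    enc->setBitrate(20 * 1000 * 1000);
    enc->setFrameRate(30, 1);

    // The HW preset added by the patch (assumed API name).
    enc->setHWPresetType(V4L2_ENC_HW_PRESET_ULTRAFAST);

    return enc;
}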

Hi mynaemi,
Yes, please select ‘UltraFast’ for 4kp30.

Please also refer to:
[url]https://devtalk.nvidia.com/default/topic/994281/jetson-tx1/v4l2-video-encoder-performance-/post/5092850/#5092850[/url]

Hi DaneLLL,

Thanks for your response.
It’s too bad…

What is the difference between “Fast” and “UltraFast” in terms of image quality?
(In the document, the UltraFastPreset mode has no description.)

Hi mynaemi,
The details are about the HW implementation.
Please compare the video quality by taking snapshots of the encoded H.264/HEVC streams, as in
[url]https://devtalk.nvidia.com/default/topic/932754/jetson-tx1/more-clarity-on-h-265-encode-parameters-/post/5037363/#5037363[/url]

Hi,

I’m facing another issue with 4K30 USB capture. I’m trying to realize a 4K30 capture → encode → stream pipeline. I see that USB 3.0 capture affects the Ethernet throughput, i.e. when 4K30 is being captured, the Ethernet bandwidth drops drastically, because of which I’m not able to achieve video streaming.

These are my observations:

When I run iperf, this is the output:

iperf -c 192.168.39.9 -u -b 1000M -i 1
------------------------------------------------------------
Client connecting to 192.168.39.9, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.49.12 port 42333 connected with 192.168.39.9 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  50.4 MBytes   423 Mbits/sec
[  3]  1.0- 2.0 sec  50.5 MBytes   424 Mbits/sec
[  3]  2.0- 3.0 sec  51.1 MBytes   429 Mbits/sec
:
:
[  3]  0.0-10.0 sec   505 MBytes   424 Mbits/sec
[  3] Sent 360266 datagrams
[  3] Server Report:
[  3]  0.0-10.2 sec   114 MBytes  93.8 Mbits/sec  13.993 ms 278688/360265 (77%)

When I run V4L2 capture as follows and then run iperf simultaneously, the throughput drops:

v4l2-ctl --device /dev/video0 --set-fmt-video=width=3840,height=2160,pixelformat=NV12
v4l2-ctl --device /dev/video0 --stream-mmap  --stream-poll
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 31.37 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 30.50 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 30.36 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 30.25 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 30.20 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 30.18 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 30.15 fps

Tegrastats -
RAM 1747/3995MB (lfb 303x4MB) cpu [89%,1%,1%,0%]@1734 EMC 14%@1600 AVP 1%@80 NVDEC 268 MSENC 268 GR3D 0%@998 EDP limit 1734
RAM 1747/3995MB (lfb 303x4MB) cpu [85%,0%,0%,0%]@1734 EMC 14%@1600 AVP 1%@80 NVDEC 268 MSENC 268 GR3D 0%@998 EDP limit 1734
RAM 1747/3995MB (lfb 303x4MB) cpu [86%,0%,1%,0%]@1734 EMC 14%@1600 AVP 1%@80 NVDEC 268 MSENC 268 GR3D 0%@998 EDP limit 1734

Now I run iperf:
[  3] local 192.168.49.12 port 56749 connected with 192.168.39.9 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  1.27 MBytes  10.6 Mbits/sec
[  3]  1.0- 2.0 sec  1.12 MBytes  9.43 Mbits/sec
[  3]  2.0- 3.0 sec  1.66 MBytes  13.9 Mbits/sec
:
:
[  3]  9.0-10.0 sec  1.12 MBytes  9.38 Mbits/sec
[  3]  0.0-10.0 sec  14.1 MBytes  11.8 Mbits/sec
[  3] Sent 10056 datagrams
[  3] Server Report:
[  3]  0.0-10.0 sec  14.1 MBytes  11.8 Mbits/sec   0.736 ms 

And the tegrastats output is:

RAM 1747/3995MB (lfb 303x4MB) cpu [100%,0%,1%,1%]@1734 EMC 14%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 0%@998 EDP limit 1734
RAM 1747/3995MB (lfb 303x4MB) cpu [100%,0%,0%,4%]@1734 EMC 14%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 0%@998 EDP limit 1734
RAM 1747/3995MB (lfb 303x4MB) cpu [100%,0%,2%,8%]@1734 EMC 14%@1600 AVP 0%@80 NVDEC 268 MSENC 268 GR3D 0%@998 EDP limit 1734

The Ethernet throughput dropped from ~100 Mbps to ~10 Mbps.
I have run the Jetson max-clocks script as well, but there is no improvement.

I found out that the Ethernet controller on the TX1 is an RTL8153, which is also connected via USB 3.0.

Can this be leading to this issue?
Is there a way to achieve 4K30 USB capture and streaming on TX1?

Maybe the USB and Ethernet ports actually share the same controller and bus, so if the USB throughput is heavy, then Ethernet would be affected.

So if we could use two boards, the throughput would be doubled :)

Raspberry Pi also has this design/issue.

Please refer to
[url]https://devtalk.nvidia.com/default/topic/1001154/jetson-tx1/ethernet-speed-issue-for-4k-usb-capture-and-streaming-/post/5192499/#5192499[/url]

I am trying to access NvBuffer memory and convert it to an OpenCV Mat.
I am able to do the conversion, but in the process of doing so the image gets converted into a grayscale one with messed-up dimensions. However, if I do an ofstream file write with m_JpegEncoder, I get a proper image with the right dimensions.

Here is a piece of code:

while (true)
    {
        for (uint32_t i = 0; i < m_streams.size(); i++)
        {
            // Acquire a frame.
            UniqueObj<Frame> frame(iFrameConsumers[i]->acquireFrame());
            IFrame *iFrame = interface_cast<IFrame>(frame);
            if (!iFrame)
                break;

            // Get the IImageNativeBuffer extension interface.
            NV::IImageNativeBuffer *iNativeBuffer =
                interface_cast<NV::IImageNativeBuffer>(iFrame->getImage());
            if (!iNativeBuffer)
                ORIGINATE_ERROR("IImageNativeBuffer not supported by Image.");

            // If we don't already have a buffer, create one from this image.
            // Otherwise, just blit to our buffer.
            if (!m_dmabufs[i])
            {   
                
                m_dmabufs[i] = iNativeBuffer->createNvBuffer(iEglOutputStreams[i]->getResolution(),
                                                          NvBufferColorFormat_YUV420,
                                                          NvBufferLayout_Pitch);
                if (!m_dmabufs[i])
                    CONSUMER_PRINT("\tFailed to create NvBuffer\n");
                

            }
            else if (iNativeBuffer->copyToNvBuffer(m_dmabufs[i]) != STATUS_OK)
            {
                ORIGINATE_ERROR("Failed to copy frame to NvBuffer.");
            }
            
        }


//

if (m_streams.size() > 1)
        {
            // Composite multiple input to one frame

            NvBufferComposite(m_dmabufs, m_compositedFrame, &m_compositeParam);

            // THIS WORKS BUT I AM GETTING AN IMAGE THAT IS IN GRAYSCALE
            // WITH THE DIMENSIONS MESSED UP
            NvBufferParams params;
            NvBufferGetParams(m_compositedFrame, &params);
            void *ptr_y;
            uint8_t *ptr_cur;
            int i, j, a, b;
            NvBufferMemMap(m_compositedFrame, Y_INDEX, NvBufferMem_Write, &ptr_y);
            NvBufferMemSyncForCpu(m_compositedFrame, Y_INDEX, &ptr_y);
            ptr_cur = (uint8_t *)ptr_y + params.pitch[Y_INDEX]*START_POS + START_POS;
            char *data_mem = (char*)ptr_cur;
            cv::Mat imgbuf = cv::Mat(1920, 3840, CV_8UC3, data_mem, 3840);
            std::cout<<"Img Buffer\n"<<imgbuf.dims<<"\n";
            cv::imshow("img", imgbuf);

            // THIS WRITES THE IMAGE PERFECTLY

            unsigned char *buffer;
            char* data;
            std::ofstream *outputFile = new std::ofstream(filename.c_str());
            struct Result cinfo;
            unsigned long size = 1920*1920*2;
            if (outputFile)
            {
                
                buffer = m_OutputBuffer;
                cinfo = m_JpegEncoder->encodeFromFd(m_compositedFrame, JCS_YCbCr, &buffer, size);
                outputFile->write((char *)cinfo.buf, cinfo.size);
                std::string str(cinfo.buf, cinfo.buf + cinfo.size);
                delete outputFile;

                

            }



}

I have also tried the mmap approach, but that did not work either.
I know about NvEglRenderer and it works perfectly, but I need the buffer in an OpenCV Mat.

This is the mmap approach, but it yields the same grayscale image with the dimensions messed up:

// NvBufferParams params0;
            // NvBufferGetParams(m_compositedFrame, &params0);
            // int fsize0 = params0.pitch[0] * 1920;
            // char *data_mem0 = (char*)mmap(0, 1920*1920*2, PROT_WRITE, MAP_SHARED, m_compositedFrame, params0.offset[0]);
            // if (data_mem0 == MAP_FAILED)
            //     printf("mmap failed : %s\n", strerror(errno));
            // cv::Mat imgbuf0 = cv::Mat(1920, 1920*2, CV_8UC3, data_mem0, 1920*2);
            // cv::imshow("img0", imgbuf0);

I have also tried converting the char* buffer into a std::vector and then using that with cv::imdecode to get an OpenCV Mat, but that did not work either and I get an OpenCV exception.
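
What I am effectively trying to do is roughly the following (untested sketch, assuming the composited NvBuffer is three-plane pitch-linear YUV420 and that a CPU copy into a contiguous I420 buffer is acceptable):

// needs <vector>, <opencv2/opencv.hpp> and "nvbuf_utils.h"; drops into the composite branch above
NvBufferParams par;
NvBufferGetParams(m_compositedFrame, &par);

std::vector<uint8_t> i420;
for (unsigned int p = 0; p < par.num_planes; p++)
{
    void *plane = NULL;
    NvBufferMemMap(m_compositedFrame, p, NvBufferMem_Read, &plane);
    NvBufferMemSyncForCpu(m_compositedFrame, p, &plane);

    // copy each row without the pitch padding so the result is packed I420 data
    for (unsigned int row = 0; row < par.height[p]; row++)
    {
        const uint8_t *src = (const uint8_t *)plane + row * par.pitch[p];
        i420.insert(i420.end(), src, src + par.width[p]);
    }
    NvBufferMemUnMap(m_compositedFrame, p, &plane);
}

// wrap the packed planes as one single-channel Mat of height*3/2 rows and convert
cv::Mat yuv(par.height[0] * 3 / 2, par.width[0], CV_8UC1, i420.data());
cv::Mat bgr;
cv::cvtColor(yuv, bgr, cv::COLOR_YUV2BGR_I420);
cv::imshow("img", bgr);

The JPEG path works because encodeFromFd() consumes the dmabuf directly, so the packed CPU copy would only be needed for the cv::Mat case.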

If there is any way to do this please let me know.

Hi tripathy.devi7,
Your issue is different from this thread. Please make a new post.