Performance optimization help

dumbogeorge · November 19, 2018, 7:04am

Hi Folks,

We have the following pipeline, in which we process video from each camera on the GPU/CPU and then select the output from either camera to be encoded.

Cam1 -->--> GPU --> CPU (frame processing, and data annotation ) -->
                                                                   |
                                                                   |--> select either cam source --> encoder --> encoded bitstream
                                                                   | 
Cam2 -->--> GPU --> CPU (frame processing, and data annotation ) -->

When we operate each camera at 1080/30, some of our CV routines (aaNewOCVConsumerThread, aaCamCaptureThread) eat up a significant percentage of cycles. However, when we operate each camera at 1080/60, some NVIDIA resource manager APIs seem to take a significant percentage of cycles (report from NVIDIA System Profiler 3.9 attached).

Could someone please help get to the bottom of this? Why does the NVIDIA resource manager take so many cycles? What is going on in v4l2convert.so? Is there a way to avoid whatever conversion is going on here?

Thanks

dumbogeorge · November 19, 2018, 7:04am

What is the best way to attach an image here?
Thanks

DaneLLL · November 19, 2018, 8:38am

Hi,
Please break down the pipeline to see which stage causes the high loading of the video converter.

Cam1 -->--> GPU --> CPU (frame processing, and data annotation ) -->
Cam2 -->--> GPU --> CPU (frame processing, and data annotation ) -->

Cam1 -->--> GPU -->
Cam2 -->--> GPU -->

Cam1 -->-->
Cam2 -->-->

You should be able to upload an image by clicking the image icon, no?

dumbogeorge · November 19, 2018, 8:53am

I get just “” when I click image icon.

Thanks

DaneLLL · November 19, 2018, 8:59am

Do you use Chrome?

dumbogeorge · November 19, 2018, 9:45am

Yes. Is there any other way to share an image? Will Dropbox work?

DaneLLL · November 20, 2018, 2:22am

Hi,
You may use any free online service. We suggest you break down the pipeline to clarify which stage the high video converter loading comes from.

dumbogeorge · November 21, 2018, 4:12am

Please find data pictures here -

PerfScreenShot1.png refers to the entire pipeline described in #1.

PerfScreenShot2.png refers to the following pipeline:

Cam1 (1080p @ 60) ----> renderer

Please help optimize the driver calls made in v4l2convert.so.

Thanks

DaneLLL · November 21, 2018, 5:22am

Hi,
v4l2convert.so is a third-party prebuilt library in the v4l2 framework, not owned by NVIDIA. From the result, it looks to be triggered by the high frame rate. At 60fps, you will see 60 capture requests per second via v4l2_ioctl.

One thing you can check is the sensor mode and output resolution. Take the onboard camera as an example; it supports 3 sensor modes:

PRODUCER: [0] W=2592 H=1944
PRODUCER: [1] W=2592 H=1458
PRODUCER: [2] W=1280 H=720

If you run

09_camera_jpeg_capture$ ./camera_jpeg_capture --disable-jpg --sensor-mode 0 --pre-res=1280x720 --cap-time 15

This puts load on the HW video converter, which does the 2592x1944 → 1280x720 conversion.

Otherwise, if you run

09_camera_jpeg_capture$ ./camera_jpeg_capture --disable-jpg --sensor-mode 2 --pre-res=1280x720 --cap-time 15

Making the sensor mode the same as the output resolution saves that load on the HW video converter.
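A minimal sketch of that selection logic, assuming the mode list printed above is already available as plain width/height pairs. `SensorMode`, `pickSensorMode` and `onboardCameraModes` are hypothetical names used only for illustration; they are not part of Argus or tegra_multimedia_api, and a real application would fill the list from the camera's own sensor-mode query.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical plain-data description of one sensor mode (not an Argus type).
struct SensorMode {
    int width;
    int height;
};

// Returns the index of the first mode whose resolution equals the requested
// output resolution, so that no scaling step is needed afterwards.
// Returns -1 when no mode matches exactly.
int pickSensorMode(const std::vector<SensorMode>& modes, int outWidth, int outHeight) {
    for (std::size_t i = 0; i < modes.size(); ++i) {
        if (modes[i].width == outWidth && modes[i].height == outHeight) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// The three onboard-camera modes listed in the post above.
std::vector<SensorMode> onboardCameraModes() {
    return {{2592, 1944}, {2592, 1458}, {1280, 720}};
}
```

With the modes above, a 1280x720 output selects index 2, which corresponds to the `--sensor-mode 2` command. A request with no exact match returns -1, and the caller then has to accept a conversion or change the output resolution.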

dumbogeorge · November 21, 2018, 9:46am

Hi DaneLLL

We are actually using the native sensor resolution as our final output resolution in our code. How can we zero in on what is triggering an unsolicited conversion, and how can we get rid of it?

I do not understand why a high fps would trigger it. Any idea what calls could be going on in v4l2convert.so? Or is there a quick and easy way for the owner of this library to offer a “no operation” version of those APIs? That could help determine whether those calls are unnecessary.

Furthermore, in order to imitate the simplified pipeline

Cam1 ---> renderer

I tried the argus_camera app, which seems much more optimal. From the profiler result it seems that the app is NOT making use of v4l2convert.so. How can I find out what is leading up to the calls to v4l2convert.so in my code and eliminate them?

Thanks,

DaneLLL · November 22, 2018, 1:31am

Hi,
The implementation of the Argus SW stack is:

sensor -> ioctl(VIDIOC_DQBUF) -> captured frames in raw format -> VI/ISP -> frames in I420/NV12 format

That’s why it shows higher loading in v4l2convert.so at 60fps. Compared to 30fps, the number of capture requests doubles.

v4l2convert.so is part of the open-source v4l2 stack, v4l-utils.git (media (V4L2, DVB and IR) applications and libraries).

dumbogeorge · November 22, 2018, 4:44am

Hi DaneLLL

Would the pipeline that you have given, i.e.

sensor -> ioctl(VIDIOC_DQBUF) -> captured frames in raw format -> VI/ISP -> frames in I420/NV12 format

apply to your app, argus_camera (from tegra_multimedia_api)?

I uploaded profile data from this app (without any video capture, but running 1080 @ 60 fps) at

https://www.dropbox.com/s/3ot6t7azlrdd7o7/PerScreenShot-argus-camera-app.png?dl=0

and I do not see any v4l2convert call here. Would it be easy for me to copy this example for my pipeline, i.e. what I described in #1?

DaneLLL · November 22, 2018, 6:30am

Hi,
The pipeline is built in libargus.so. Your application should run the same capture pipeline.

We have the source code of argus_camera at tegra_multimedia_api\argus\apps\camera. You may compare it with your application.
You may also download v4l-utils.git (media (V4L2, DVB and IR) applications and libraries), add debug prints, and rebuild v4l2convert.so to check where the loading comes from. The branch is stable-1.0.

dumbogeorge · December 1, 2018, 10:35am

Hi DaneLLL

I studied / copied tegra_multimedia_api/argus/apps/camera and compared it with what I have implemented. I am using pretty much the same Argus APIs that are used in the argus_camera app. While I am still working on this particular performance issue, I would like to ask a pipeline question related to the mapping of buffers. I suspect that it could be leading to a few unwanted conversion API calls.

My pipeline looks like -

Cam1 -->--> GPU --> CPU (frame processing, and data annotation )

More specifically -

Cam1 -->-->map-and-enqueue--> GPU --> CPU  (frame processing, and data annotation )  --->dequeue-and-unmap

I am wondering whether the act of mapping an input frame for storage in a queue could cause any format conversion?

This is what I do -

Acquire a frame, map it for further access on GPU and CPU.

UniqueObj<Frame> frame(iFrameConsumer->acquireFrame());
IFrame *iFrame = interface_cast<IFrame>(frame);

if (!iFrame)
    break;

// Get the Frame's Image.
Image *image = iFrame->getImage();

IArgusCaptureMetadata *iArgusCaptureMetadata = interface_cast<IArgusCaptureMetadata>(frame);
if (!iArgusCaptureMetadata)
    ORIGINATE_ERROR("Failed to get IArgusCaptureMetadata interface.");
CaptureMetadata *metadata = iArgusCaptureMetadata->getMetadata();
ICaptureMetadata *iMetadata = interface_cast<ICaptureMetadata>(metadata);
if (!iMetadata)
    ORIGINATE_ERROR("Failed to get ICaptureMetadata interface.");

EGLStream::NV::IImageNativeBuffer *iImageNativeBuffer
    = interface_cast<EGLStream::NV::IImageNativeBuffer>(image);
TEST_ERROR_RETURN(!iImageNativeBuffer, "Failed to create an IImageNativeBuffer");

// aaFrameBuffer is a data struct which encapsulates a few pointers and an NvBuffer.
aaFrameBuffer *framedata = new aaFrameBuffer;
framedata->framefd = iImageNativeBuffer->createNvBuffer(
    ARGUSSIZE {m_pCamInfo->liveParams.inputVideoInfo.width, m_pCamInfo->liveParams.inputVideoInfo.height},
    NvBufferColorFormat_YUV420, NvBufferLayout_Pitch, &status);

// Query the plane layout (pitch, height, offset) of the new buffer.
NvBufferGetParams(framedata->framefd, &(framedata->nvBufParams));
framedata->fsizeY = framedata->nvBufParams.offset[1] + (framedata->nvBufParams.offset[2]-framedata->nvBufParams.offset[1])*2;
framedata->fsizeU = framedata->nvBufParams.pitch[1] * framedata->nvBufParams.height[1];
framedata->fsizeV = framedata->nvBufParams.pitch[2] * framedata->nvBufParams.height[2];

m_pCamInfo->procInfo.pitchWidthY = framedata->nvBufParams.pitch[0];
m_pCamInfo->procInfo.pitchWidthU = framedata->nvBufParams.pitch[1];
m_pCamInfo->procInfo.pitchWidthV = framedata->nvBufParams.pitch[2];

AACAM_CAPTURE_PRINT("4 Starting frame capture  %d \n",m_currentFrame);

// Map each plane (Y, U, V) of the buffer fd for CPU access.
framedata->dataY = (char *)mmap(NULL, framedata->fsizeY, PROT_READ | PROT_WRITE, MAP_SHARED, framedata->framefd, framedata->nvBufParams.offset[0]);
framedata->dataU = (char *)mmap(NULL, framedata->fsizeU, PROT_READ | PROT_WRITE, MAP_SHARED, framedata->framefd, framedata->nvBufParams.offset[1]);
framedata->dataV = (char *)mmap(NULL, framedata->fsizeV, PROT_READ | PROT_WRITE, MAP_SHARED, framedata->framefd, framedata->nvBufParams.offset[2]);

The frame from step 1 is put in a queue (Q).
The GPU reads from the Q and processes it.
The CPU reads the output of the GPU and an older frame from the Q.
After a delay of about 8 frames, a given frame is popped from the Q, unmapped and destroyed.
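The queueing steps above can be sketched as a small delay queue. This is only an illustration under stated assumptions: `DelayQueue` and its release callback are hypothetical names, the callback stands in for whatever unmaps and destroys an `aaFrameBuffer` in the real code, and the sketch is single-threaded, whereas the real pipeline would need locking between the capture, GPU and CPU threads.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical delay queue: holds the most recent `delay` items and hands
// each older item to a release callback (e.g. unmap + destroy) exactly once.
template <typename T>
class DelayQueue {
public:
    DelayQueue(std::size_t delay, std::function<void(T&)> release)
        : delay_(delay), release_(std::move(release)) {}

    // Releases anything still queued when the queue itself is destroyed.
    ~DelayQueue() {
        while (!items_.empty()) popOldest();
    }

    DelayQueue(const DelayQueue&) = delete;
    DelayQueue& operator=(const DelayQueue&) = delete;

    // Enqueue a new frame; once more than `delay` frames are held,
    // the oldest one is released and dropped.
    void push(T item) {
        items_.push_back(std::move(item));
        while (items_.size() > delay_) popOldest();
    }

    // Access a queued frame by age: 0 is the oldest frame still held.
    T& at(std::size_t age) { return items_.at(age); }

    std::size_t size() const { return items_.size(); }

private:
    void popOldest() {
        release_(items_.front());
        items_.pop_front();
    }

    std::size_t delay_;
    std::function<void(T&)> release_;
    std::deque<T> items_;
};
```

With a delay of 8, the ninth pushed frame causes the first one to be released, so every frame is unmapped exactly once and at most 8 mappings are outstanding at any time.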

My question is whether mapping a frame fd could cause any conversion ?

Thanks,

DaneLLL · December 3, 2018, 1:43am

Hi,
Please use NvBufferMemMap() instead of mmap().
NvBufferMemMap() does not do any conversion.

dumbogeorge · December 3, 2018, 2:17am

Hi DaneLLL

Do you think that mmap could be the reason for some of the v4l2convert.so calls I mentioned in #8?

Thanks

DaneLLL · December 3, 2018, 3:16am

Hi,
I am not sure about mmap() and v4l2convert.so. But for NvBuffer, you should use the APIs defined in nvbuf_utils.h:

/**
* This method must be used for hw memory cache sync for the CPU.
* @param[in] dmabuf_fd DMABUF FD of buffer.
* @param[in] plane video frame plane.
* @param[in] pVirtAddr Virtual Addres pointer of the mem mapped plane.
*
* @returns 0 for success, -1 for failure.
*/
int NvBufferMemSyncForCpu (int dmabuf_fd, unsigned int plane, void **pVirtAddr);

/**
* This method must be used for hw memory cache sync for device.
* @param[in] dmabuf_fd DMABUF FD of buffer.
* @param[in] plane video frame plane.
* @param[in] pVirtAddr Virtual Addres pointer of the mem mapped plane.
*
* @returns 0 for success, -1 for failure.
*/
int NvBufferMemSyncForDevice (int dmabuf_fd, unsigned int plane, void **pVirtAddr);

/**
* This method must be used for getting mem mapped virtual Address of the plane.
* @param[in] dmabuf_fd DMABUF FD of buffer.
* @param[in] plane video frame plane.
* @param[in] memflag NvBuffer memory flag.
* @param[in] pVirtAddr Virtual Addres pointer of the mem mapped plane.
*
* @returns 0 for success, -1 for failure.
*/
int NvBufferMemMap (int dmabuf_fd, unsigned int plane, NvBufferMemFlags memflag, void **pVirtAddr);

/**
* This method must be used to Unmap the mapped virtual Address of the plane.
* @param[in] dmabuf_fd DMABUF FD of buffer.
* @param[in] plane video frame plane.
* @param[in] pVirtAddr mem mapped Virtual Addres pointer of the plane.
*
* @returns 0 for success, -1 for failure.
*/
int NvBufferMemUnMap (int dmabuf_fd, unsigned int plane, void **pVirtAddr);
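A hedged sketch of how the three `mmap()` calls from the earlier post might be replaced using the declarations quoted above: map one plane, sync it for the CPU, read it, then unmap. The helper `withMappedPlane` and the macro `HAVE_NVBUF_UTILS` are hypothetical names. The stub functions are stand-ins with the quoted signatures so the sketch compiles off-target; on a Jetson, define the macro and the real nvbuf_utils.h is used instead. The map, sync, access, unmap order is an assumption based on the header comments, not something confirmed in this thread.

```cpp
#include <cassert>

#ifdef HAVE_NVBUF_UTILS
#include "nvbuf_utils.h"  // real declarations, available on the Jetson target
#else
// ASSUMPTION: stand-in stubs with the signatures quoted above. They hand out
// a small fake buffer per plane and do not model real cache or DMA behaviour.
enum NvBufferMemFlags { NvBufferMem_Read, NvBufferMem_Write, NvBufferMem_Read_Write };
static unsigned char g_fakePlane[3][16];
static int g_openMappings = 0;

static int NvBufferMemMap(int, unsigned int plane, NvBufferMemFlags, void **pVirtAddr) {
    if (plane >= 3) return -1;
    *pVirtAddr = g_fakePlane[plane];
    ++g_openMappings;
    return 0;
}
static int NvBufferMemSyncForCpu(int, unsigned int, void **) { return 0; }
static int NvBufferMemUnMap(int, unsigned int, void **pVirtAddr) {
    *pVirtAddr = nullptr;
    --g_openMappings;
    return 0;
}
#endif

// Hypothetical helper: maps one plane for CPU reads, syncs it for the CPU,
// passes the mapped bytes to `visit`, then unmaps the plane again.
// Returns false if any of the nvbuf_utils calls reports failure.
template <typename Visitor>
bool withMappedPlane(int dmabufFd, unsigned int plane, Visitor visit) {
    void *addr = nullptr;
    if (NvBufferMemMap(dmabufFd, plane, NvBufferMem_Read, &addr) != 0) {
        return false;
    }
    bool ok = (NvBufferMemSyncForCpu(dmabufFd, plane, &addr) == 0);
    if (ok) {
        visit(static_cast<const unsigned char *>(addr));
    }
    if (NvBufferMemUnMap(dmabufFd, plane, &addr) != 0) {
        ok = false;
    }
    return ok;
}
```

In the capture code above, `framedata->framefd` would be passed as `dmabufFd`, with planes 0, 1 and 2 for Y, U and V. Keeping map and unmap inside one helper makes it harder to leak a mapping when an early return is added later.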

dumbogeorge · January 4, 2019, 10:24am

Hi DaneLLL

I would like to check on your answer #12

sensor -> ioctl(VIDIOC_DQBUF) -> captured frames in raw format -> VI/ISP -> frames in I420/NV12 format

Is there a way to avoid the VIDIOC_DQBUF call? I see this happening with the argus_camera app too. Does each arrow here mean a read and write transaction to external memory (DRAM)? That would seriously increase bandwidth usage and degrade performance.

Is there a way for data to go directly to VI/ISP, rather than being routed via external memory?
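To put a rough number on that concern, here is a back-of-the-envelope sketch. It assumes 8-bit YUV 4:2:0 output (I420 or NV12, 1.5 bytes per pixel) with no pitch padding, and it treats the number of DRAM passes as a free parameter, because the thread does not say how many times a frame actually crosses memory. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one 8-bit YUV 4:2:0 frame (I420 or NV12): a full-resolution luma
// plane plus two chroma planes at quarter resolution, i.e. 1.5 bytes/pixel.
// ASSUMPTION: ignores pitch padding and any raw (pre-ISP) sensor format.
constexpr std::uint64_t yuv420FrameBytes(std::uint64_t width, std::uint64_t height) {
    return width * height * 3 / 2;
}

// Bytes per second moved if every frame crosses DRAM `passes` times,
// where each full-frame write or read counts as one pass.
constexpr std::uint64_t dramBytesPerSecond(std::uint64_t width, std::uint64_t height,
                                           std::uint64_t fps, std::uint64_t passes) {
    return yuv420FrameBytes(width, height) * fps * passes;
}
```

For one 1080p camera at 60 fps, `dramBytesPerSecond(1920, 1080, 60, 1)` gives 186,624,000 bytes per second per pass, so each additional write or read of the full frame adds roughly 187 MB/s per camera.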

Thanks

DaneLLL · January 7, 2019, 2:12am

Hi,
The Argus framework is optimal and makes no extra memory copies. All the operations are required for sensor frame capture, so it has to take a reasonable amount of CPU/memory bandwidth.

So far we don’t have a plan to support it.
