Performance optimization help

Hi Folks,

We have the following pipeline, in which video from each camera is processed on the GPU and CPU, and the output from one of the cameras is then selected for encoding.

Cam1 -->--> GPU --> CPU (frame processing, and data annotation ) -->
                                                                   |
                                                                   |--> select either cam source --> encoder --> encoded bitstream
                                                                   | 
Cam2 -->--> GPU --> CPU (frame processing, and data annotation ) -->

When each camera runs at 1080p at 30 fps, some of our CV routines (aaNewOCVConsumerThread, aaCamCaptureThread) consume a significant percentage of cycles. When each camera runs at 1080p at 60 fps, however, some NVIDIA resource manager API calls seem to take a significant percentage of cycles (report from NVIDIA System Profiler 3.9 attached).

Could someone please help us get to the bottom of this? Why does the NVIDIA resource manager take so many cycles? What is going on in v4l2convert.so? Is there a way to avoid whatever conversion is happening here?

Thanks

What is the best way to attach an image here?
Thanks

Hi,
Please break the pipeline down, as in the reduced variants below, to see which stage causes the high load on the video converter.

Cam1 -->--> GPU --> CPU (frame processing, and data annotation ) -->
Cam2 -->--> GPU --> CPU (frame processing, and data annotation ) -->

Cam1 -->--> GPU -->
Cam2 -->--> GPU -->

Cam1 -->-->
Cam2 -->-->

You should be able to upload an image by clicking the image icon, no?

I get just “” when I click the image icon.

Thanks

Do you use Chrome?

Yes. Is there any other way to share an image? Will Dropbox work?

Hi,
You may use any free online service. We suggest you break down the pipeline to clarify which stage the high load on the video converter comes from.

Please find the profiler screenshots here:

https://www.dropbox.com/sh/6ov6fuox4ialqjr/AABtM-ISYKa_v733pubJ2-KAa?dl=0

PerfScreenShot1.png refers to the entire pipeline described in #1.

PerfScreenShot2.png refers to the following pipeline:

Cam1 (1080p @ 60) ----> renderer

Please help optimize the driver calls made in v4l2convert.so.

Thanks

Hi,
v4l2convert.so is a third-party prebuilt library in the v4l2 framework and is not owned by NVIDIA. From the result, the load appears to be triggered by the high frame rate. At 60 fps, you will see 60 capture requests per second via v4l2_ioctl.

One thing you can check is the sensor mode and output resolution. Take the onboard camera as an example. It supports 3 sensor modes:

PRODUCER: [0] W=2592 H=1944
PRODUCER: [1] W=2592 H=1458
PRODUCER: [2] W=1280 H=720

If you run

09_camera_jpeg_capture$ ./camera_jpeg_capture --disable-jpg --sensor-mode 0 --pre-res=1280x720 --cap-time 15

This puts load on the HW video converter, which does the 2592x1944 -> 1280x720 conversion.

Otherwise, if you run

09_camera_jpeg_capture$ ./camera_jpeg_capture --disable-jpg --sensor-mode 2 --pre-res=1280x720 --cap-time 15

Making the sensor mode the same as the output resolution saves the load on the HW video converter.
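
As a rough illustration of the advice above, the sketch below picks the sensor mode whose native size equals the requested output size, so that no scaling step would be needed. `SensorMode`, `pickSensorMode`, and `kOnboardModes` are hypothetical names, not part of Argus or the tegra_multimedia_api samples; the mode list is the one printed for the onboard camera.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one sensor mode (not an Argus type).
struct SensorMode {
    int width;
    int height;
};

// Modes reported for the onboard camera in the post above.
const std::vector<SensorMode> kOnboardModes = {
    {2592, 1944}, {2592, 1458}, {1280, 720}};

// Returns the index of the first mode whose size equals the requested output
// size, or -1 if none matches (in which case some scaling is unavoidable).
int pickSensorMode(const std::vector<SensorMode>& modes, int outW, int outH) {
    assert(outW > 0 && outH > 0);
    for (std::size_t i = 0; i < modes.size(); ++i) {
        if (modes[i].width == outW && modes[i].height == outH) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

For a 1280x720 output this selects index 2, which corresponds to the second command above (--sensor-mode 2).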

Hi DaneLLL

We are actually using the native sensor resolution as our final output resolution in our code. How can we zero in on what is triggering an unsolicited conversion, and how can we get rid of it?

I do not understand why a high frame rate would trigger it. Any idea what calls could be going on in v4l2convert.so? Or is there a quick and easy way for the owner of this library to offer a “no operation” version of those APIs? That could help determine whether those calls are necessary.

Furthermore, in order to imitate the simplified pipeline

Cam1 ---> renderer

I tried the argus_camera app, which seems much more efficient. From the profiler result, it seems that the app is NOT making use of v4l2convert.so. How can I find out what leads up to the calls to v4l2convert.so in my code and eliminate them?

Thanks,

Hi,
The Argus software stack is implemented as:

sensor -> ioctl(VIDIOC_DQBUF) -> captured frames in raw format -> VI/ISP -> frames in I420/NV12 format

That’s why it shows a higher load in v4l2convert.so at 60 fps. Compared to 30 fps, the number of capture requests doubles.

v4l2convert.so is part of the open source v4l2 stack: https://git.linuxtv.org/v4l-utils.git

Hi DaneLLL

Would the pipeline that you have given, i.e.

sensor -> ioctl(VIDIOC_DQBUF) -> captured frames in raw format -> VI/ISP -> frames in I420/NV12 format

apply to your argus_camera app (from tegra_multimedia_api)?

I uploaded profile data from this app (without any video capture, but running 1080p at 60 fps) at

https://www.dropbox.com/s/3ot6t7azlrdd7o7/PerScreenShot-argus-camera-app.png?dl=0

and I do not see any v4l2convert calls there. Would it be easy for me to copy this example for my pipeline, i.e. the one I described in #1?

Hi,
The pipeline is built into libargus.so. Your application should run the same capture pipeline.

The source code of argus_camera is at tegra_multimedia_api\argus\apps\camera. You may compare it with your application.
You may also download https://git.linuxtv.org/v4l-utils.git , add debug prints, and rebuild v4l2convert.so to check where the load comes from. The branch is stable-1.0.
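
One possible form for such debug prints is a small scoped timer that accumulates a call count and total elapsed time per label, so the output stays readable at 60 fps. This is a generic C++ sketch, not code from v4l-utils (which is written in C); `CallStats`, `ScopedTimer`, `g_stats`, `dumpStats`, and `convertFrameStub` are all hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <map>
#include <string>
#include <utility>

// Hypothetical per-label statistics: number of calls and total time spent.
struct CallStats {
    long calls = 0;
    std::chrono::nanoseconds total{0};
};

// Table of statistics keyed by a label chosen at the call site.
// Not thread-safe; guard it with a mutex if several threads are instrumented.
std::map<std::string, CallStats> g_stats;

// Measures the lifetime of the enclosing scope and adds it to g_stats[label].
class ScopedTimer {
public:
    explicit ScopedTimer(std::string label)
        : label_(std::move(label)), start_(std::chrono::steady_clock::now()) {
        assert(!label_.empty());
    }
    ~ScopedTimer() {
        auto elapsed = std::chrono::steady_clock::now() - start_;
        CallStats& s = g_stats[label_];
        s.calls += 1;
        s.total += std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed);
    }

private:
    std::string label_;
    std::chrono::steady_clock::time_point start_;
};

// Prints one summary line per label, e.g. once every few hundred frames.
void dumpStats() {
    for (const auto& kv : g_stats) {
        std::printf("%s: %ld calls, %.3f ms total\n", kv.first.c_str(),
                    kv.second.calls, kv.second.total.count() / 1e6);
    }
}

// Stand-in for a function under investigation.
int convertFrameStub(int x) {
    ScopedTimer timer("convertFrameStub");
    return x * 2;
}
```

Comparing the per-label call counts between a 30 fps and a 60 fps run would show whether a routine is simply called twice as often or is also slower per call.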

Hi DaneLLL

I studied and copied tegra_multimedia_api/argus/apps/camera and compared it with what I have implemented. I am using pretty much the same Argus APIs as the argus_camera app. While I am still working on this particular performance issue, I would like to ask a pipeline question related to the mapping of buffers. I suspect it could be leading to a few unwanted conversion API calls.

My pipeline looks like -

Cam1 -->--> GPU --> CPU (frame processing, and data annotation )

More specifically -

Cam1 -->-->map-and-enqueue--> GPU --> CPU  (frame processing, and data annotation )  --->dequeue-and-unmap

I am wondering whether mapping an input frame for storage in a queue could cause any format conversion.

This is what I do:

  1. Acquire a frame, map it for further access on GPU and CPU.
        UniqueObj<Frame> frame(iFrameConsumer->acquireFrame());
        IFrame *iFrame = interface_cast<IFrame>(frame);

        if (!iFrame)
            break;

        // Get the Frame's Image.
        Image *image = iFrame->getImage();

        IArgusCaptureMetadata *iArgusCaptureMetadata = interface_cast<IArgusCaptureMetadata>(frame);
        if (!iArgusCaptureMetadata)
            ORIGINATE_ERROR("Failed to get IArgusCaptureMetadata interface.");
        CaptureMetadata *metadata = iArgusCaptureMetadata->getMetadata();
        ICaptureMetadata *iMetadata = interface_cast<ICaptureMetadata>(metadata);
        if (!iMetadata)
            ORIGINATE_ERROR("Failed to get ICaptureMetadata interface.");

        EGLStream::NV::IImageNativeBuffer *iImageNativeBuffer
              = interface_cast<EGLStream::NV::IImageNativeBuffer>(image);
        TEST_ERROR_RETURN(!iImageNativeBuffer, "Failed to create an IImageNativeBuffer");

        // aaFrameBuffer is a data struct which encapsulates, few pointers and a NvBuffer.
        aaFrameBuffer *framedata = new aaFrameBuffer;
        framedata->framefd = iImageNativeBuffer->createNvBuffer(
                ARGUSSIZE {m_pCamInfo->liveParams.inputVideoInfo.width, m_pCamInfo->liveParams.inputVideoInfo.height},
                NvBufferColorFormat_YUV420, NvBufferLayout_Pitch, &status);

        NvBufferGetParams(framedata->framefd, &(framedata->nvBufParams));
        framedata->fsizeY = framedata->nvBufParams.offset[1] + (framedata->nvBufParams.offset[2]-framedata->nvBufParams.offset[1])*2;
        framedata->fsizeU = framedata->nvBufParams.pitch[1] * framedata->nvBufParams.height[1];
        framedata->fsizeV = framedata->nvBufParams.pitch[2] * framedata->nvBufParams.height[2];

        m_pCamInfo->procInfo.pitchWidthY = framedata->nvBufParams.pitch[0];
        m_pCamInfo->procInfo.pitchWidthU = framedata->nvBufParams.pitch[1];
        m_pCamInfo->procInfo.pitchWidthV = framedata->nvBufParams.pitch[2];

        AACAM_CAPTURE_PRINT("4 Starting frame caputre  %d \n",m_currentFrame);

        framedata->dataY = (char *)mmap(NULL, framedata->fsizeY, PROT_READ | PROT_WRITE, MAP_SHARED, framedata->framefd, framedata->nvBufParams.offset[0]);
        framedata->dataU = (char *)mmap(NULL, framedata->fsizeU, PROT_READ | PROT_WRITE, MAP_SHARED, framedata->framefd, framedata->nvBufParams.offset[1]);
        framedata->dataV = (char *)mmap(NULL, framedata->fsizeV, PROT_READ | PROT_WRITE, MAP_SHARED, framedata->framefd, framedata->nvBufParams.offset[2]);
  2. The frame from step 1 is put in a queue (Q).

  3. The GPU reads from Q and processes the frame.

  4. The CPU reads the GPU output and an older frame from Q.

  5. After a delay of about 8 frames, a given frame is popped from Q, unmapped, and destroyed.
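
The queueing scheme in the steps above can be sketched as a fixed-delay queue: a frame is released only once a set number of newer frames have been pushed. `DelayQueue` is a hypothetical stand-in for the actual queue of aaFrameBuffer objects, and the release callback is where the unmap and delete would go.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical fixed-delay queue: holds at most `delay` frames and hands the
// oldest one to `release` as soon as a newer frame pushes it out.
template <typename Frame>
class DelayQueue {
public:
    DelayQueue(std::size_t delay, std::function<void(Frame&)> release)
        : delay_(delay), release_(std::move(release)) {
        assert(release_ != nullptr);
    }

    // Adds the newest frame, then releases every frame beyond the delay window.
    void push(Frame f) {
        frames_.push_back(std::move(f));
        while (frames_.size() > delay_) {
            release_(frames_.front());  // e.g. unmap and free the buffer here
            frames_.pop_front();
        }
    }

    // Newest frame (what the GPU stage would read).
    const Frame& newest() const {
        assert(!frames_.empty());
        return frames_.back();
    }

    // Oldest frame still held (what the CPU stage would read).
    const Frame& oldest() const {
        assert(!frames_.empty());
        return frames_.front();
    }

    std::size_t size() const { return frames_.size(); }

private:
    std::size_t delay_;
    std::function<void(Frame&)> release_;
    std::deque<Frame> frames_;
};
```

With a delay of 8, each buffer stays mapped for the lifetime of 8 frames, so about 8 mappings per camera are live at any time.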

My question is whether mapping a frame fd could cause any conversion.

Thanks,

Hi,
Please use NvBufferMemMap() instead of mmap().
NvBufferMemMap() does not do any conversion.

Hi DaneLLL

Do you think that mmap() could be the reason for some of the v4l2convert.so calls I mentioned in #8?

Thanks

Hi,
I am not sure about the relationship between mmap() and v4l2convert.so. But for NvBuffer, you should use the APIs defined in nvbuf_utils.h:

/**
* This method must be used for hw memory cache sync for the CPU.
* @param[in] dmabuf_fd DMABUF FD of buffer.
* @param[in] plane video frame plane.
* @param[in] pVirtAddr Virtual Addres pointer of the mem mapped plane.
*
* @returns 0 for success, -1 for failure.
*/
int NvBufferMemSyncForCpu (int dmabuf_fd, unsigned int plane, void **pVirtAddr);

/**
* This method must be used for hw memory cache sync for device.
* @param[in] dmabuf_fd DMABUF FD of buffer.
* @param[in] plane video frame plane.
* @param[in] pVirtAddr Virtual Addres pointer of the mem mapped plane.
*
* @returns 0 for success, -1 for failure.
*/
int NvBufferMemSyncForDevice (int dmabuf_fd, unsigned int plane, void **pVirtAddr);

/**
* This method must be used for getting mem mapped virtual Address of the plane.
* @param[in] dmabuf_fd DMABUF FD of buffer.
* @param[in] plane video frame plane.
* @param[in] memflag NvBuffer memory flag.
* @param[in] pVirtAddr Virtual Addres pointer of the mem mapped plane.
*
* @returns 0 for success, -1 for failure.
*/
int NvBufferMemMap (int dmabuf_fd, unsigned int plane, NvBufferMemFlags memflag, void **pVirtAddr);

/**
* This method must be used to Unmap the mapped virtual Address of the plane.
* @param[in] dmabuf_fd DMABUF FD of buffer.
* @param[in] plane video frame plane.
* @param[in] pVirtAddr mem mapped Virtual Addres pointer of the plane.
*
* @returns 0 for success, -1 for failure.
*/
int NvBufferMemUnMap (int dmabuf_fd, unsigned int plane, void **pVirtAddr);
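
A minimal sketch of how the four calls declared above might replace the raw mmap() calls for CPU access to one plane. The function signatures are copied from the excerpt. The `NvBufferMemFlags` values are assumed to match nvbuf_utils.h, and the stub bodies exist only so that the sketch compiles off-device; on a Jetson, include nvbuf_utils.h and delete the stand-in section. `withMappedPlane`, `g_fakePlane`, and `g_mapped` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// ---- Stand-in section: on a Jetson, remove this and include nvbuf_utils.h ----
// Enum values are an assumption about what nvbuf_utils.h defines.
enum NvBufferMemFlags { NvBufferMem_Read, NvBufferMem_Write, NvBufferMem_Read_Write };
static std::vector<std::uint8_t> g_fakePlane(16, 0);  // pretend plane memory
static int g_mapped = 0;                              // map/unmap balance

int NvBufferMemMap(int, unsigned int, NvBufferMemFlags, void** pVirtAddr) {
    *pVirtAddr = g_fakePlane.data();
    ++g_mapped;
    return 0;
}
int NvBufferMemSyncForCpu(int, unsigned int, void**) { return 0; }
int NvBufferMemSyncForDevice(int, unsigned int, void**) { return 0; }
int NvBufferMemUnMap(int, unsigned int, void**) {
    --g_mapped;
    return 0;
}
// ---- End of stand-in section ----

// Hypothetical helper: maps one plane, syncs caches for CPU access, runs `fn`
// on the mapped pointer, syncs back for the device, and unmaps.
// Returns 0 on success and -1 if any NvBuffer call fails.
int withMappedPlane(int dmabufFd, unsigned int plane,
                    const std::function<void(std::uint8_t*)>& fn) {
    void* addr = nullptr;
    if (NvBufferMemMap(dmabufFd, plane, NvBufferMem_Read_Write, &addr) != 0) {
        return -1;
    }
    int rc = NvBufferMemSyncForCpu(dmabufFd, plane, &addr);
    if (rc == 0) {
        fn(static_cast<std::uint8_t*>(addr));
        rc = NvBufferMemSyncForDevice(dmabufFd, plane, &addr);
    }
    // Always unmap, even if a sync call failed.
    if (NvBufferMemUnMap(dmabufFd, plane, &addr) != 0) {
        rc = -1;
    }
    return rc == 0 ? 0 : -1;
}
```

In the capture code earlier in the thread, planes 0, 1, and 2 of framedata->framefd would each be mapped this way instead of via mmap(). Whether to map once per frame or keep a mapping for the buffer's lifetime is a design choice this sketch does not settle.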

Hi DaneLLL

I would like to follow up on your answer in #12:

sensor -> ioctl(VIDIOC_DQBUF) -> captured frames in raw format -> VI/ISP -> frames in I420/NV12 format

Is there a way to avoid the VIDIOC_DQBUF call? I see this happening with the argus_camera app too. Does each arrow here mean a read and write transaction to external memory (DRAM)? That would seriously increase bandwidth and degrade performance.

Is there a way for data to go directly to VI/ISP without being routed via external memory?
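
For a sense of scale, the memory traffic of one hop can be estimated from the frame size and frame rate. The sketch assumes 1920x1080 frames in I420 with no row padding, and counts one write plus one read of every frame per hop; the actual number of DRAM transactions inside the Argus stack is not stated in this thread. `i420FrameBytes` and `hopBytesPerSecond` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one I420 frame: a full-resolution Y plane plus two chroma planes
// at half width and half height (assumes even dimensions, no row padding).
constexpr std::uint64_t i420FrameBytes(std::uint64_t w, std::uint64_t h) {
    return w * h + 2 * ((w / 2) * (h / 2));
}

static_assert(i420FrameBytes(1920, 1080) == 3110400, "1080p I420 frame size");

// Traffic for one hop, counted as one write plus one read of every frame.
inline std::uint64_t hopBytesPerSecond(std::uint64_t frameBytes, std::uint64_t fps) {
    assert(fps > 0);
    return 2 * frameBytes * fps;
}
```

Under these assumptions a 1080p I420 frame is 3,110,400 bytes, so one hop at 60 fps moves about 373 MB/s per camera, twice the 30 fps figure.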

Thanks

Hi,
The Argus framework is optimal and makes no extra memory copies. All of these operations are required for sensor frame capture, so it has to use a reasonable amount of CPU and memory bandwidth.

So far we don’t have a plan to support it.