Finding the bottleneck in video stitching application

Hi,

my team is working on a video stitching application.
We are using 5 CSI cameras and stitch them to one 180° panorama image.
While the application definitely needs some performance improvement, I am having trouble profiling the bottleneck of the application.

Here are some details:

  • We can grab 5 cameras at 30 FPS (using gstreamer)
  • We can run the stitching algorithm with 30 FPS (only with 5 videotestsrc)
  • If we feed the cameras into the stitching algorithm we end up at 15 FPS
  • ARM/GPU load is at 60-70% for camera -> stitching -> encoding pipeline at 15 fps

The problem is that we cannot clearly see which hardware is limiting here. A quick look at tegrastats looks fine:

RAM 2703/3995MB (lfb 142x4MB) cpu [72%,70%,63%,65%]@1734 EMC 46%@1600 AVP 2%@12 NVDEC 192 MSENC 192 GR3D 33%@998 EDP limit 1734

Can somebody give me some hints on how I can find the bottleneck? I suspect some kind of memory bandwidth issue, but how can I prove this?

Is this possible with Tegra System Profiler?

Kind regards,
Christian

Hi crossfire
How do you pass the frame data to your algorithm? The memory copies could take time and hurt performance.

The camera streaming part is usually delay sensitive. If your CPU is busy with some computation task and misses the proper window to stream this frame, it has to wait for the next frame to come. If the camera is supposed to stream at 30 FPS, one frame time is 33 ms, so the gstreamer task must be scheduled in less than 33 ms to grab every frame.
Usually you may want to use multithreading and multiple CPU cores to ensure real-time performance.

Another observation is that at 15 FPS your CPU load is already 60-70%, which means that if your pipeline ran at 30 FPS, your CPU load would be 120-140%. You are actually overloading your computer.

Are you using a network camera? If so, you can use the Wireshark tool (www.wireshark.org) to analyze the streaming timing. Compare the difference before and after engaging the CPU pipeline.

First, thank you for your response,

I am using gstreamer appsink to grab the frames from a nvcamerasrc. This way I can easily exchange the nvcamerasrc with other gstreamer elements (e.g. videotestsrc) to test the performance of the algorithm.
The data is then copied from the gstreamer provided memory to CUDA allocated memory with cudaMemcpy2D.
I got 30 FPS with videotestsrc and 15 with nvcamerasrc.
I just have to find out which resources the cameras are using and what the bottleneck is.
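For reference, the grab-and-upload step looks roughly like this (a sketch only; the function signature, buffer names, and sizes are illustrative, not the actual application code):

```cuda
#include <gst/gst.h>
#include <gst/app/gstappsink.h>
#include <cuda_runtime.h>

// Pull one frame from the appsink and copy it into pitched device memory.
void upload_frame(GstAppSink *appsink, void *d_frame, size_t d_pitch,
                  size_t row_bytes, size_t rows)
{
    GstSample *sample = gst_app_sink_pull_sample(appsink);
    if (!sample) return;  // EOS or error

    GstBuffer *buf = gst_sample_get_buffer(sample);
    GstMapInfo map;
    if (gst_buffer_map(buf, &map, GST_MAP_READ)) {
        // Copy from gstreamer-provided (pageable) host memory to the GPU.
        cudaMemcpy2D(d_frame, d_pitch,
                     map.data, row_bytes,   // source pointer and pitch
                     row_bytes, rows,       // width in bytes, height in rows
                     cudaMemcpyHostToDevice);
        gst_buffer_unmap(buf, &map);
    }
    gst_sample_unref(sample);
}
```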

The appsink has an integrated buffer that fills when the frames are not pulled out of the appsink. I don’t think that this is the problem here.

I am already using multithreading. I have one thread that just copies the data to GPU-allocated memory, one thread that calls the CUDA kernel for stitching, and one thread that copies the data back to gstreamer for encoding.

I am getting 15 FPS when I want to achieve 30 FPS; I think that is quite a difference. Also, my application is not CPU intensive, since I am using CUDA for the stitching, and I can get 30 FPS with videotestsrc.

I am using CSI MIPI cameras

Hi crossfire
I think you need to profile all elements to know how much time each of them consumes. Usually frame drops are caused by the queue backing up behind the element that takes the most time to process.

Thank you Shane.

The problem here is how to find out which resources are overloaded.
Can you give some hints, tutorials, etc. on how I could use Tegra System Profiler to find the bottleneck?

Also, does anyone experience frame drops when using nvcamerasrc?

Hi
You can add a timestamp at the start and end of each step and print it out to check which step takes the most time.

Hi Shane,
I spent some time taking a look at Tegra System Profiler.
I compared my videotestsrc version with the nvcamerasrc version and found that nvvidconv actually seems to be the problem.

Both versions create the same amount of data.

This is the CPU load of the videotestsrc:


And this the load for the nvcamerasrc:

It seems that the nvvidconv element consumes a lot of CPU. Even more than videotestsrc, which has to generate the frame first.

Can you explain this behavior?

Also, the name copy_to_user suggests that a lot of memory copies are happening, which I previously suspected as the bottleneck.

A look at the timeline gives more information:
This is how the timeline for videotestsrc looks like:


And this is for nvcamerasrc:

The first one shows that the frames can easily be processed in sequential order. In the nvcamerasrc example, it seems that the memory copies and kernel executions need much more time, and they start to overlap. The overlap itself is good, but it shows there is a huge slowdown.
The kernel execution time has nearly doubled for nvcamerasrc.

The second image shows that the HtoD and DtoH copies seem to block each other. This is because the memory copied to/from the host is not page-locked, which rules out concurrent copies. However, the memory comes from gstreamer and I cannot find a way to create pinned memory; cudaHostRegister() is not available on the TX1.
Can someone help here?
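One workaround I am considering (a sketch only, assuming the extra CPU copy is acceptable): stage the gstreamer frame through a page-locked buffer from cudaMallocHost so the device transfer can run as cudaMemcpyAsync on a stream. The function and variable names here are illustrative:

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Copy a gstreamer-owned frame to the GPU via a pinned staging buffer.
void upload_via_staging(const void *gst_data, size_t bytes,
                        void *d_dst, cudaStream_t stream)
{
    static void *staging = nullptr;          // page-locked staging buffer
    static size_t staging_size = 0;
    if (staging_size < bytes) {
        if (staging) cudaFreeHost(staging);
        cudaMallocHost(&staging, bytes);     // pinned allocation
        staging_size = bytes;
    }
    std::memcpy(staging, gst_data, bytes);   // extra CPU copy, but it
    cudaMemcpyAsync(d_dst, staging, bytes,   // enables async, overlappable
                    cudaMemcpyHostToDevice, stream);  // HtoD transfers
}
```

Whether this wins anything depends on whether the extra memcpy costs less than the serialization it removes.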

Anyway, I am currently trying to use libargus to avoid the usage of nvvidconv. Do you think this will increase the performance?

Hi
If you use CUDA, you can refer to the 11_camera_object_identification MMAPI sample code for your case.

Thank you, I will take a look at this example.
Any suggestions how to parallelize HtoD and DtoH copies?

Hi, to get forward with this issue, I added the qdrep files from the Tegra System Profiler to this thread.
TegraSystemProfiler.zip (14.7 MB)

I’ve noticed that in your nvcamerasrc case, the 6 HtoD memcpy’s are more spread apart. I guess that is because, after capturing each video frame from the MIPI camera, it takes some additional CPU cycles to run ‘nvvidconv’ (this video format conversion is not needed in the videotestsrc case).

Based on this observation, my first try would be to parallelize these ‘nvvidconv’ tasks with 6 threads, and call HtoD memcpy’s only when all 6 threads have finished the image capture and conversion.

Your remap kernel takes longer in the nvcamerasrc case. As you have pointed out, that is partly due to it overlapping with the HtoD transfers (overlapped GPU memory accesses). But I don’t get why the DtoH memcpy’s take so much longer in this case.

Hi jkjunk,
thank you for your answer. The nvvidconv elements already run in parallel in 5 gstreamer threads. Also, the appsrc element has a queue that allows buffering frames; the queue should always be filled at that low framerate. I think the spread comes from a system bottleneck created by nvvidconv.

I am trying to avoid nvvidconv by using libargus.
I got quite far with the example that ShaneCCC has posted.

I am getting to the part where I got the NvBuffer:

IFrameConsumer *iFrameConsumer = interface_cast<IFrameConsumer>(m_consumer);
UniqueObj<Frame> frame(iFrameConsumer->acquireFrame());

IFrame *iFrame = interface_cast<IFrame>(frame);

NV::IImageNativeBuffer *iNativeBuffer =
    interface_cast<NV::IImageNativeBuffer>(iFrame->getImage());

auto fd = iNativeBuffer->createNvBuffer(Size(1640, 1232),
                                        NvBufferColorFormat_YUV420,
                                        NvBufferLayout_Pitch);

The example continues from here using v4l2_buffer, which I do not need.

My question is now, how can I get a cuda pointer from the NvBuffer that I can use for my kernel?
Any Argus experts here?

Hi,

You need to register the image via EGL to make it CUDA-accessible.

The flow is like this:
V4L2 buffer (dmabuf_fd) -> EGLImageKHR (via cuGraphicsEGLRegisterImage) -> CUDA array (pDevPtr)

Please find MMAPI sample for details:
‘/home/ubuntu/tegra_multimedia_api/samples/backend/v4l2_backend_main.cpp’

Thank you Aasta,

We will try to figure this out.

So the way to go is:
OutputStream -> FrameConsumer -> IFrame -> IImageNativeBuffer -> NvBuffer -> v4l2_buffer -> EGLImageKHR -> CUDA array

Is this correct?
Is this the most efficient way to push an image from libargus to cuda?
Where in the pipeline will the data get copied?

I am asking because I wanted to use libargus to decrease the number of data copies.

Hi,

Thanks for your feedback.
My previous response may have caused some confusion.
You don’t need to translate the camera source into a v4l2_buffer to make it CUDA-accessible. The key point is to register the camera source with EGL.

For your use case, the pipeline should be EGLStream -> CUeglFrame -> CUDA array.
For argus -> cuda, please refer to this sample for more information:
‘/home/ubuntu/tegra_multimedia_api/argus/samples/cudaHistogram/’
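A rough sketch of that path, modeled on the cudaHistogram sample (error handling omitted; eglStream would come from the Argus OutputStream, and the helper names here are illustrative):

```cuda
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <cuda.h>
#include <cudaEGL.h>

static CUeglStreamConnection conn;

// Attach the CUDA consumer to the EGLStream produced by Argus.
void connect_consumer(EGLStreamKHR eglStream) {
    cuEGLStreamConsumerConnect(&conn, eglStream);
}

// Acquire one frame, get a device-accessible CUeglFrame, release it.
void process_one_frame(CUstream stream) {
    CUgraphicsResource resource = nullptr;
    cuEGLStreamConsumerAcquireFrame(&conn, &resource, &stream, -1);

    CUeglFrame frame;
    cuGraphicsResourceGetMappedEglFrame(&frame, resource, 0, 0);
    // frame.frame.pPitch[0] is now a device pointer the stitching kernel
    // can read directly -- no extra host copy involved.

    cuEGLStreamConsumerReleaseFrame(&conn, resource, &stream);
}
```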

Hi Aasta,

thank you, I already found the cudaHistogram example last week and implemented a camera grabber using libargus.
Unfortunately, it does not fix the initial problem. A look at tegrastats shows high memory usage:
For libargus

RAM 3050/3995MB (lfb 7x4MB) cpu [43%,43%,59%,66%]@1734 EMC 23%@1600 AVP 8%@12 NVDEC 268 MSENC 268 GR3D 0%@998 EDP limit 1734
RAM 3208/3995MB (lfb 7x4MB) cpu [80%,53%,50%,38%]@1734 EMC 32%@1600 AVP 1%@12 NVDEC 268 MSENC 268 GR3D 0%@998 EDP limit 1734
RAM 3209/3995MB (lfb 7x4MB) cpu [83%,52%,60%,49%]@1734 EMC 36%@1600 AVP 1%@12 NVDEC 268 MSENC 268 GR3D 0%@998 EDP limit 1734
RAM 3585/3995MB (lfb 6x4MB) cpu [85%,60%,53%,62%]@1734 EMC 42%@1600 AVP 1%@12 NVDEC 268 MSENC 268 GR3D 42%@998 EDP limit 1734
RAM 3804/3995MB (lfb 1x4MB) cpu [86%,84%,84%,84%]@1734 EMC 48%@1600 AVP 1%@12 NVDEC 268 MSENC 268 GR3D 12%@998 EDP limit 1734
RAM 3810/3995MB (lfb 1x4MB) cpu [84%,84%,83%,84%]@1734 EMC 52%@1600 AVP 1%@12 NVDEC 716 MSENC 716 GR3D 13%@998 EDP limit 1734
RAM 3812/3995MB (lfb 1x4MB) cpu [85%,85%,86%,80%]@1734 EMC 54%@1600 AVP 1%@12 NVDEC 716 MSENC 716 GR3D 57%@998 EDP limit 1734
RAM 3821/3995MB (lfb 1x4MB) cpu [87%,84%,88%,83%]@1734 EMC 56%@1600 AVP 1%@12 NVDEC 716 MSENC 716 GR3D 38%@998 EDP limit 1734
RAM 3828/3995MB (lfb 1x4MB) cpu [87%,84%,89%,85%]@1734 EMC 56%@1600 AVP 1%@12 NVDEC 716 MSENC 716 GR3D 67%@998 EDP limit 1734
RAM 3828/3995MB (lfb 1x4MB) cpu [89%,83%,90%,87%]@1734 EMC 56%@1600 AVP 1%@12 NVDEC 716 MSENC 716 GR3D 46%@998 EDP limit 1734
RAM 3825/3995MB (lfb 1x4MB) cpu [88%,87%,87%,83%]@1734 EMC 56%@1600 AVP 1%@12 NVDEC 268 MSENC 268 GR3D 5%@998 EDP limit 1734
RAM 3826/3995MB (lfb 1x4MB) cpu [87%,86%,85%,88%]@1734 EMC 56%@1600 AVP 1%@12 NVDEC 716 MSENC 716 GR3D 13%@998 EDP limit 1734
RAM 3828/3995MB (lfb 1x4MB) cpu [90%,86%,86%,89%]@1734 EMC 56%@1600 AVP 1%@12 NVDEC 268 MSENC 268 GR3D 14%@998 EDP limit 1734

For gstreamer using videotestsrc (same amount of data):

RAM 2203/3995MB (lfb 7x4MB) cpu [36%,51%,41%,43%]@1734 EMC 33%@1600 AVP 15%@12 NVDEC 716 MSENC 716 GR3D 9%@998 EDP limit 1734
RAM 2203/3995MB (lfb 7x4MB) cpu [44%,54%,42%,39%]@1734 EMC 37%@1600 AVP 22%@12 NVDEC 716 MSENC 716 GR3D 35%@998 EDP limit 1734
RAM 2203/3995MB (lfb 7x4MB) cpu [43%,39%,37%,34%]@1734 EMC 38%@1600 AVP 22%@12 NVDEC 716 MSENC 716 GR3D 22%@998 EDP limit 1734
RAM 2203/3995MB (lfb 7x4MB) cpu [46%,52%,46%,36%]@1734 EMC 38%@1600 AVP 29%@12 NVDEC 716 MSENC 716 GR3D 20%@998 EDP limit 1734
RAM 2203/3995MB (lfb 7x4MB) cpu [40%,56%,45%,39%]@1734 EMC 38%@1600 AVP 29%@12 NVDEC 716 MSENC 716 GR3D 16%@998 EDP limit 1734
RAM 2203/3995MB (lfb 7x4MB) cpu [41%,53%,36%,27%]@1734 EMC 38%@1600 AVP 15%@12 NVDEC 716 MSENC 716 GR3D 14%@998 EDP limit 1734
RAM 2203/3995MB (lfb 7x4MB) cpu [53%,50%,35%,44%]@1734 EMC 38%@1600 AVP 23%@12 NVDEC 716 MSENC 716 GR3D 6%@998 EDP limit 1734
RAM 2203/3995MB (lfb 7x4MB) cpu [48%,52%,45%,35%]@1734 EMC 38%@1600 AVP 23%@12 NVDEC 716 MSENC 716 GR3D 9%@998 EDP limit 1734
RAM 2203/3995MB (lfb 7x4MB) cpu [33%,33%,31%,47%]@1734 EMC 38%@1600 AVP 16%@12 NVDEC 716 MSENC 716 GR3D 6%@998 EDP limit 1734
RAM 2202/3995MB (lfb 6x4MB) cpu [31%,25%,35%,33%]@1734 EMC 28%@1600 AVP 30%@12 NVDEC 268 MSENC 268 GR3D 0%@998 EDP limit 1734

The RAM and EMC usage still hint that there is some significant problem with grabbing the cameras.
What could that be?

I got 25 FPS (it is fixed at that) for videotestsrc and 16 FPS for libargus (the same as I got for nvcamerasrc with gstreamer).

Gstreamer also seems to utilize the AVP while libargus does not.

Hi crossfire
Could you try a dummy stitch process to check whether the frame rate increases?