Argus CPU usage differences between JetPack 4.6.0 and 5.1.2

Hi,

We are in the process of migrating from JetPack 4.6.0 to 5.1.2 and have noticed significantly higher CPU usage by Argus on 5.1.2. Is this expected? Our setup has 8 cameras, and we are running with maximum performance settings.

On 4.6.0, streaming all 8 cameras at 60 fps uses ~180% CPU, and adding 2 copyToNvBuffer calls per frame increases that to ~240%.

On 5.1.2, streaming all 8 cameras at 48 fps uses ~270% CPU, and adding 2 copyToNvBuffer calls per frame increases that to ~420% (and only maintains ~30 fps).

1.) Is it expected that just streaming the cameras (calling acquireFrame with no other processing) uses almost 2x more CPU on 5.1.2?

2.) Is there a reason copyToNvBuffer is so much slower on 5.1.2? If I switch to IBufferOutputStream->acquireBuffer + NvBufSurfTransform instead of IFrameConsumer->acquireFrame + copyToNvBuffer, the CPU usage drops from 420% to 340% and the frame rate goes back up from 30 to 48 fps. I was under the impression that copyToNvBuffer used NvBufSurfTransform under the hood, but it must be doing something else. Should copyToNvBuffer be avoided on 5.1.2? Or is there a way to make its performance comparable to 4.6.0?
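
For context, our per-frame handling on that faster path looks roughly like the sketch below. It is simplified: the stream and buffer setup is omitted, and the Buffer-to-fd map, the helper name, and the transform parameters are illustrative rather than exactly what we ship.

```cpp
#include <map>

#include <Argus/Argus.h>
#include <nvbufsurface.h>
#include <nvbufsurftransform.h>

using namespace Argus;

// Per-frame handling for one camera on the Buffer-output-stream path.
// Assumes the stream was created with STREAM_TYPE_BUFFER, the DMABUF-backed
// Buffers were registered via IBufferOutputStream::createBuffer() at startup,
// and bufferToFd maps each Argus::Buffer* back to its dmabuf fd.
static bool processOneFrame(IBufferOutputStream *iBufferStream,
                            const std::map<Buffer *, int> &bufferToFd,
                            int dstFd)
{
    Status status = STATUS_OK;

    // Wait for libargus to fill the next buffer in the ring.
    Buffer *buffer = iBufferStream->acquireBuffer(TIMEOUT_INFINITE, &status);
    if (status != STATUS_OK || !buffer)
        return false;

    int srcFd = bufferToFd.at(buffer);

    NvBufSurface *src = NULL;
    NvBufSurface *dst = NULL;
    if (NvBufSurfaceFromFd(srcFd, (void **)&src) != 0 ||
        NvBufSurfaceFromFd(dstFd, (void **)&dst) != 0) {
        iBufferStream->releaseBuffer(buffer);
        return false;
    }

    // Straight copy/scale; no per-frame allocation and no CPU memcpy.
    NvBufSurfTransformParams params = {};
    params.transform_flag = NVBUFSURF_TRANSFORM_FILTER;
    params.transform_filter = NvBufSurfTransformInter_Nearest;
    NvBufSurfTransformError err = NvBufSurfTransform(src, dst, &params);

    // Return the buffer so the capture pipeline can reuse it.
    iBufferStream->releaseBuffer(buffer);
    return err == NvBufSurfTransformError_Success;
}
```

The point is that after setup this path does no per-frame allocation: just acquireBuffer, one transform, and releaseBuffer.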

Hi,
It is possible that JetPack 5 has higher CPU usage since the kernel version is different. If you are concerned about CPU usage, please consider staying on the JetPack 4 release.

Hi DaneLLL,

Can you give any more detail on the copyToNvBuffer function and why it's so much slower than using NvBufSurfTransform? What is it doing exactly?

Hi,
Yes, the implementation is identical. Do you create multiple NvBufSurface by calling createNvBuffer()? If there is only a single NvBufSurface, the CPU usage may come from polling on that single NvBufSurface. If there are multiple NvBufSurface, they can run as ring buffers.
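
For example, a ring of destination buffers for one copy target could be allocated roughly like this (illustrative sketch only; the helper name, ring size, and color format are arbitrary choices):

```cpp
#include <vector>

#include <Argus/Argus.h>
#include <EGLStream/EGLStream.h>
#include <EGLStream/NV/ImageNativeBuffer.h>
#include <nvbufsurface.h>

// Illustrative ring of destination NvBufSurface dmabuf fds for one copy target.
// RING_SIZE and the color format/layout are arbitrary choices for this sketch.
static const int RING_SIZE = 3;

static std::vector<int> createDstRing(EGLStream::Image *image,
                                      const Argus::Size2D<uint32_t> &size)
{
    NV::IImageNativeBuffer *iNativeBuffer =
            Argus::interface_cast<NV::IImageNativeBuffer>(image);

    std::vector<int> fds;
    for (int i = 0; i < RING_SIZE; i++) {
        // Each call allocates a separate NvBufSurface and returns its dmabuf fd.
        fds.push_back(iNativeBuffer->createNvBuffer(size,
                NVBUF_COLOR_FORMAT_YUV420, NVBUF_LAYOUT_PITCH));
    }
    // Rotate through these fds per frame instead of reusing a single one.
    return fds;
}
```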

Each copy calls createNvBuffer on the first frame. On subsequent frames it calls copyToNvBuffer with the fd that was returned by createNvBuffer. With 8 cameras and 2 copies per camera, there are 16 calls to createNvBuffer.
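
Roughly, each copy does the following per frame (simplified sketch; error handling, teardown, and our actual sizes/formats are omitted):

```cpp
#include <Argus/Argus.h>
#include <EGLStream/EGLStream.h>
#include <EGLStream/NV/ImageNativeBuffer.h>
#include <nvbufsurface.h>

using namespace Argus;

// Simplified per-camera consumer loop: createNvBuffer once on the first frame,
// then copyToNvBuffer into that same fd on every subsequent frame.
static void consumeFrames(EGLStream::IFrameConsumer *iFrameConsumer,
                          Size2D<uint32_t> size)
{
    int dmabufFd = -1;

    while (true) {
        UniqueObj<EGLStream::Frame> frame(
                iFrameConsumer->acquireFrame(TIMEOUT_INFINITE));
        EGLStream::IFrame *iFrame = interface_cast<EGLStream::IFrame>(frame);
        if (!iFrame)
            break;

        NV::IImageNativeBuffer *iNativeBuffer =
                interface_cast<NV::IImageNativeBuffer>(iFrame->getImage());

        if (dmabufFd == -1) {
            // First frame: allocate the destination NvBufSurface once.
            dmabufFd = iNativeBuffer->createNvBuffer(size,
                    NVBUF_COLOR_FORMAT_YUV420, NVBUF_LAYOUT_PITCH);
        } else {
            // Subsequent frames: copy into the same fd. This is the call whose
            // cost went up on 5.1.2 in our measurements.
            iNativeBuffer->copyToNvBuffer(dmabufFd);
        }
    }
}
```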

I did some profiling with perf comparing IFrameConsumer + copyToNvBuffer versus IBufferOutputStream + NvBufSurfTransform. The copyToNvBuffer case shows several memory-related functions taking significant time that do not show up in the NvBufSurfTransform perf result:

  • nvmap_ioctl_alloc
  • nvmap_ioctl_create
  • nvmap_ioctl_getfd
  • __mmap
  • __munmap
  • __memcpy_generic
  • __GI___memset_generic
  • dma_buf_release

When using copyToNvBuffer, the nvmap-bz process also shows high CPU usage, which does not occur when using NvBufSurfTransform. copyToNvBuffer seems to be doing a lot of extra processing for some reason.

Also interesting: in both cases __getpid shows significant CPU usage.

Hi DaneLLL,

Could you verify that you see the same increase in CPU usage when using IFrameConsumer + copyToNvBuffer versus IBufferOutputStream + NvBufSurfTransform on JP5? If the usage truly is that much higher, we'll convert our code over, but we're using some custom Argus library builds from D3 that have fixes and patches applied, and I want to make sure that isn't a contributing factor here.

There has been no update from you for a while, so we are assuming this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,
We will set up a developer kit and try to replicate the issue. Could you share a test sample that shows the issue?