Split video and share with other processes

Hi

My goal is to split 3 2160p@30fps videos into 12 1080p@30fps videos and share these with 12 other processes.

But I'm running into performance issues after the 10th video; up to that point it works perfectly.

I've already invested several days into this issue, so here's what I've tried so far:

I set the VIC to max performance (see Nvvideoconvert issue, nvvideoconvert in DS4 is better than Ds5? - #3 by DaneLLL).

I'm splitting the video using GStreamer:

filesrc location=/media/developer/hero ! matroskademux name=demux1 demux1. ! nvv4l2decoder ! video/x-raw(memory:NVMM),format=NV12 ! queue ! tee name=t1
  t1. ! queue ! nvvidconv left=0 right=1920 top=0 bottom=1080 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! queue ! appsink name=appsink0 sync=true
  t1. ! queue ! nvvidconv left=1920 right=3840 top=0 bottom=1080 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! queue ! appsink name=appsink1 sync=true
  t1. ! queue ! nvvidconv left=0 right=1920 top=1080 bottom=2160 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! queue ! appsink name=appsink2 sync=true
  t1. ! queue ! nvvidconv left=1920 right=3840 top=1080 bottom=2160 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! queue ! appsink name=appsink3 sync=true
  t1. ! queue ! nvvidconv left=0 right=1920 top=0 bottom=1080 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! queue ! appsink name=appsink4 sync=true
  t1. ! queue ! nvvidconv left=1920 right=3840 top=0 bottom=1080 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! queue ! appsink name=appsink5 sync=true
  t1. ! queue ! nvvidconv left=0 right=1920 top=1080 bottom=2160 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! queue ! appsink name=appsink6 sync=true
  t1. ! queue ! nvvidconv left=1920 right=3840 top=1080 bottom=2160 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! queue ! appsink name=appsink7 sync=true
  t1. ! queue ! nvvidconv left=0 right=1920 top=0 bottom=1080 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! queue ! appsink name=appsink8 sync=true
  t1. ! queue ! nvvidconv left=1920 right=3840 top=0 bottom=1080 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! queue ! appsink name=appsink9 sync=true
  t1. ! queue ! nvvidconv left=0 right=1920 top=1080 bottom=2160 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! queue ! appsink name=appsink10 sync=true
  t1. ! queue ! nvvidconv left=1920 right=3840 top=1080 bottom=2160 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! queue ! appsink name=appsink11 sync=true

The splitting works VERY well and I get above 100 fps when sync is false. Awesome!

In the 12 GStreamer appsinks (12 different threads) I use 12 different NvBufferSessions to copy the buffers using NvBufferTransformEx.
The buffer I'm copying to is set up with the same width/height, NV12 and BlockLinear, and it gets reused.
This copy operation seems quite expensive to me, because the framerate drops to less than 75 fps. Any idea how to improve this rate?
What's weird is that when I transfer that file descriptor to another process and copy it there using NvBufferTransformEx, I get less than 70 fps, although I use the exact same function.
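For context, here's a minimal sketch of what one of these appsink callbacks does. It uses the plain NvBufferTransform variant purely for illustration (my real code calls NvBufferTransformEx), and CopyContext / dst_fd are just names I'm using here:

    // Sketch: per-appsink "new-sample" callback that pulls the NVMM sample,
    // extracts its dmabuf fd and copies it into a pre-allocated destination buffer.
    #include <gst/gst.h>
    #include <gst/app/gstappsink.h>
    #include "nvbuf_utils.h"

    struct CopyContext {        // hypothetical per-thread state
      int dst_fd;               // destination NvBuffer dmabuf fd, created once and reused
      NvBufferSession session;  // one NvBufferSession per thread
    };

    static GstFlowReturn on_new_sample(GstAppSink *sink, gpointer user_data) {
      auto *ctx = static_cast<CopyContext *>(user_data);
      GstSample *sample = gst_app_sink_pull_sample(sink);
      if (sample == nullptr)
        return GST_FLOW_ERROR;

      GstBuffer *buffer = gst_sample_get_buffer(sample);
      GstMapInfo map{};
      if (gst_buffer_map(buffer, &map, GST_MAP_READ)) {
        int src_fd = -1;
        ExtractFdFromNvBuffer(map.data, &src_fd);  // dmabuf fd of the NVMM buffer

        // Same-size copy into the reused destination buffer.
        NvBufferTransformParams params{};
        params.transform_flag = NVBUFFER_TRANSFORM_FILTER;
        params.transform_filter = NvBufferTransform_Filter_Nearest;
        params.session = ctx->session;
        NvBufferTransform(src_fd, ctx->dst_fd, &params);  // real code calls NvBufferTransformEx

        gst_buffer_unmap(buffer, &map);
      }
      gst_sample_unref(sample);
      return GST_FLOW_OK;
    }

The callback is registered on each appsink (with emit-signals enabled) via g_signal_connect(appsink, "new-sample", G_CALLBACK(on_new_sample), ctx).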

I lose even more time when each child process then encodes the stream:

appsrc name=mysource block=true
  ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12,framerate=30/1
  ! queue
  ! nvv4l2h264enc maxperf-enable=1
  ! queue
  ! h264parse
  ! matroskamux
  ! filesink location=out sync=false

When I push the resulting file descriptors to another GStreamer pipeline that encodes them, we get around 25 fps and a very unreliable stream.

When I don't do any NvBufferTransformEx I'm able to split into 12 1080p streams and encode to H.264 at 90 fps.

Things work best when I remove all NvBufferTransformEx calls and do all the work in one process.
Things get worse when I add "appsink → NvBufferTransformEx → appsrc": I can do 30 fps, but it's on the edge.
As soon as inter-process sharing comes into play, the framerate drops further, to ~25 fps.
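For completeness, the inter-process part is just handing the destination buffer's dmabuf fd to the child over a Unix domain socket with SCM_RIGHTS, roughly like this (a minimal sketch; send_fd is a name I'm using here, socket setup and error handling are omitted):

    // Pass an open file descriptor (e.g. an NvBuffer dmabuf fd) to another
    // process over an already-connected Unix domain socket.
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <string.h>

    static int send_fd(int unix_socket, int fd_to_send) {
      char payload = 'F';  // dummy byte; sendmsg needs at least some regular data
      struct iovec iov = {&payload, 1};
      alignas(cmsghdr) char cmsg_buf[CMSG_SPACE(sizeof(int))];
      memset(cmsg_buf, 0, sizeof(cmsg_buf));

      struct msghdr msg = {};
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = cmsg_buf;
      msg.msg_controllen = sizeof(cmsg_buf);

      struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;  // transfer the fd as ancillary data
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

      return sendmsg(unix_socket, &msg, 0) < 0 ? -1 : 0;
    }

The receiving process gets its own fd (via recvmsg) that refers to the same hardware buffer, so the transfer itself copies no pixel data.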

So far all my research points to NvBufferTransformEx. It's "slow" at copying buffers, and even slower between two processes, although each copy happens in its own thread with its own NvBufferSession, and buffers/sessions/file descriptors are reused. The buffers are set up with NV12 and BlockLinear layout.

Any other tips to improve the speed of NvBufferTransformEx, apart from running the VIC at its highest frequency?

Is there another way to share a video buffer between processes?

This thread is a follow-up to Split video into 4 smaller ones - #7 by qwertzui11.

Any help would be greatly appreciated,
Cheers,
Markus

Not sure for your case, but you may adjust the crop parameters so that the pixel-aspect-ratio stays 1/1:

t1. ! queue ! nvvidconv left=0 right=1919 top=0 bottom=1079 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12,pixel-aspect-ratio=1/1 ! ...

Hi, thanks for your great ideas!

I tried both:

  • I reduced right and bottom from 1920 to 1919.
  • I set pixel-aspect-ratio=1/1

Sadly, neither had any impact on performance.

However, I get the feeling that NvBufferTransformEx is doing some rather expensive operation.

Here's my code for the transform (copy):

      // Per-thread transform parameters; the session is created once per thread
      // and reused for every copy.
      NvBufferTransformParams transform_params{};
      memset(&transform_params, 0, sizeof(NvBufferTransformParams));
      if (session == nullptr) {
        session = NvBufferSessionCreate();
      }
      transform_params.session = session;
      // Full-frame copy: source and destination rectangles cover the whole 1080p buffer.
      transform_params.src_rect = {0, 0, source_params.params.width[0],
                                   source_params.params.height[0]};
      transform_params.dst_rect = {0, 0, target_params.first.width[0],
                                   target_params.first.height[0]};
      assert(transform_params.src_rect.top == 0);
      assert(transform_params.src_rect.left == 0);
      assert(transform_params.src_rect.width == 1920);
      assert(transform_params.src_rect.height == 1080);
      assert(transform_params.dst_rect.top == 0);
      assert(transform_params.dst_rect.left == 0);
      assert(transform_params.dst_rect.width == 1920);
      assert(transform_params.dst_rect.height == 1080);
      assert(source_params.params_ex.params.pixel_format ==
             target_params.second.params.pixel_format);
      // ... the actual NvBufferTransformEx() call with the source/target dmabuf fds
      // follows here (omitted from this excerpt).

And here's how I create the target buffer:

        // Destination buffer: same size and format as the cropped source,
        // block-linear layout, created once and then reused for every frame.
        NvBufferCreateParams create_params{};
        create_params.width = source_params.params.width[0];
        create_params.height = source_params.params.height[0];
        create_params.colorFormat = source_params.params.pixel_format;
        create_params.layout = NvBufferLayout_BlockLinear;
        assert(create_params.width == 1920);
        assert(create_params.height == 1080);
        assert(create_params.colorFormat == NvBufferColorFormat_NV12);
        assert(create_params.layout == NvBufferLayout_BlockLinear);
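
The create call itself isn't in the excerpt above; for completeness it looks roughly like this (target_fd is just the name I'm using here):

        // Allocate the destination hardware buffer once; the returned dmabuf fd
        // is then reused for every copy and destroyed when no longer needed.
        int target_fd = -1;
        if (NvBufferCreateEx(&target_fd, &create_params) != 0) {
          // handle allocation failure
        }
        // ... later: NvBufferDestroy(target_fd);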

Do you, @Honey_Patouceul, or anyone else, see any mistakes that could lead to an actual transform instead of a (supposedly cheap) copy?

Cheers,
Markus

Hi,
We haven't verified NvBufferTransformEx() across 12 processes, and it may not be able to achieve the target frame rate in this use case. One possible solution is to have a single process set up 12 UDP streams, so that each of the other processes can receive its own individual UDP stream.
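A rough sketch of what this could look like (host, ports and encoder settings here are only illustrative): in the single splitting process, each cropped branch would end in an encoder and a udpsink instead of an appsink, e.g.

  t1. ! queue ! nvvidconv left=0 right=1919 top=0 bottom=1079 ! video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12 ! nvv4l2h264enc maxperf-enable=1 insert-sps-pps=1 ! h264parse ! rtph264pay ! udpsink host=127.0.0.1 port=5000 sync=false

and each receiving process would decode its own stream with something like

  udpsrc port=5000 caps="application/x-rtp,media=video,encoding-name=H264,payload=96" ! rtph264depay ! h264parse ! nvv4l2decoder ! ...

The trade-off is one extra encode/decode per crop instead of sharing dmabuf fds across processes.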


Thanks for the answer!

This would mean that I'd have to access the raw video memory and forward it to the sub-processes via UDP?
So the work would happen on the CPU and NOT with DMA memory, correct?
