NvVideoEncoder: maximum number of streams and VIC compositions

biglofty · September 18, 2020, 10:41am

Hi,

I am using an AGX Xavier 32GB. The datasheet lists the maximum number of encoder streams as 32 at 1080p30.
So I ran 32 encoder streams at 1080p30, and it works pretty well.
But when I simultaneously ran 30 streams of 1080p30 NvBufferComposite on the VIC, the encoder throughput dropped significantly. The composition runs in a different process.

So does the VIC composition really affect the encoder throughput? If yes, why?

In each of the 32 encoder streams, there is a decoder before the encoder. The decoder capture plane and the encoder output plane both use V4L2_MEMORY_DMABUF. After the decoder, an NvBufferTransform converts the frames from block linear to pitch linear.
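To put the setup described above into numbers, here is a rough pixel-rate sketch. It assumes that each NvBufferTransform and each NvBufferComposite output frame costs one full 1080p pass on the VIC (the default engine for these calls, as far as I know); the real cost depends on formats, scaling, and the number of composited inputs. The function name `pixel_rate` is made up for illustration.

```python
# Rough pixel-rate budget for 32 decode+transform+encode streams plus
# 30 composition streams, all at 1080p30.
# Assumption: one transform or one composited output frame equals one
# full-frame pass on the VIC. This is a simplification, not a datasheet figure.

WIDTH, HEIGHT, FPS = 1920, 1080, 30


def pixel_rate(streams, width=WIDTH, height=HEIGHT, fps=FPS):
    """Output pixels per second for `streams` streams at the given size and rate."""
    return streams * width * height * fps


encode_rate = pixel_rate(32)      # 32 encoder streams (NVENC)
transform_rate = pixel_rate(32)   # 32 block-linear to pitch-linear transforms
composite_rate = pixel_rate(30)   # 30 composition streams, output frames only

vic_rate = transform_rate + composite_rate

print(f"NVENC input: {encode_rate / 1e6:.1f} Mpixel/s")
print(f"VIC output : {vic_rate / 1e6:.1f} Mpixel/s")
print(f"VIC load relative to transforms only: {vic_rate / transform_rate:.2f}x")
```

Under these assumptions the composition streams nearly double the work on the VIC, which also sits in the path of every encoder stream. That is only an estimate of demand; it says nothing about how much the VIC can actually sustain.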

Thanks.

DaneLLL · September 21, 2020, 3:02am

Hi,
We enabled DFS for the VIC engine in r32.4.3. This may be the reason for the performance drop. Please execute the steps and try again:

biglofty · September 21, 2020, 6:13am

Thanks DaneLLL,

I have already set the VIC to run at the maximum frequency, using the method you provided in the thread.

I have also used NvBufferSessionCreate in the NvBufferComposite streams.

And all the encoders are configured with setHWPresetType(V4L2_ENC_HW_PRESET_ULTRAFAST).

But the results are the same.

DaneLLL · September 21, 2020, 7:38am

Hi,
The tables in the Xavier module data sheet are for decoding only and encoding only. For decoding + encoding, it may not reach 32 instances. So have you tried decoding + encoding without VIC, and can it run up to 32 instances?

biglofty · September 21, 2020, 7:53am

Hi DaneLLL,

I have tried decoding + encoding without NvBufferComposite, and decoding + NvBufferComposite without encoding; the results are better in both cases.

I also tried running 32 encoding streams in one process and 30 streams of NvBufferComposite in another process, and the performance dropped.

DaneLLL · September 22, 2020, 1:44am

Hi,
We don’t have an existing sample to check this and need your help in sharing information. So you can run 32 instances of decoding + encoding and of decoding + NvBufferComposite, but cannot achieve 32 instances of decoding + NvBufferComposite + encoding?

Generally NvBufferComposite() is called to composite frames from each source into one video frame. Is this your use case? Or do you use it in a different way?

biglofty · September 22, 2020, 2:42am

Hi DaneLLL,

Yes, your understanding is correct.

Our complete use case is that in every instance, we want to run decoding + resampling/conversion + composite + (some other image processing) + encoding. We want to run as many instances as possible, and we expect every instance to hold a stable 30 fps. NvBufferComposite() is used to composite different frames into one frame.
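One way to make "stable 30 fps" measurable per instance is to log a timestamp for every output frame and summarize the intervals offline. The following is a minimal sketch, not part of any NVIDIA API: `frame_interval_stats`, its `tolerance` parameter, and the synthetic timestamps are all hypothetical and only illustrate the idea.

```python
from statistics import mean


def frame_interval_stats(timestamps_s, target_fps=30.0, tolerance=0.5):
    """Summarize frame pacing from per-frame timestamps given in seconds.

    A frame counts as late when its interval exceeds the nominal period
    by more than `tolerance` (0.5 means 50% over the nominal period).
    """
    if len(timestamps_s) < 2:
        raise ValueError("need at least two timestamps")
    intervals = [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]
    period = 1.0 / target_fps
    late = sum(1 for dt in intervals if dt > period * (1.0 + tolerance))
    return {
        "mean_fps": 1.0 / mean(intervals),
        "worst_interval_ms": max(intervals) * 1000.0,
        "late_frames": late,
    }


# Synthetic data: steady 30 fps pacing with a single 100 ms stall.
ts = [i / 30.0 for i in range(10)]
ts += [ts[-1] + 0.1 + i / 30.0 for i in range(5)]
stats = frame_interval_stats(ts)
print(stats)
```

Comparing the worst interval and the late-frame count across configurations (with and without composition, for example) tends to show contention more clearly than the average frame rate alone.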

When I ran all the components together, I didn’t get a good, stable result compared to the datasheet. So I removed one or more of the components to check which one has the most effect, which led to my question.

I even tried running 32 instances of encoding only in one process (each instance in a separate thread) and 30 instances of NvBufferComposite() only in another process simultaneously. The performance dropped.

So I wonder if there might be some internal resource/buffer/dmabuf management shared by both of them? This is just my guess.

Any advice to improve the performance is highly appreciated. Thanks a lot.

DaneLLL · September 22, 2020, 6:28am

Hi,
Please execute sudo nvpmodel -m 0 and sudo jetson_clocks and try again. This is the MAXN mode listed in the development guide.

There might also be an improvement if all buffers are in block linear format. If you send pitch linear buffers to the encoder, please try sending block linear buffers instead.

biglofty · September 22, 2020, 1:23pm

Hi DaneLLL,

I have already set MAXN mode.

I tried sending block linear buffers to the encoder. The results were the same. I wonder if there is a conflict between the encoder and NvBufferComposite, or whether I am using it incorrectly somewhere.

DaneLLL · September 23, 2020, 6:30am

Hi,
So the use case may be hitting a system limitation. Please execute sudo tegrastats to check the system load. If there is still headroom on the GPU, you can try implementing the downscale/resampling/compositing functions in CUDA, so that the load shifts from the VIC to the GPU. This might bring some improvement.
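For checking the load over time, the tegrastats output can be captured and parsed. The sketch below assumes that engine load appears as tokens such as `GR3D_FREQ 12%@1377` or `VIC_FREQ 98%@1036`; the exact fields differ between L4T releases, so treat the pattern as an assumption and adjust it to what your board prints. The sample line is hypothetical, not captured from a real device, and `parse_engine_load` is an illustrative name.

```python
import re

# Matches tokens such as "GR3D_FREQ 45%" or "VIC_FREQ 99%@1036".
# Assumed format; verify against the tegrastats output on your release.
_LOAD_RE = re.compile(r"\b([A-Z0-9_]+_FREQ)\s+(\d+)%(?:@(\d+))?")


def parse_engine_load(line):
    """Return {engine: (load_percent, freq_or_None)} for one tegrastats line."""
    result = {}
    for name, load, freq in _LOAD_RE.findall(line):
        result[name] = (int(load), int(freq) if freq else None)
    return result


# Hypothetical sample line for illustration only.
sample = (
    "RAM 4096/31919MB CPU [35%@2265,40%@2265] "
    "EMC_FREQ 60%@2133 GR3D_FREQ 12%@1377 VIC_FREQ 98%@1036"
)
loads = parse_engine_load(sample)
for engine, (load, freq) in loads.items():
    print(engine, load, freq)
```

If the VIC entry stays near 100% while the GPU entry stays low, that would support moving the scaling and compositing work to CUDA as suggested above.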

biglofty · September 23, 2020, 9:12am

Hi DaneLLL,

Thanks for the advice. We are actually evaluating possible ways to run the largest number of stable instances. As for this problem, I simply want to understand the cause so that we can make better use of the encoder and VIC composition.