NvBuffer sharing between processes without copying buffer

I’ve been testing the NvBuffer sharing code from the topic “How to share the buffer in process context?” and found that copying the NvBuffer with NvBufferTransformEx is quite time-consuming: about 3 ms per 3840x2160 YUV image. In my case I have multiple cameras and multiple consumers, and NvBufferTransformEx does not seem to be efficient in the multi-threaded case either. Can the NvBuffer copy on the consumer side be avoided? Thanks!

Hi,
NvBufferTransformEx() runs on the hardware converter and should have much better throughput than copying through the CPU. Please run the engine at maximum clock and check whether there is an improvement:
Nvvideoconvert issue, nvvideoconvert in DS4 is better than Ds5? - #3 by DaneLLL

I’ve set the frequency to the maximum, which is “729600000”, but it still costs around 3 ms per image. With two consumers it becomes more than 6 ms on average. It seems that NvBufferTransformEx() calls can only be executed serially.
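
Per-call figures like the ones above can be collected with a small timing helper. The following is a minimal sketch, not code from this thread: `average_ms` is a hypothetical helper name, and the callable passed to it would wrap whatever is being measured (for example a call to transform_nvbuf). It only uses the C++ standard library.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs `op` `iterations` times and returns the
// average wall-clock time per call in milliseconds.
double average_ms(const std::function<void()>& op, int iterations) {
  assert(iterations > 0);
  using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
  const auto start = clock::now();
  for (int i = 0; i < iterations; ++i) {
    op();
  }
  const std::chrono::duration<double, std::milli> elapsed =
      clock::now() - start;
  return elapsed.count() / iterations;
}
```

Running this helper once in a single consumer and once in each of two concurrent consumers would show whether the per-call time stays flat (parallel execution) or roughly doubles (serialized execution).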

Hi,
If you call NvBufferTransformEx() in parallel, please create an NvBufferSession for each operation. Please refer to
Question on NvBufferSession - #3 by DaneLLL

I created a session as follows and called transform_nvbuf in parallel using std::thread, but saw no change in performance:

static int transform_nvbuf(int src_fd, NvbufParam *src_buf_par, int dst_fd) {
   NvBufferParams src_params = src_buf_par->params;
   NvBufferParamsEx src_paramsEx = src_buf_par->paramsEx;
   NvBufferParams dst_params;
   NvBufferParamsEx dst_paramsEx;
   NvBufferGetParams(dst_fd, &dst_params);
   NvBufferGetParamsEx(dst_fd, &dst_paramsEx);
+  NvBufferSession session = NvBufferSessionCreate();
+  if (session == NULL) {
+    printf("Session create failed\n");
+    return -1;
+  }
 
   NvBufferTransformParams trans_params;
   memset(&trans_params, 0, sizeof(trans_params));
+  trans_params.session = session;
   trans_params.src_rect = {0, 0, src_params.width[0], src_params.height[0]};
   trans_params.dst_rect = {0, 0, dst_params.width[0], dst_params.height[0]};
 
   int rc = NvBufferTransformEx(src_fd, &src_paramsEx, dst_fd, &dst_paramsEx,
                                &trans_params);
+  NvBufferSessionDestroy(session);
   if (rc != 0) {
     printf("NvBufferTransformEx failed!\n");
     return -1;
   }
   return 0;
 }

Hi,
The code creates and destroys an NvBufferSession on every call. Please create it once and reuse it. If multiple threads call the function, please create a separate NvBufferSession in each thread.
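
The advice above (one session per thread, created once and reused for every frame) can be sketched as follows. This is only an illustration of the ownership pattern under stated assumptions: `FakeSession`, `fake_session_create`, `fake_session_destroy`, `SessionGuard`, `transform_frame` and `run_workers` are hypothetical names, and the fake session functions are stand-ins so the sketch compiles without nvbuf_utils.h. On a Jetson they would be replaced by NvBufferSession, NvBufferSessionCreate() and NvBufferSessionDestroy(), and the per-frame function would set trans_params.session and call NvBufferTransformEx().

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Stand-ins (assumptions) for the nvbuf_utils.h session API.
using FakeSession = int*;
static std::atomic<int> g_sessions_created{0};
static std::atomic<int> g_sessions_destroyed{0};

static FakeSession fake_session_create() {
  ++g_sessions_created;
  return new int(0);
}

static void fake_session_destroy(FakeSession s) {
  ++g_sessions_destroyed;
  delete s;
}

// RAII owner: creates the session once, destroys it when the owner
// goes out of scope.
class SessionGuard {
 public:
  SessionGuard() : session_(fake_session_create()) {}
  ~SessionGuard() { fake_session_destroy(session_); }
  SessionGuard(const SessionGuard&) = delete;
  SessionGuard& operator=(const SessionGuard&) = delete;
  FakeSession get() const { return session_; }

 private:
  FakeSession session_;
};

// Hypothetical per-frame work. Real code would assign the session to
// trans_params.session and call NvBufferTransformEx() here.
static int transform_frame(FakeSession session) {
  assert(session != nullptr);
  ++(*session);  // counts frames handled with this session
  return 0;
}

// Each consumer thread owns exactly one session and reuses it per frame.
static void consumer_thread(int frames) {
  SessionGuard guard;
  for (int i = 0; i < frames; ++i) {
    transform_frame(guard.get());
  }
}

// Starts n_threads consumers and returns how many sessions were created.
int run_workers(int n_threads, int frames_per_thread) {
  g_sessions_created = 0;
  g_sessions_destroyed = 0;
  std::vector<std::thread> workers;
  for (int i = 0; i < n_threads; ++i) {
    workers.emplace_back(consumer_thread, frames_per_thread);
  }
  for (auto& t : workers) {
    t.join();
  }
  return g_sessions_created.load();
}
```

With this layout the number of sessions created equals the number of threads, regardless of how many frames each thread processes.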

When reusing one session in a single thread, performance improved by about 15%. Is this expected? On the other hand, when running two processes (two consumers), the hardware converter still seems to copy one buffer after another, and the time consumed is simply double that of one process.

Hi,
This looks to be optimal for this use case on TX2. Since the buffers are passed to another process through the hardware converter, it should perform better than using the CPU.

Actually, my question is why we need to copy the buffer at all. Why can’t we use the received fd directly on the consumer side? The copy seems pointless to me, since the producer already owns an identical buffer. Or is there a way or an API, such as VPI, that lets me use the fd shared by the producer directly?

Hi,
This is a constraint in the current implementation. Since the resolution is 4K, it is possible to hit a bottleneck if there are multiple sources.

Hi,
We have checked with our teams, and there is a new function in JetPack 4.6.3 (r32.7.3) called NvBufferImportFd(). If you can upgrade to that release, please check the definition in

/usr/src/jetson_multimedia_api/include/nvbuf_utils.h

and give it a try.

I’m using the following version, which should be newer than JetPack 4.6.3 (r32.7.3), but I couldn’t find the NvBufferImportFd() API. Was it removed in a later release?

NVIDIA Jetson AGX Orin
L4T 35.1.0 [ JetPack 5.0.2 ]
Ubuntu 20.04.4 LTS
Kernel Version: 5.10.104-tegra
CUDA 11.4.239
CUDA Architecture: NONE
OpenCV version: 4.5.4
OpenCV Cuda: NO
CUDNN: 8.4.1.50
TensorRT: 8.4.1.5
Vision Works: NOT_INSTALLED
VPI: 2.1.6
Vulcan: 1.3.203

Hi,
The topic is in the Jetson TX2 category, so we thought you were using TX2/JetPack 4. This use case is not yet supported on Orin/JetPack 5, since we have deprecated the NvBuffer APIs on JetPack 5 and replaced them with the NvBufSurface APIs. Sharing NvBufSurface between processes is not supported on JetPack 5.0.2. It is under development and planned to be enabled in a future release.

It seems you can run the NvBuffer APIs on JetPack 5. However, this is not tested and not recommended. It may not work properly in certain cases.

OK, thanks a lot. I hope to see this feature in a future release soon :)
I will close this topic for now.
