NVMM memory in custom GStreamer plugin

Hi,

We are writing a custom GStreamer plugin and want to reuse NVIDIA NVMM memory (“video/x-raw(memory:NVMM)”) to avoid copying frame buffers.

Our plugin fails to link with NVIDIA’s “nvvidconv” GStreamer plugin and reports this error:
Error: unable to link nvvconv0 with aggregatedoverlay0

even though the capabilities of our plugin look correct:

SINK template: 'sink_%u'
  Availability: On request
  Has request_new_pad() function: gst_videoaggregator_request_new_pad
  Capabilities:
    video/x-raw(memory:NVMM)
      format: { BGRx }
      width: [ 1, 2147483647 ]
      height: [ 1, 2147483647 ]
      framerate: [ 0/1, 2147483647/1 ]

Do you have a code sample or advice on how to link modules that use NVMM memory?
If NVMM is a closed format used only in NVIDIA modules, is there another way to access nvvidconv results without a memory copy?

Please check nvivafilter in the GStreamer user guide.

That still doesn’t help much, especially since our GStreamer plugin has two inputs and one output.

This looks the same as https://devtalk.nvidia.com/default/topic/1019224/jetson-tx1/how-to-pass-data-between-gstreamer-and-cuda-without-memory-copying-/post/5192702/#5192702

Please try tegra_multimedia_api

I will try to explain in more detail.

NVIDIA has a lot of good GStreamer plugins optimized for the GPU.
We also have our own GStreamer plugins for image stabilization, sensor fusion (fusing a daylight camera with a night-vision camera), neural-network-based object detection, etc. All our modules are also optimized for the GPU, and they run very fast.

The problem is how to use them together without copying memory. Right now we have to copy memory at every boundary between plugins, which cancels out the advantages of the NVIDIA GStreamer plugins. At the input of an NVIDIA plugin, CPU memory is copied to CUDA memory, processed, and then copied back to the CPU. After that we do the same thing ourselves: we take the output of the NVIDIA GStreamer plugin (CPU memory), allocate CUDA managed memory, and copy the frame into it.
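To put a rough scale on the copies described above, here is a minimal sketch of the arithmetic. The frame size, frame rate, and copy count are illustrative assumptions (1920x1080, BGRx at 4 bytes per pixel, 30 fps, and the three copies listed above: CPU to CUDA, CUDA back to CPU, CPU to managed memory); `copy_bytes_per_second` is a hypothetical helper, not part of any NVIDIA or GStreamer API.

```cpp
#include <cstdint>

// Bytes moved per second when every frame is copied `copies_per_frame` times.
// All parameters are plain counts; nothing here touches real buffers.
constexpr std::uint64_t copy_bytes_per_second(std::uint64_t width,
                                              std::uint64_t height,
                                              std::uint64_t bytes_per_pixel,
                                              std::uint64_t fps,
                                              std::uint64_t copies_per_frame) {
    return width * height * bytes_per_pixel * fps * copies_per_frame;
}

// Assumed example: 1920x1080 BGRx at 30 fps with three copies per frame.
// 1920 * 1080 * 4 = 8,294,400 bytes per frame; times 30 fps and 3 copies.
static_assert(copy_bytes_per_second(1920, 1080, 4, 30, 3) == 746496000ULL,
              "about 746 MB/s of pure copy traffic for one stream");
```

Under these assumptions a single 1080p stream already costs roughly 746 MB/s of memory traffic that does no useful work, and the figure scales linearly with the number of inputs.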

These questions remain open:

  1. Do you have a code sample or advice on how to link modules that use NVMM memory?

  2. If NVMM is a closed format used only in NVIDIA modules, is there another way to access the results of nvvidconv (or of other NVIDIA GStreamer plugins) without a memory copy? How can we access the result in managed or CUDA memory?

Hi,
Can you share the data flow of your use case?

We have nvivafilter, which handles one input and one output. Here is a use case:
https://devtalk.nvidia.com/default/topic/978438/jetson-tx1/optimizing-access-to-image-data-acquired-with-nvcamerasrc/post/5026998/#5026998

But you mentioned two inputs and one output. Are you performing camera frame stitching with CUDA? In that case, you have to combine Argus, CUDA, and GStreamer.
We have a sample that combines Argus and GStreamer: tegra_multimedia_api/argus/samples/gstVideoEncode

Hi Dane,

I have a use case very similar to the original poster’s.

I’m trying to write a custom GStreamer element that uses CUDA to stitch multiple frames. The pipeline looks like this:

v4l2src -\
v4l2src --> my_element -> nvenc -> …
v4l2src -/

Ideally, the CUDA kernel would read the captured frames directly, avoiding all host-to-device (H2D) and device-to-host (D2H) memcpy calls. Could you share your thoughts on how to achieve this? Thank you!

Hi xliu, we suggest using the MM APIs; please refer to
/tegra_multimedia_api/samples/12_camera_v4l2_cuda

Hi Dane,

Reading this sample code, I found HandleEGLImage being called in conv_capture_dqbuf_thread_callback. This seems to be where the CUDA kernel touches the frame data directly from capture.

However, because the stitching algorithm is N-in-1-out, I cannot put it inside this callback. I need to pass the pointer out to a single thread that has access to all the input buffers. Do you see any problem with using the frame pointer from outside the converters’ dequeue callback thread?
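The hand-off I have in mind could be sketched roughly as below. This is only an illustration of the threading pattern, not code from the sample: `FrameHandle` and `FrameSetCollector` are hypothetical names, and the handle is a stand-in for whatever the capture path really provides. It also assumes the buffer stays valid until the stitcher thread is done with it, which is exactly the part I am unsure about.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical stand-in for one captured frame (in a real pipeline this
// would reference the actual buffer, not just carry two counters).
struct FrameHandle {
    std::size_t source;  // index of the input that produced the frame
    int sequence;        // per-input frame counter
};

// Collects frames from N producer callbacks and hands complete sets
// (one frame per input) to a single consumer thread.
class FrameSetCollector {
public:
    explicit FrameSetCollector(std::size_t inputs) : queues_(inputs) {}

    // Called from each input's dequeue callback: only enqueues the handle.
    void push(const FrameHandle& frame) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queues_.at(frame.source).push_back(frame);  // throws on bad index
        }
        ready_.notify_one();
    }

    // Called from the stitcher thread: blocks until every input has at least
    // one frame, then pops and returns the oldest frame of each input.
    std::vector<FrameHandle> wait_for_set() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return all_inputs_ready(); });
        std::vector<FrameHandle> set;
        set.reserve(queues_.size());
        for (auto& queue : queues_) {
            set.push_back(queue.front());
            queue.pop_front();
        }
        return set;
    }

private:
    bool all_inputs_ready() const {
        for (const auto& queue : queues_) {
            if (queue.empty()) return false;
        }
        return true;
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::vector<std::deque<FrameHandle>> queues_;
};
```

Each dequeue callback would call `push()` and return immediately, while the single stitcher thread loops on `wait_for_set()` and launches the kernel on the returned set. The sketch does not match frames by timestamp and does not bound the queues, so a real element would need both.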

Moreover, I profiled this sample on a TX2 with nvprof:

==8861== NVPROF is profiling process 8861, command: ./camera_v4l2_cuda -d /dev/video1 -s 1920x1080 -f YUV420 -c
==8861== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
^CQuit due to exit command from user!
Quit due to exit command from user!
----------- Element = renderer0 -----------
Total Profiling time = 15.3871
Average FPS = 30.0252
Total units processed = 463
Num. of late units = 63
-------------------------------------
App run was successful
==8861== Profiling application: ./camera_v4l2_cuda -d /dev/video1 -s 1920x1080 -f YUV420 -c
==8861== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  4.7186ms       468  10.082us  3.4570us  11.718us  addLabelsKernel(int*, int)

==8861== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 34.98%  1.12203s       468  2.3975ms  347.79us  49.046ms  cuGraphicsUnregisterResource
 27.57%  884.09ms       468  1.8891ms  275.43us  34.506ms  cudaLaunch
 15.19%  487.09ms       468  1.0408ms  93.390us  22.706ms  cuGraphicsEGLRegisterImage
 14.83%  475.77ms       936  508.30us  24.588us  16.841ms  cuCtxSynchronize
  7.10%  227.75ms       468  486.64us  3.6500us  198.53ms  cudaFree
  0.12%  3.7150ms       468  7.9380us  1.3130us  291.31us  cudaConfigureCall
  0.11%  3.6158ms       936  3.8620us     640ns  131.27us  cudaSetupArgument
  0.10%  3.0881ms       468  6.5980us  1.2810us  224.62us  cuEGLStreamProducerPresentDevicePtr
  0.00%  92.204us        91  1.0130us     384ns  28.974us  cuDeviceGetAttribute
  0.00%  5.3470us         3  1.7820us     672ns  2.4020us  cuDeviceGetCount
  0.00%  4.9950us         1  4.9950us  4.9950us  4.9950us  cuDeviceTotalMem
  0.00%  3.1700us         3  1.0560us     608ns  1.8570us  cuDeviceGet
  0.00%  1.9210us         1  1.9210us  1.9210us  1.9210us  cuDeviceGetName

cuGraphicsUnregisterResource and cuGraphicsEGLRegisterImage still take a considerable share of the execution time (34.98% and 15.19% of the API time, respectively). Is this really zero-copy, or is it still doing a memcpy under the hood?

Hi xliu,

That should be no problem. You can refer to conv_capture_dqbuf_thread_callback() in
/tegra_multimedia_api/samplesbackend/v4l2_backend_main.cpp

It puts frames into a queue for rendering.

We have a synchronization mechanism. It performs better than a pure memcpy.

Hi xliu and mi_pixevia,

You may find the following information about the GstCUDA framework interesting; I think it is exactly what you are looking for.

GstCUDA is a RidgeRun developed GStreamer plug-in enabling easy CUDA algorithm integration into GStreamer pipelines. GstCUDA offers a framework that allows users to develop custom GStreamer elements that execute any CUDA algorithm. The GstCUDA framework is a series of base classes abstracting the complexity of both CUDA and GStreamer. With GstCUDA, developers avoid writing elements from scratch, allowing the developer to focus on the algorithm logic, thus accelerating time to market.

GstCUDA offers a GStreamer plugin that contains a set of elements ideal for quick GStreamer/CUDA prototyping. These elements are a set of filters with different input/output pad combinations. Each is loaded at run time with an external custom CUDA library that contains the algorithm to be executed on the GPU for each video frame that passes through the pipeline. The GstCUDA plugin lets users develop their own CUDA processing library and pass it to the GstCUDA filter element that best fits the algorithm’s requirements. The element executes the library on the GPU, passing upstream frames from the GStreamer pipeline to the GPU and passing the modified frames downstream to the next element in the pipeline. These elements were created with the CUDA algorithm developer in mind, supporting quick prototyping and abstracting all GStreamer concepts. The elements are fully adaptable to different project needs, making GstCUDA a powerful tool for CUDA/GStreamer project development.

One notable feature of GstCUDA is that it provides a zero-memory-copy interface between CUDA and GStreamer on Jetson TX1/TX2 platforms. This allows heavy algorithms and large amounts of data (up to 2x 4K 60fps streams) to be processed on CUDA without the performance penalty caused by copies or memory conversions. GstCUDA provides the necessary APIs to handle NVMM buffers directly to achieve the best possible performance on Jetson TX1/TX2 platforms. It provides a series of base classes and utilities that abstract the complexity of handling the memory interface between GStreamer and CUDA, so the developer can focus on what actually gives value to the end product. GstCUDA ensures optimal performance for GStreamer/CUDA applications on Jetson platforms.

You can find detailed information about GstCUDA on the following link:
http://developer.ridgerun.com/wiki/index.php?title=GstCUDA

I hope this information can be useful to you.

Best regards,
-Daniel