NVMM and Gstreamer

I have two questions regarding NVMM memory:

First of all, what is NVMM, exactly, in technical terms? What does copying to/from normal memory to NVMM memory and back involve? I am particularly interested in whether copies involve a bus or are normal memory-to-memory copies. Also, what is its relation to CUDA memory? Are they the same things? If there is a reference describing the internals of TX2 architecture, describing how different subsystems communicate, it would greatly help me in understanding performance issues.

Second, is it possible for me to write a Gstreamer element that outputs directly into NVMM memory? If yes, how so? Is there a sample code available?

Hi,
NVMM buffers are DMA buffers. DMA buffers can be transferred between HW components.
In tegra_multimedia_api they are exposed as NvBuffer. You can access them via the APIs defined in nvbuf_utils.h
Please install the samples via Jetpack and refer to
https://developer.nvidia.com/embedded/dlc/l4t-multimedia-api-reference-28-2-ga

By DMA buffers you mean hardware memory mapped into our address space? Or normal memory, made available to hardware for DMA? Is there a bus involved?

I’ll be sure to take a look at those examples. Since you didn’t say it explicitly, are you confirming that it is possible to write Gstreamer plugins that can work with NVMM memory and can interface with Nvidia elements like nvvidconv, etc?

None of the examples seem related to Gstreamer however. Do you know of any samples about interfacing with Gstreamer?

Another thing I forgot was CUDA. Is there a way to cheaply send data from CUDA to NVMM? My use-case involves some processing best done in CUDA. Copying data from CUDA back to normal memory is rather expensive since they are uncompressed RGB frames. Instead, I’d rather find a way to send them directly to the hardware encoder and copy data back after it’s encoded.

Hi,
In gstreamer, you can access CUDA via nvivafilter. Two posts for your reference:
https://devtalk.nvidia.com/default/topic/963123/jetson-tx1/video-mapping-on-jetson-tx1/post/4979740/#4979740
https://devtalk.nvidia.com/default/topic/978438/jetson-tx1/optimizing-access-to-image-data-acquired-with-nvcamerasrc/post/5026998/#5026998

Please also try tegra_multimedia_api. You may refer to the sample below:

tegra_multimedia_api\samples\03_video_cuda_enc

Let me explain my exact situation, so that you know why not all this has helped me yet. I have a USB 3.0 camera that outputs 4K data at around 30 fps. We have a GStreamer pipeline that is already working with another CSI-2 camera that we intend to replace with the USB camera.

Now the USB camera outputs bayer data, so we need to debayer first. nvivafilter cannot do the trick, since it does not accept bayer format. I tried de-bayering using CUDA first (with help from Nvidia NPP library). That works but not at the frame rate we need. I profiled the code and realized that copying bayer data to CUDA memory takes about 12ms per frame, which is acceptable. But then we need to copy RGB data back to memory which takes around 40ms. This obviously makes it impossible to achieve 30 fps.

So the only logical solution remaining is to get rid of some of those memcpy’s. If we could somehow pass the CUDA buffers directly to the hardware encoder in the pipeline, this could work. Otherwise, all the copies make the whole thing impossible.
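To make the arithmetic behind this explicit, here is a minimal sketch of the frame-time budget using the two copy timings I measured. It assumes the two copies run one after the other and do not overlap with each other or with the processing; the function names are only for illustration.

```python
# Frame-time budget check using the copy timings reported above.
TARGET_FPS = 30
COPY_TO_CUDA_MS = 12.0    # host -> CUDA copy of the Bayer frame (measured)
COPY_FROM_CUDA_MS = 40.0  # CUDA -> host copy of the RGB frame (measured)


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at the given frame rate."""
    return 1000.0 / fps


def max_fps(per_frame_ms):
    """Highest frame rate achievable if each frame costs per_frame_ms."""
    return 1000.0 / per_frame_ms


budget = frame_budget_ms(TARGET_FPS)
# Assumption: the copies are serialized, so their costs simply add up.
copy_cost = COPY_TO_CUDA_MS + COPY_FROM_CUDA_MS

print(f"budget per frame: {budget:.1f} ms")
print(f"copy cost per frame: {copy_cost:.1f} ms")
print(f"upper bound from copies alone: {max_fps(copy_cost):.1f} fps")
```

The copy back from CUDA alone (40 ms) already exceeds the roughly 33 ms available per frame at 30 fps, and both copies together cap the rate at about 19 fps before any debayering or encoding time is counted.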

I am looking at the sample code above, though I still haven’t seen anything GStreamer related. I am also looking at the gst-omx and gst-jpeg plugins source code but haven’t been able to find anything yet.

Hi,
The gstreamer implementation may not be able to demonstrate the case. Please try tegra_multimedia_api.

You can create NvBuffer in RGBA and put de-bayered data into the buffer via CUDA, convert it to YUV420 via NvBuffer APIs, and send into NvVideoEncoder to get h264 stream.
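For a rough sense of how much data each stage of that flow handles per frame, the sizes can be sketched as below. This is only an estimate under stated assumptions: a 3840x2160 frame and 8-bit Bayer samples are assumed, since neither is given exactly in this thread, and real NvBuffer allocations may add pitch or alignment padding that is ignored here.

```python
WIDTH, HEIGHT = 3840, 2160  # assumed "4K" resolution (not stated exactly)

# Bytes per pixel at each stage. The 8-bit Bayer depth is an assumption.
BYTES_PER_PIXEL = {
    "bayer8": 1.0,  # one 8-bit sample per pixel
    "rgba": 4.0,    # 8 bits each for R, G, B and A (or unused x)
    "yuv420": 1.5,  # full-resolution Y plane plus quarter-resolution U and V
}


def frame_bytes(fmt, width=WIDTH, height=HEIGHT):
    """Unpadded size in bytes of one frame in the given format."""
    return int(width * height * BYTES_PER_PIXEL[fmt])


for fmt in ("bayer8", "rgba", "yuv420"):
    print(f"{fmt:>7}: {frame_bytes(fmt) / 1e6:.1f} MB per frame")
```

Under these assumptions the RGBA intermediate is the largest buffer in the chain, which is why keeping it out of host memory matters most.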

HW engines do not support the 24-bit RGB format, so you have to use 32-bit RGBA or BGRx.
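To illustrate the layout change this implies, here is a small host-side sketch that expands tightly packed 3-byte RGB pixels into 4-byte pixels with a filler fourth byte. It is not part of the Multimedia API; the function name is hypothetical, and in a real pipeline the CUDA kernel would presumably write the 4-byte layout straight into the NvBuffer.

```python
def rgb_to_rgbx(rgb, fill=0xFF):
    """Expand packed 3-byte RGB pixels to 4-byte RGBx pixels.

    `fill` is the value written into the fourth (alpha or unused) byte.
    """
    if len(rgb) % 3 != 0:
        raise ValueError("RGB buffer length must be a multiple of 3")
    pixels = len(rgb) // 3
    out = bytearray([fill]) * (pixels * 4)  # pre-fill every byte with `fill`
    out[0::4] = rgb[0::3]  # R channel
    out[1::4] = rgb[1::3]  # G channel
    out[2::4] = rgb[2::3]  # B channel
    return bytes(out)


two_pixels = bytes([10, 20, 30, 40, 50, 60])
print(list(rgb_to_rgbx(two_pixels)))
```

For BGRx the same idea applies with the R and B source channels swapped.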

You may refer to below samples and adapt to your case:

tegra_multimedia_api\samples\12_camera_v4l2_cuda

https://devtalk.nvidia.com/default/topic/1031967/jetson-tx2/tegra_multimedia_api-dq-buffer-from-encoder-output_plane-can-not-completed/post/5251268/#5251268

This sounds promising, though it inspires more questions!

  1. The addLabels function, called by HandleEGLImage, is where the real CUDA processing happens and needs to be replaced by the functionality I have in mind. Am I right?

  2. If we can indeed create NVMM buffers, isn’t there a way to pass them out from a GStreamer element? That is obviously possible, as nvidia plugins do it, but can’t we also do it? Perhaps by something like memory-mapping the NvBuffer, wrapping it in a GstBuffer and pushing it out?

Yes, the code is at tegra_multimedia_api\samples\common\algorithm\cuda

It is not supported. The solution is to send NvBuffers to NvVideoencoder and get h264 stream. The h264 stream can be wrapped in GstBuffer to send to gstreamer element.

Looks like a reasonable alternative. I’ll be sure to try it. I’m trying to test it, although I can’t run that sample program directly (no v4l2 compatible camera), so I’m trying to build a derivative program from it that I can run. The cuGraphicsEGLRegisterImage function (called by Handle_EGLImage) is returning CUDA_ERROR_INVALID_VALUE which is not even documented as a possible error. Do you have any idea what that could be about? I’ve checked the arguments in gdb and they seem to have reasonable values.

In any case, I’ll report back here if this works as I think it should.

Hi,
You may connect a USB camera and run 12_camera_v4l2_cuda first, to ensure it runs fine before adaptation.

The call flow is to create the NvBuffer first, and then use its fd to call NvEGLImageFromFd().

I managed to find a webcam and try the original sample. I believe I found the source of the error: I was using NvBufferColorFormat_XRGB32 as the color format of the output NvBuffer, which for some reason doesn’t work. Changing the color format to NvBufferColorFormat_ARGB32 fixed that particular issue. This could have been better documented, however, since the docs for cuGraphicsEGLRegisterImage do not even mention that such an error code is a possibility.

Anyway, I haven’t finished my tests, but I believe this can be done as you’ve described. I’m not sure which of your answers I should “accept” since they helped answer my question as a whole. The idea that I can wrap the output of the encoder in a GstBuffer and push it out of my GStreamer element was particularly helpful. Thanks for all the help.

Hi all,

I am also interested in understanding NVMM, as @elektito asked at the beginning of this thread (the thread then drifted to another subject without really addressing this one).

I am working on capturing from a CSI camera with gstreamer and retrieving the frames into an OpenCV app using “appsink”. Below is a very basic pipeline which serves as a prototype for my tests. It is inspired by info found here and there on this very forum.

gst-launch-1.0 nvcamerasrc sensor-id=0 ! 'video/x-raw(memory:NVMM), width=(int)3840, height=(int)2160, format=(string)I420, framerate=(fraction)15/1' ! nvvidconv ! 'video/x-raw, format=(string)BGRx' ! videoconvert ! 'video/x-raw, format=(string)BGR' ! appsink
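For use from an application rather than gst-launch, the same pipeline can be assembled as a plain description string. The sketch below only builds the string; the helper name and its parameters are made up for illustration, and opening it from OpenCV (shown in the trailing comment) assumes an OpenCV build with GStreamer support.

```python
def build_capture_pipeline(sensor_id=0, width=3840, height=2160, fps=15):
    """Return a GStreamer pipeline description that ends in appsink.

    No shell quoting is needed here because the string is not passed
    through a shell.
    """
    return (
        f"nvcamerasrc sensor-id={sensor_id} ! "
        f"video/x-raw(memory:NVMM), width=(int){width}, height=(int){height}, "
        f"format=(string)I420, framerate=(fraction){fps}/1 ! "
        "nvvidconv ! video/x-raw, format=(string)BGRx ! "
        "videoconvert ! video/x-raw, format=(string)BGR ! "
        "appsink"
    )


pipeline = build_capture_pipeline()
print(pipeline)

# Assumption: with OpenCV built against GStreamer, the string could be
# opened on the Jetson with something like:
#   cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
```

Keeping the caps in one place makes it easy to try a lower resolution or frame rate when measuring where the copy cost goes.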

I would like to rekindle what was asked by @elektito ie:

  • when I transfer data from NVMM to CPU memory (ie. at “nvvidconv”), does it actually perform a copy?
  • when working into/from NVMM, what bus is involved?
  • is it hardware memory or memory made available to hardware components?

My objective is to avoid as much mem copy as possible and anticipate future bus/IPS/VI access conflicts between this kind of pipeline and future downstream process such as CUDA.

Thanks in advance for any info/pointers to info you might share!

VIDIOC_S_FMT: failed: Device or resource busy /dev/video0 --stream-to=ov5693.raw
VIDIOC_REQBUFS: failed: Device or resource busy

how to solve…

Hi tejaswinig, please make a new post with more information, such as your sensor type or brand. Is it a USB camera, YUV sensor, or Bayer sensor? E-con or Leopard?

Hi videlo,
We have several posts about OpenCV + gstreamer/tegra_multimedia_api. Please check:
https://devtalk.nvidia.com/default/topic/1024245/jetson-tx2/opencv-3-3-and-integrated-camera-problems-/post/5210735/#5210735
https://devtalk.nvidia.com/default/topic/1047563/jetson-tx2/libargus-eglstream-to-nvivafilter/post/5319890/#5319890
https://devtalk.nvidia.com/default/topic/1037863/jetson-tx2/argus-and-opencv/post/5273400/#5273400

If your video processing can be done in cv::gpuMat, it can be done with zero memcpy.

Hi videlo,

Let me try to answer your questions with what I have found out so far.

As far as I have seen, transferring buffers from NVMM memory to normal memory always involves a copy, even when copying to CUDA memory. This I surmise from the performance of the operations I’ve observed. Also copying to and from CUDA memory again involves copying so I haven’t seen anything that looks like zero-copy between any pairings of normal memory, NVMM and CUDA memory, at least using the gstreamer API.

NVMM memory is, as DaneLLL answered before, a set of DMA buffers. As far as I can tell, it’s just normal memory mapped to be usable by hardware encoders, decoders and converters. This should mean that the copies go through the memory bus, with no extra overhead involved. Funny thing is, the same is true for CUDA memory (as the TX2 has no dedicated GPU memory), but copying to and from CUDA memory is still more costly than copying normal memory. I haven’t found out why.

All of the above is from experience, so I might be wrong on some points. Take it with a pinch of salt!

In your case, I would suggest taking a look at Tegra Multimedia API and Argus. It would allow you to map NVMM memory and directly access it. I haven’t used it to receive data from camera (my use case involved a non-NVMM-enabled camera and the hardware encoder) but as far as I have seen it is possible. Take a look at the 09_camera_jpeg_capture example in Tegra Multimedia API. It shows how to acquire frames from the camera. After that, you can map the memory with NvBufferMemMap and just use it for whatever purpose you have in mind.

Hi,

DaneLLL,
Thank you for these pointers. I will check them out. I was focusing on a GStreamer acquisition solution, which is why I did not pay much attention to the tegra multimedia api until now.

elektito,
Thank you very much for these insights!

Is it not curious that we must perform a copy if the memory is just mapped? Should the memory not be shared and accessible by everyone? (Or is this what you call “mapping NVMM memory and directly accessing it”? I have not checked yet.)

Typically, I would like to retrieve data from “appsink” directly into the NVMM memory. As far as I have tested, this does not work. Using a gstreamer pipeline in my case was motivated by the potential gain in coding time from not using the API. Are you telling me that the API is more flexible/able/complete than the gstreamer nvidia proprietary elements (such as nvcamerasrc, nvvidconv…) alone?

I believe the copy, in the case of NVMM, is a byproduct of how nvvidconv works. It’s expected to make a copy from NVMM to normal memory. The NVMM buffers, I assume, cannot be kept indefinitely as there are a limited number of them. Why CUDA memory also involves copies, I cannot explain, but that’s not what you were looking for.

About flexibility, the Multimedia APIs definitely give you more flexibility than the gstreamer APIs. They are not as convenient but you might not have a choice depending on your use case. One bonus point I received from using the MM APIs was that I got a lot more insights about how things work under the hood of the gstreamer elements I had been using.