I’m trying to do real-time video processing on the Jetson TX1. I’ve got a Magewell ProCapture HDMI (PCIe capture card) connected to the PCIe slot on the Jetson, feeding uncompressed 1920x1080 4:4:4 RGB frames @ approx. 60fps. The card is claimed to be SGDMA-capable and Magewell’s SDK supplies a function which supposedly transfers frames to physical addresses (MWCaptureFrameToPhysicalAddress). I’ve tried/profiled the following methods of transferring the frames to device memory (for processing the data in CUDA kernels):
a) Set the cudaDeviceMapHost flag. Allocate mapped memory on the host side (cudaHostAlloc(…, cudaHostAllocMapped)). Get the device pointer (cudaHostGetDevicePointer()). Use the Magewell API function (MWCaptureVideoFrameToVirtualAddressEx) to transfer the frame to this host memory location. So this is zero-copy (if I’m not wrong).
b) Use Magewell API function but without zero copy this time (malloc on host side, cudaMalloc on device side, use cudaMemcpy to transfer)
For c,d → the Magewell device shows up as a video input on V4L2 (/dev/videoX)
c) Malloc mapped memory on the host side (like (a)). Use the OpenCV VideoReader to read frames via the V4L2 interface into the mapped memory slot. So this is zero-copy (if I’m not wrong).
d) Malloc memory on the host side. Use the OpenCV VideoReader to read frames via the V4L2 interface into this host memory and then do cudaMemcpy to the cudaMalloc’ed device memory.
My question is: These are all methods that first write to host side memory and then either I transfer them to the device memory (via cudaMemcpy) or CUDA handles it when it’s zero-copy (I guess?). Is there a way to directly write these frames into device memory, bypassing the host side? I know this would be possible if I was using some GPUDirect-capable GPU but is there a similar option on the Jetson TX1 which would be faster than the above-mentioned methods (a-d)?
By the way the application does the following:
Take a 1920x1080 RGB frame (so approx. 6MB) into device memory (using one of the methods above)
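For reference, the "approx. 6MB" figure and the sustained data rate it implies at roughly 60fps can be checked with a few lines of arithmetic. This is a minimal sketch that assumes packed 8-bit RGB (3 bytes per pixel) with no row padding; the function names are illustrative and do not come from any SDK.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one packed frame: width * height * bytes per pixel.
// Assumes no row padding (pitch == width * bytes_per_pixel).
constexpr std::uint64_t frame_bytes(std::uint64_t width,
                                    std::uint64_t height,
                                    std::uint64_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

// Sustained payload rate in bytes per second at a given frame rate.
constexpr std::uint64_t stream_bytes_per_second(std::uint64_t bytes_per_frame,
                                                std::uint64_t fps) {
    return bytes_per_frame * fps;
}

// 1920x1080 at 3 bytes per pixel, 60 frames per second.
constexpr std::uint64_t kFrameBytes = frame_bytes(1920, 1080, 3);
constexpr std::uint64_t kStreamRate = stream_bytes_per_second(kFrameBytes, 60);

static_assert(kFrameBytes == 6220800, "one RGB frame is about 6.2 MB");
static_assert(kStreamRate == 373248000, "about 373 MB/s at 60 fps");
```

With a 32-bit pixel format (4 bytes per pixel) the same call gives 8,294,400 bytes per frame, so the choice of buffer format changes the copy cost by a third.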
I guess you’re referring to a snippet in a source under “/home/ubuntu/tegra_multimedia_api/samples”, is that correct? Could you point to a specific example under tegra_multimedia_api?
I’ve bumped into the following and think I have a grasp on the issue:
In Allen_Z’s post [1] he suggests the following:
AastaLLL confirms this is a good method:
Then dumbgeorge says he couldn’t decode the above conversation in his post [2] and WayneWWW says he needs to look at the “mmapi backend sample” and use the mapEGLImage2Float to map this into CUDA-accessible memory:
So from what I can understand from these, the procedure should be as follows:
During initialization:
Call “int NvBufferCreate (int *dmabuf_fd, int width, int height, NvBufferLayout layout, NvBufferColorFormat colorFormat);” to get a pointer (dmabuf_fd) to a memory buffer (physical memory location to which I can write from the third party PCIe capture card?).
Register the image to an “EGL display” that’s loaded onto this buffer using “NvEGLImageFromFd(EGLDisplay display, int *dmabuf_fd)” (I’m assuming there was a typo in that post when WayneWWW wrote “int dmabuf_fd” instead of “int *dmabuf_fd”)
Use mapEGLImage2Float(…) to register this image to a CUDA kernel accessible memory location (which, I assume, is this “void* cuda_buf”, which would be a device pointer?)
During runtime:
Use the third party PCIe capture card API to write into this physical memory location pointed to by “int *dmabuf_fd”
Use the device pointer specified by “void* cuda_buf” above to process the input image.
AastaLLL could you confirm the above procedure or correct me if it’s not OK?
I’m trying to initialize an NvBuffer using NvBufferCreate from the “nvbuf_utils” library as you’ve confirmed in step 1)
#define Mx 1024
#define My 1024
byte *data_mem;
int dmabuf_fd1 = 0;
int ret;
// int NvBufferCreate (int *dmabuf_fd, int width, int height, NvBufferLayout layout, NvBufferColorFormat colorFormat);
ret = NvBufferCreate(&dmabuf_fd1, (int) My, (int) Mx, NvBufferLayout_BlockLinear, NvBufferColorFormat_XRGB32);
EGLDisplay egl_display;
// Get default EGL display
egl_display = eglGetDisplay(EGL_DEFAULT_DISPLAY);
if (egl_display == EGL_NO_DISPLAY)
{
std::cout << "Error while get EGL display connection" << std::endl;
}
// Init EGL display connection
if (!eglInitialize(egl_display, NULL, NULL))
{
std::cout << "Error while initialize EGL display connection" << std::endl;
}
EGLImageKHR egl_image = NULL;
egl_image = NvEGLImageFromFd(egl_display, dmabuf_fd1);
if(egl_image == NULL)
{
std::cout << "NvEGLImageFromFd failed" << std::endl;
}
void *cuda_buf = ptr_d; // ptr_d is a device pointer I properly cudaMalloc beforehand
// map eglimage into GPU address
mapEGLImage2Float(&egl_image, Mx, My, (byte *)cuda_buf);
I need the start memory address of this initialized buffer in order to copy the input frame to this memory location using the PCIe capture card API. How can I get this memory address? (I guess this dmabuf_fd is just a file descriptor which somehow signifies the buffer, not the memory address like I stated in my previous post. Right?)
Also, it seems like the buffer isn’t getting created correctly because I’m getting the following output:
NvEGLImageFromFd: Failed to create EGLImage from dma-buf fd (1828717745)
NvEGLImageFromFd failed
cuGraphicsEGLRegisterImage failed: 999, cuda process stop
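On the question of file descriptor versus address: in general on Linux, a file descriptor is only a handle, and a CPU-visible address appears once the object behind it is mapped. The sketch below illustrates that with plain POSIX calls on a stand-in temporary file; it is not an NvBuffer, and map_fd and demo_fd_vs_address are hypothetical helper names. Whether a given NvBuffer fd can be mapped this way is an assumption to verify; if your nvbuf_utils.h declares NvBufferMemMap, that would be the library's own route. Note also that the result is a virtual address in this process, which is not the physical or bus address a PCIe DMA engine writes to.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Map `size` bytes of the object behind `fd` into this process.
// Returns a CPU virtual address, or nullptr on failure.
void* map_fd(int fd, std::size_t size) {
    void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return addr == MAP_FAILED ? nullptr : addr;
}

// Demo on a stand-in fd (an anonymous temporary file, NOT an NvBuffer):
// write one byte through the mapping, then read it back through the fd.
bool demo_fd_vs_address() {
    const std::size_t size = 4096;
    std::FILE* tmp = std::tmpfile();
    if (tmp == nullptr) return false;
    const int fd = fileno(tmp);  // the fd is just a handle, not an address

    bool ok = false;
    if (ftruncate(fd, static_cast<off_t>(size)) == 0) {
        void* addr = map_fd(fd, size);
        if (addr != nullptr) {
            static_cast<unsigned char*>(addr)[0] = 0xAB;
            msync(addr, size, MS_SYNC);  // flush so the fd-side read sees it
            unsigned char byte = 0;
            ok = pread(fd, &byte, 1, 0) == 1 && byte == 0xAB;
            munmap(addr, size);
        }
    }
    std::fclose(tmp);  // also closes fd
    return ok;
}
```

Calling demo_fd_vs_address() returns true when the byte written through the mapped address is visible through the descriptor, which is the sense in which the fd "signifies" the buffer without being its address.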
In your last post did you mean to say a) or b) (see both below)? Or something else that I didn’t catch? Could you please give more details?
If a) is true, does this mean there is no way to get frames coming from a PCIe capture card directly (not from host memory with cudaMemcpy or with zero copy) into CUDA kernel-accessible memory on a Jetson TX1?
If b) is true, where can I get this driver?
Assuming I’ve somehow (using your answer for question 1) made my input a CSI type input → I understand I need to use the createNvBuffer() from Argus library, which gives me a file descriptor (FD). Then I’ll pass this FD to NvEGLImageFromFd() and the EGL image from there to mapEGLImage2Float() to set up the buffer. I need the memory address of this buffer to copy data in there. How do I get the memory address of the buffer to which I need to feed the data? Is there a function in Argus that automatically does this copy by looking at the FD (do I actually not need this memory address)?
a) I need to get this TC358840, look at the post you linked, use their drivers etc. to set this chip up. Then use the createNvBuffer() function in the Argus library to get the buffer going using the CSI input, then NvEGLImageFromFd, then mapEGLImage2Float() → I have a buffer linked to CUDA-accessible memory.
b) I need to get a software driver from somewhere to trick the Jetson into routing frames coming into my PCIe input to a CSI buffer, and then use the createNvBuffer() function in the Argus library to get the buffer going, then NvEGLImageFromFd, then mapEGLImage2Float() → I have a buffer linked to CUDA-accessible memory.
Thanks for taking the time to discuss and for your recommendation but I really have a hard time bringing the pieces together from your answers. Can you please answer the numbered questions in my previous post?
About using a PCIe → V4L2 approach:
The vendor does have a driver for this: the device shows up as /dev/video1 and I can get frames using, for example, the OpenCV VideoReader. But the latency is huge, and because the copy via V4L2 is done into host memory, I still have to use cudaMemcpy to get the frames into device memory for processing by my CUDA kernels. So this is not an answer to my question.
I would really appreciate it if you could answer the numbered questions in my previous post.
You don’t need to get a TC358840. I posted that topic only because you are facing a similar issue.
So the procedure should be:
Enable the PCIe → V4L2:
It’s good to know you have the driver already.
Read camera frames with MMAPI. Once your camera is exposed through the V4L2 interface, you can open it like a generic USB camera.
Check sample 12_camera_v4l2_cuda for details.
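Before wiring the capture card into a V4L2-based pipeline, it can help to confirm what the vendor driver reports for the device node. The following is a minimal sketch using only the standard V4L2 VIDIOC_QUERYCAP ioctl; CaptureCaps and query_capture_caps are hypothetical names, not part of MMAPI or the sample. Streaming I/O support is what the mmap and dma-buf buffer exchange modes in V4L2 rely on, as opposed to plain read().

```cpp
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <linux/videodev2.h>
#include <sys/ioctl.h>
#include <unistd.h>

// What the driver reports for a V4L2 device node.
struct CaptureCaps {
    bool opened = false;         // device node could be opened
    bool video_capture = false;  // single-planar video capture supported
    bool streaming = false;      // streaming I/O (buffer queues) supported
};

CaptureCaps query_capture_caps(const char* device_path) {
    CaptureCaps result;
    const int fd = open(device_path, O_RDWR);
    if (fd < 0) return result;
    result.opened = true;

    v4l2_capability cap;
    std::memset(&cap, 0, sizeof(cap));
    if (ioctl(fd, VIDIOC_QUERYCAP, &cap) == 0) {
        // device_caps is only valid when V4L2_CAP_DEVICE_CAPS is set;
        // otherwise fall back to the driver-wide capabilities field.
        const __u32 caps = (cap.capabilities & V4L2_CAP_DEVICE_CAPS)
                               ? cap.device_caps
                               : cap.capabilities;
        result.video_capture = (caps & V4L2_CAP_VIDEO_CAPTURE) != 0;
        result.streaming = (caps & V4L2_CAP_STREAMING) != 0;
    }
    close(fd);
    return result;
}
```

For the card discussed here, query_capture_caps("/dev/video1") would be the call to try; the result depends entirely on the vendor driver.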
By the way, OpenCV uses CPU-based FFmpeg to decode camera frames, which is slow.
Thanks.
Thanks for the reply. I understand that NvBuffer is an open source class but your response doesn’t answer my question.
I want to access the methods and fields of an NvBuffer like any other object.
But nvbuf_utils doesn’t give me an NvBuffer object, or even a pointer to an NvBuffer object. Instead, it gives me a file descriptor.
For example, I’m searching for the missing piece in the following pseudocode:
int file_descriptor_of_the_nvbuffer_instance;
NvBufferCreate(&file_descriptor_of_the_nvbuffer_instance, .......);
// now I have a file descriptor for the NvBuffer. But this isn't useful to me because I can't access methods and fields with a file descriptor
NvBuffer* pointer_to_the_nvbuffer_instance;
// MISSING PIECE, WANT TO GET THE POINTER TO THE NVBUFFER REPRESENTED BY THE FILE DESCRIPTOR
// now I can access the methods and fields of the NvBuffer
// for example:
std::cout << pointer_to_the_nvbuffer_instance->planes[0].fmt.width << std::endl;
// or another example
pointer_to_the_nvbuffer_instance->planes[0].data = some_other_pointer;
Thanks! That’s the first thing I tried, and that is what I thought I was looking for, but it doesn’t seem to work.
For example, see the following example:
int fd;
NvBufferCreate(&fd, 1920, 1080, NvBufferLayout_Pitch, NvBufferColorFormat_UYVY);
NvBufferParams params;
NvBufferGetParams(fd, &params);
// parameters are set properly here, and I can access the nv_buffer field of the params struct
NvBuffer* nvbuf_ptr = (NvBuffer*) params.nv_buffer;
std::cout << nvbuf_ptr->planes[0].fmt.width << std::endl;
// compiles and runs without error, but outputs garbage data
Hi everyone,
We are currently developing a board with a Xilinx FPGA and a TX2, where the TX2 captures video from the FPGA over PCIe Gen2 x4.
We have a Linux driver that works fine on an Intel x86 CPU, but it has not been tested on the TX2. We are also worried that memory copies could take more time on the TX2 platform. Our driver allocates a 4MB buffer with the Linux “dma_alloc_coherent” API, and the FPGA has a DMA engine that transfers video data into TX2 DRAM. Could we create an NvBuffer (V4L2_MEMORY_DMABUF), get its address, and hand that to the FPGA DMA so it writes into the buffer directly? Or is there another way to reduce the number of memory copies?