I’m trying to do real-time video processing on the Jetson TX1. I’ve got a Magewell ProCapture HDMI (PCIe capture card) connected to the PCIe slot on the Jetson, feeding uncompressed 1920x1080 4:4:4 RGB frames @ approx. 60fps. The card is claimed to be SGDMA-capable and Magewell’s SDK supplies a function which supposedly transfers frames to physical addresses (MWCaptureFrameToPhysicalAddress). I’ve tried/profiled the following methods of transferring the frames to device memory (for processing the data in CUDA kernels):
a) Set the cudaDeviceMapHost flag. Allocate mapped memory on the host side (cudaHostAlloc(… , cudaHostAllocMapped)). Get the device pointer (cudaHostGetDevicePointer()). Use the Magewell API function (MWCaptureVideoFrameToVirtualAddressEx) to transfer the frame to this host memory location. So this is zero-copy (if I’m not wrong).
b) Use the same Magewell API function, but without zero-copy this time (malloc on the host side, cudaMalloc on the device side, cudaMemcpy to transfer).
For (c) and (d): the Magewell device also shows up as a V4L2 video input (/dev/videoX).
c) Malloc mapped memory on the host side (like (a)). Use the OpenCV VideoReader to read frames via the V4L2 interface into the mapped memory slot. So this is zero-copy (if I’m not wrong).
d) Malloc memory on the host side. Use the OpenCV VideoReader to read frames via the V4L2 interface into this host memory and then do cudaMemcpy to the cudaMalloc’ed device memory.
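For reference, here is a minimal sketch of how the zero-copy path (a)/(c) differs from the explicit-copy path (b)/(d) on the CUDA side. It is an illustration under assumptions, not my actual code: fillFakeFrame is a hypothetical stand-in for the capture call (Magewell API or V4L2 read), and the invert kernel is a placeholder for the real processing.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// One 1920x1080 frame with 3 bytes per pixel: 6,220,800 bytes.
constexpr size_t kFrameBytes = size_t{1920} * 1080 * 3;

// Placeholder processing kernel: inverts every byte of the frame.
__global__ void invert(unsigned char* data, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] = 255 - data[i];
}

// Hypothetical stand-in for the capture call: fills the buffer with 0x10.
static void fillFakeFrame(unsigned char* dst) { memset(dst, 0x10, kFrameBytes); }

int main() {
    const unsigned int threads = 256;
    const unsigned int blocks =
        static_cast<unsigned int>((kFrameBytes + threads - 1) / threads);

    // Must be set before the CUDA context is created.
    CUDA_CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // (a)/(c): zero-copy. The kernel accesses the host buffer through an alias.
    unsigned char* hostMapped = nullptr;
    unsigned char* devAlias = nullptr;
    CUDA_CHECK(cudaHostAlloc((void**)&hostMapped, kFrameBytes, cudaHostAllocMapped));
    CUDA_CHECK(cudaHostGetDevicePointer((void**)&devAlias, hostMapped, 0));
    fillFakeFrame(hostMapped);
    invert<<<blocks, threads>>>(devAlias, kFrameBytes);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());  // wait before the host reads the result
    printf("zero-copy:     first byte = 0x%02X\n", hostMapped[0]);

    // (b)/(d): explicit copy from pageable host memory to a cudaMalloc buffer.
    unsigned char* hostPlain = static_cast<unsigned char*>(malloc(kFrameBytes));
    if (hostPlain == nullptr) return EXIT_FAILURE;
    unsigned char* devBuf = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&devBuf, kFrameBytes));
    fillFakeFrame(hostPlain);
    CUDA_CHECK(cudaMemcpy(devBuf, hostPlain, kFrameBytes, cudaMemcpyHostToDevice));
    invert<<<blocks, threads>>>(devBuf, kFrameBytes);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpy(hostPlain, devBuf, kFrameBytes, cudaMemcpyDeviceToHost));
    printf("explicit copy: first byte = 0x%02X\n", hostPlain[0]);

    CUDA_CHECK(cudaFree(devBuf));
    CUDA_CHECK(cudaFreeHost(hostMapped));
    free(hostPlain);
    return 0;
}
```

In both paths the frame lands in host memory first; the only difference is whether the kernel reads it in place or from a copy.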
My question is: all of these methods first write to host memory, and then either I transfer the frame to device memory (via cudaMemcpy) or, in the zero-copy case, CUDA handles the access (I guess?). Is there a way to write these frames directly into device memory, bypassing the host side? I know this would be possible if I were using a GPUDirect-capable GPU, but is there a similar option on the Jetson TX1 that would be faster than methods (a)-(d) above?
By the way the application does the following:
- Take a 1920x1080 RGB frame (so approx. 6MB) into device memory (using one of the methods above)
- Take the FFT of this frame (C2C cuFFT)
- Element-wise multiplication with a complex number
- Take the IFFT of the multiplication (C2C cuFFT)
- Display the new frame on screen.
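The FFT, multiply and IFFT steps above can be sketched roughly as follows for a single colour plane. This is a simplified illustration, not my full application. It assumes the 8-bit channel has already been unpacked into a cufftComplex plane (here a constant test image), that the multiplier is one complex scalar, and it leaves out the display step. Since cuFFT transforms are unnormalized, the 1/(width*height) factor is folded into the multiplier.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>
#include <cuComplex.h>

constexpr int kWidth = 1920;
constexpr int kHeight = 1080;
constexpr int kCount = kWidth * kHeight;  // complex samples per plane

// Abort with a message if a step failed.
static void check(bool ok, const char* what) {
    if (!ok) {
        fprintf(stderr, "%s failed\n", what);
        exit(EXIT_FAILURE);
    }
}

// Multiply every spectrum sample by the same complex factor.
__global__ void scaleByComplex(cufftComplex* data, int n, cufftComplex factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = cuCmulf(data[i], factor);
}

int main() {
    // Test plane: every pixel is 1 + 0i (stand-in for one unpacked channel).
    std::vector<cufftComplex> host(kCount, make_cuFloatComplex(1.0f, 0.0f));
    const size_t bytes = kCount * sizeof(cufftComplex);

    cufftComplex* dev = nullptr;
    check(cudaMalloc((void**)&dev, bytes) == cudaSuccess, "cudaMalloc");
    check(cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess,
          "copy to device");

    // For row-major data the slowest-varying dimension (rows) comes first.
    cufftHandle plan;
    check(cufftPlan2d(&plan, kHeight, kWidth, CUFFT_C2C) == CUFFT_SUCCESS, "cufftPlan2d");

    // Forward FFT, in place.
    check(cufftExecC2C(plan, dev, dev, CUFFT_FORWARD) == CUFFT_SUCCESS, "forward FFT");

    // Multiply by (2 + 0i), with the 1/N normalization folded in.
    const float norm = 1.0f / static_cast<float>(kCount);
    const cufftComplex factor = make_cuFloatComplex(2.0f * norm, 0.0f);
    scaleByComplex<<<(kCount + 255) / 256, 256>>>(dev, kCount, factor);
    check(cudaGetLastError() == cudaSuccess, "kernel launch");

    // Inverse FFT, in place.
    check(cufftExecC2C(plan, dev, dev, CUFFT_INVERSE) == CUFFT_SUCCESS, "inverse FFT");

    check(cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess,
          "copy to host");
    // With a constant input of 1 this should come out close to (2, 0).
    printf("pixel 0 after round trip: (%f, %f)\n", cuCrealf(host[0]), cuCimagf(host[0]));

    cufftDestroy(plan);
    cudaFree(dev);
    return 0;
}
```

This builds with nvcc and needs to be linked against cuFFT (-lcufft). A per-frequency filter instead of a scalar would read the multiplier from a second device array in the kernel.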
Thanks in advance for any help.