Is there a way to let both the device (GPU) and host (CPUs: 4 ARM + 2 Denver) see the same region of...

Hi, all

 I am conducting algorithm research on our customized TX2 platform, whose topology is as follows:
   Video-Grabber <==Camera-Link-bus==> FPGA <====PCIE x4 Gen2====> TX2

 The FPGA grabs the video stream and writes it directly into the TX2's LPDDR4. Once a video frame is
 complete, the FPGA notifies the TX2's CPU complex (4 ARM + 2 Denver) by raising an interrupt.

 Note: there is a single physical signal channel between the FPGA and the TX2, namely PCIe, and we have
       implemented everything needed over it, including video transfer, self-defined register accesses,
       interrupt raising, etc.

 But there is still one problem left unresolved:
 we want the device (the GPU's stream units) to see the chunk of LPDDR4 that holds the video contents.
 Inside the Linux driver (for the FPGA, acting on behalf of the PCIe endpoint inside the FPGA), physically
 contiguous memory is allocated as the frame buffer. The base address of the allocated memory (already
 converted to a bus address) is then passed to the FPGA so that the logic inside it can write the video
 stream directly into that memory. An incoming interrupt means a complete frame has been collected.

 On the other hand, a call to 'mmap' on our driver exposes the driver-allocated memory to user space,
 i.e., a user-mode virtual address, marked 'VA', is returned.
 Next, we want to process the video in place: we don't want any redundant copies of the video content
 from one area of LPDDR4 to another, yet such copies appear to be necessary in normal CUDA development.

 We have tried pinned memory, but the TX2 does not seem to support the call to 'cudaHostRegister'
 (an error is returned, which suggests that on the TX2 platform the device cannot touch host-allocated
 memory). So the device (GPU) fails to see the user-mode VA (i.e., the base address of the video buffer
 inside the driver, returned by the call to 'mmap' on our Linux driver for the FPGA's PCIe endpoint)
 directly.

 So, is there another way to implement in-place processing, or zero-copy between the host and the device?


Please try RDMA for your requirement. This would require the application to be partially rewritten.

Finer details on the workflow with RDMA:
1. Create CUDA memory (as large as needed).
2. Mark the CUDA buffer as FPGA-accessible with the CUDA API (the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS flag in cuPointerSetAttribute).
3. Import the underlying GPU VA->PA mapping using the nvmap dma-buf APIs.
4. Have the FPGA fill the image.
5. Synchronize (ensure that the image is complete, any FPGA-side caches are flushed, etc.).
6. Launch the CUDA kernel.
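The CUDA side of steps 1, 2 and 6 might look like the following driver-API sketch. The frame size is an
assumption, error handling is abbreviated, and this only runs on a CUDA-capable Jetson:

```cpp
// Sketch of steps 1, 2 and 6 with the CUDA driver API (Jetson only).
#include <cuda.h>
#include <cstdio>

#define CHECK(call) do { CUresult r = (call); \
    if (r != CUDA_SUCCESS) { std::printf("error %d at %s\n", r, #call); return 1; } \
} while (0)

int main()
{
    const size_t FRAME_BYTES = 1920 * 1080 * 2;  // assumed frame size
    CUdevice dev; CUcontext ctx; CUdeviceptr frame;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    // Step 1: allocate the frame buffer in CUDA memory.
    CHECK(cuMemAlloc(&frame, FRAME_BYTES));

    // Step 2: request synchronized memory operations on this buffer so
    // writes by the FPGA and accesses by CUDA stay coherent.
    unsigned int flag = 1;
    CHECK(cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, frame));

    // Step 3 happens in a custom kernel module and is not shown here: it
    // hands the physical pages behind 'frame' to the FPGA.
    // Steps 4-5: the FPGA fills the buffer and signals frame completion.
    // Step 6: launch the processing kernel on 'frame' -- in place, no copy.

    CHECK(cuMemFree(frame));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```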

Steps 4 and 6 access the same underlying physical memory, which eliminates the need to copy.
Step 3 requires a custom kernel module that calls into Tegra’s NvMap ioctls to import memory.
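Assuming the allocation can be exported to that module as a dma-buf file descriptor (the Tegra/NvMap-specific
export step is not shown), the generic kernel-side import path might look like this sketch; 'mydev' stands
for the FPGA's PCIe device:

```c
/* Sketch of the kernel-module side of step 3, assuming a dma-buf fd for
 * the buffer is already available.  Only standard dma-buf APIs are used;
 * teardown is indicated in a comment. */
#include <linux/dma-buf.h>
#include <linux/scatterlist.h>

static int import_frame_buffer(struct device *mydev, int fd)
{
    struct dma_buf *dbuf;
    struct dma_buf_attachment *attach;
    struct sg_table *sgt;
    struct scatterlist *sg;
    int i;

    dbuf = dma_buf_get(fd);                 /* take a reference          */
    if (IS_ERR(dbuf))
        return PTR_ERR(dbuf);

    attach = dma_buf_attach(dbuf, mydev);   /* attach our PCIe device    */
    if (IS_ERR(attach)) {
        dma_buf_put(dbuf);
        return PTR_ERR(attach);
    }

    /* Map for device DMA; the FPGA writes into memory, hence
     * DMA_FROM_DEVICE.  The result is the list of physical pages. */
    sgt = dma_buf_map_attachment(attach, DMA_FROM_DEVICE);
    if (IS_ERR(sgt)) {
        dma_buf_detach(dbuf, attach);
        dma_buf_put(dbuf);
        return PTR_ERR(sgt);
    }

    for_each_sg(sgt->sgl, sg, sgt->nents, i) {
        dma_addr_t bus = sg_dma_address(sg);
        unsigned int len = sg_dma_len(sg);
        /* program (bus, len) into the FPGA's DMA descriptor list here */
        (void)bus; (void)len;
    }

    /* teardown: dma_buf_unmap_attachment(), dma_buf_detach(), dma_buf_put() */
    return 0;
}
```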

One question: does the FPGA require the allocated memory to be physically contiguous?
CUDA allocations (in step 1) don't give any such guarantee;
step 3 returns the list of physical pages where the memory is mapped.
If contiguity is required, RDMA + CUDA will not help address this requirement.



RDMA is not supported on the TX2, hence the nvidia_p2p_ functions will fail because the device is unsupported.
I can perform steps 1 and 2, but I am still trying to figure out how to implement step 3. Do you have an example in which nvmap can be used to get from a user-space device pointer to the physical/kernel page addresses so that I can map them for PCIe DMA?



You can find the kernel sources for the TX2 on our download page: