Hi, all
I am doing algorithm research on our custom TX2 platform, which has the following topology:
Video-Grabber <==Camera-Link-bus==> FPGA <====PCIe x4 Gen2====> TX2
The FPGA grabs the video stream and writes it directly into the TX2's LPDDR4. Once a video frame is complete, the
FPGA notifies the TX2's CPU complex (4 ARM + 2 Denver) by raising an interrupt.
-------------------------------------------------------------------------------------------------------------------
Note: PCIe is the only physical signal channel between the FPGA and the TX2. We have already implemented everything
we need over the PCIe bus, including video transfer, access to our custom registers, interrupt delivery, etc.
-------------------------------------------------------------------------------------------------------------------
But there is still one unresolved problem:
We want the device (GPU stream units) to see the region of LPDDR4 that holds the video content.
Inside the Linux driver for the FPGA (i.e., for the PCIe endpoint inside the FPGA), physically contiguous memory
is allocated as the frame buffer. The base address of that memory (already converted to a bus address) is then
passed to the FPGA so that its logic can write the video stream directly into the buffer. When an interrupt
arrives, a complete frame has been collected.
On the user side, calling 'mmap' on our driver exposes the driver-allocated memory to user space,
i.e., it returns a user-mode virtual address, referred to below as 'VA'.
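For illustration, here is a minimal sketch of what such a mapping typically looks like from user space. The
device node name in the comment ('/dev/fpga_video') and the helper names 'map_frame_buffer' and
'unmap_frame_buffer' are placeholders made up for this example, not the real names in our driver, and the sketch
assumes the driver's 'mmap' handler maps the frame buffer at offset 0.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of the driver's frame buffer into user space.
 * `path` is a hypothetical device node such as "/dev/fpga_video";
 * the real name depends on the driver.
 * Returns the user-mode virtual address ('VA'), or NULL on failure. */
void *map_frame_buffer(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    /* MAP_SHARED: the mapping refers to the same pages the driver exposes,
     * rather than a private copy-on-write view. Offset 0 is an assumption. */
    void *va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* The mapping remains valid after the descriptor is closed. */
    close(fd);

    return (va == MAP_FAILED) ? NULL : va;
}

/* Release a mapping obtained from map_frame_buffer().
 * Returns 0 on success, -1 on failure (same as munmap). */
int unmap_frame_buffer(void *va, size_t len)
{
    return munmap(va, len);
}
```

The pointer returned by 'map_frame_buffer' is the 'VA' discussed in the rest of this post.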
Next, we want to process the video in place. That is, we do not want any redundant copies of the video content
from one area of the LPDDR4 to another, yet this appears to be necessary in the normal CUDA development
flow.
We have tried pinned memory, but it seems that the TX2 does not support 'cudaHostRegister': the call returns
an error, which we take to mean that on the TX2 platform the device cannot touch host-allocated memory.
So the device (GPU) cannot directly see the user-mode VA (i.e., the base address of the video buffer inside the
driver, returned by 'mmap' on our Linux driver for the FPGA's PCIe endpoint).
So, is there another way to implement in-place processing, or zero-copy, between the host and the device?