I have a device driver on the TX2 that allocates DMA-able coherent memory with dma_alloc_coherent. The application typically allocates two buffers of ~256 MB each, and two FPGA DMA engines simultaneously push data into them in 128 MB chunks. The DMA buffers are memory-mapped into the host application, so user space has access to the coherent memory. Once the data has been delivered, it would be ideal if the GPU could begin processing it. Currently, the application copies the data into a CUDA managed buffer and then launches CUDA kernels.
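For reference, the current flow looks roughly like this (sketch only; the device node name “/dev/fpga_dma”, the mmap offset, and the completion-wait step are placeholders for the real driver interface):

```cuda
#include <cuda_runtime.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (256UL << 20)  /* ~256 MB per DMA buffer */

__global__ void process(unsigned char *data, size_t n) { /* ... */ }

int main(void)
{
    /* Map the driver's dma_alloc_coherent buffer into user space. */
    int fd = open("/dev/fpga_dma", O_RDWR);
    unsigned char *dma_buf =
        (unsigned char *)mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);

    /* The extra copy I would like to eliminate: DMA buffer ->
       CUDA managed buffer, then kernel launch. */
    unsigned char *managed;
    cudaMallocManaged(&managed, BUF_SIZE);
    /* ... wait for the FPGA DMA engines to signal completion ... */
    memcpy(managed, dma_buf, BUF_SIZE);

    process<<<256, 256>>>(managed, BUF_SIZE);
    cudaDeviceSynchronize();

    cudaFree(managed);
    munmap(dma_buf, BUF_SIZE);
    close(fd);
    return 0;
}
```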
My questions are:
Is it possible to avoid the copy, given the unified memory on the TX2 (Tegra)? If so, how?
Is GPU pinned memory coherent, such that a DMA engine could deliver data directly into a pinned buffer? If so, the application could perhaps skip the dma_alloc_coherent step, allocate pinned memory through the CUDA API instead, and point the DMA engine at the CUDA-allocated pinned memory.
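What I have in mind for this question would look something like the sketch below. The ioctl name FPGA_SET_DMA_TARGET is purely hypothetical; presumably the driver would have to resolve the physical pages behind the CUDA allocation itself (get_user_pages or similar) before programming the DMA engine:

```cuda
#include <cuda_runtime.h>

#define BUF_SIZE (256UL << 20)  /* ~256 MB */

__global__ void process(unsigned char *data, size_t n) { /* ... */ }

int main(void)
{
    /* Allocate pinned, device-mapped host memory through CUDA. */
    unsigned char *host_buf;
    cudaHostAlloc(&host_buf, BUF_SIZE, cudaHostAllocMapped);

    /* On Tegra the CPU and GPU share DRAM, so this should yield a
       device-side alias of the same allocation, with no copy. */
    unsigned char *dev_buf;
    cudaHostGetDevicePointer((void **)&dev_buf, host_buf, 0);

    /* The driver would then be pointed at host_buf, e.g. via a
       custom ioctl (FPGA_SET_DMA_TARGET is a placeholder):
       ioctl(fd, FPGA_SET_DMA_TARGET, host_buf);                  */

    /* ... wait for DMA completion, then process in place ... */
    process<<<256, 256>>>(dev_buf, BUF_SIZE);
    cudaDeviceSynchronize();

    cudaFreeHost(host_buf);
    return 0;
}
```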
Is it possible to mark the Linux kernel driver’s memory buffer as “GPU-accessible”, so that the memory allocated by dma_alloc_coherent could be accessed by a CUDA kernel?
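The closest CUDA API I can think of for this would be cudaHostRegister on the mmapped coherent buffer, though I am not sure it is supported on the TX2’s Tegra generation, hence the error check. Sketch only (“/dev/fpga_dma” is again a placeholder):

```cuda
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (256UL << 20)  /* ~256 MB */

int main(void)
{
    int fd = open("/dev/fpga_dma", O_RDWR);
    void *dma_buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

    /* Try to register the driver-owned mapping with CUDA so kernels
       can dereference it directly; must check the return code since
       cudaHostRegister may be unsupported on this platform. */
    cudaError_t err = cudaHostRegister(dma_buf, BUF_SIZE,
                                       cudaHostRegisterMapped);
    if (err == cudaSuccess) {
        void *dev_ptr;
        cudaHostGetDevicePointer(&dev_ptr, dma_buf, 0);
        /* ... launch kernels on dev_ptr ... */
        cudaHostUnregister(dma_buf);
    }

    munmap(dma_buf, BUF_SIZE);
    close(fd);
    return 0;
}
```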