I have a device driver on the TX2 that allocates DMA-able coherent memory using dma_alloc_coherent. The application typically allocates two buffers of ~256 MB each, and two FPGA DMA engines simultaneously push data into them in 128 MB chunks. The DMA buffer is memory-mapped by the host application, so there is user-space access to the coherent memory. Once the data is delivered, it would be ideal if the GPU could begin processing it. Currently, the application copies the data into a CUDA managed buffer and then launches CUDA kernels.
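Roughly, the current flow looks like this sketch (the device node, sizes, and kernel are illustrative, and error checking is omitted):

```c
/* current_flow.cu -- sketch of the existing copy-based pipeline */
#include <cuda_runtime.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_BYTES (256UL << 20)                /* ~256 MB DMA buffer */

__global__ void process(const unsigned char *data, size_t n)
{
    /* ... per-element processing ... */
}

int main(void)
{
    /* User-space view of the driver's dma_alloc_coherent buffer. */
    int fd = open("/dev/fpga_dma0", O_RDWR);   /* hypothetical device node */
    unsigned char *dma_buf = (unsigned char *)mmap(NULL, BUF_BYTES,
            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    unsigned char *managed;
    cudaMallocManaged(&managed, BUF_BYTES);

    /* ... wait until the FPGA DMA engines have delivered the data ... */

    /* The copy this thread is trying to eliminate; managed memory is
     * CPU-accessible on the TX2, so a plain memcpy suffices. */
    memcpy(managed, dma_buf, BUF_BYTES);

    process<<<2048, 256>>>(managed, BUF_BYTES);
    cudaDeviceSynchronize();

    munmap(dma_buf, BUF_BYTES);
    close(fd);
    return 0;
}
```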
My questions are:
Is it possible to avoid the copy, given the unified memory on the TX2 (Tegra)? If so, how?
Is GPU pinned memory coherent, such that a DMA engine could deliver data into a pinned-memory buffer? If so, the application could perhaps skip the dma_alloc_coherent step, allocate pinned memory through the CUDA API instead, and point the DMA engine at the CUDA-allocated pinned memory.
Is it possible to label the Linux kernel driver's buffer as “GPU-accessible”, so that memory allocated by dma_alloc_coherent could be accessed by a CUDA kernel?
1. There have been some earlier forum topics about GPU access to a dma_alloc_coherent buffer. The conclusion was that if the buffer is cacheable, it should work with EGL mapping. However, we haven't received any feedback from those users about the result.
2. Pinned memory needs to be page-locked host memory. We don't have much experience with the dma_alloc_coherent buffer in this scenario, so it's recommended to give it a try directly; see the pinned-memory sketch after this list.
3. No. The access needs to go through EGL mapping. You will need to make sure the EGL mapping is working first; see the EGL sketch after this list.
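Regarding point 2, the application-side allocation might look like the sketch below. The FPGA DMA engines would still need a bus address for the pages, which requires driver support (pinning and address translation) that is not shown; the ioctl shown in the comment is hypothetical.

```c
#include <cuda_runtime.h>

#define BUF_BYTES (256UL << 20)

/* Allocate page-locked (pinned) host memory that is also mapped into
 * the device address space. */
unsigned char *alloc_pinned(void)
{
    unsigned char *pinned = NULL;
    cudaHostAlloc((void **)&pinned, BUF_BYTES, cudaHostAllocMapped);

    /* A hypothetical driver ioctl would then pin these pages and
     * program the FPGA DMA engines with their bus addresses, e.g.
     *     ioctl(fd, FPGA_SET_DMA_TARGET, pinned);
     * (not a real interface; driver support is required). */
    return pinned;
}
```

Regarding point 3, here is a rough, unofficial sketch of one EGL mapping path: the kernel driver exports the buffer as a dma-buf fd, EGL imports it as a single-plane image via the EGL_EXT_image_dma_buf_import extension, and CUDA registers the resulting EGLImage. The fourcc choice, dimensions, helper name, and the dma-buf export itself are assumptions, and all error handling is omitted.

```c
#include <cuda_egl_interop.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <drm/drm_fourcc.h>

/* dmabuf_fd: assumed to be exported by the kernel driver for the
 * dma_alloc_coherent buffer (export code not shown). */
void *map_dmabuf_into_cuda(int dmabuf_fd, EGLint width, EGLint height)
{
    EGLDisplay dpy = eglGetDisplay(EGL_DEFAULT_DISPLAY);
    eglInitialize(dpy, NULL, NULL);

    PFNEGLCREATEIMAGEKHRPROC create_image =
        (PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");

    /* Describe the raw buffer as a single-plane 8-bit image. */
    EGLint attrs[] = {
        EGL_WIDTH,                     width,
        EGL_HEIGHT,                    height,
        EGL_LINUX_DRM_FOURCC_EXT,      DRM_FORMAT_R8,
        EGL_DMA_BUF_PLANE0_FD_EXT,     dmabuf_fd,
        EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
        EGL_DMA_BUF_PLANE0_PITCH_EXT,  width,
        EGL_NONE
    };
    EGLImageKHR image = create_image(dpy, EGL_NO_CONTEXT,
                                     EGL_LINUX_DMA_BUF_EXT, NULL, attrs);

    /* Register the EGLImage with CUDA and fetch a device-usable frame. */
    cudaGraphicsResource_t res;
    cudaGraphicsEGLRegisterImage(&res, image, cudaGraphicsRegisterFlagsNone);

    cudaEglFrame frame;
    cudaGraphicsResourceGetMappedEglFrame(&frame, res, 0, 0);

    /* For a pitch-linear frame this is a device pointer that a CUDA
     * kernel can read directly. */
    return frame.frame.pPitch[0].ptr;
}
```

If the registration succeeds, the returned pointer can be passed straight to a CUDA kernel, which is the zero-copy path the original question is after.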
Is there a complete EGL mapping example available? I have not used the EGL API.
cudaHostRegister looked promising, but it is not supported on devices with compute capability less than 7.2, and the Jetson TX2 is compute capability 6.2.
On ARM architectures, dma_alloc_coherent reserves “device”-type memory, which is typically defined as bufferable but non-cacheable, so the “cacheable” requirement from point 1 may not hold here.
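A minimal driver-side sketch of such an allocation and its user-space mapping (names are illustrative and error handling is omitted):

```c
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

#define BUF_BYTES (256UL << 20)

static struct device *fpga_dev;  /* set during probe in a real driver */
static void *cpu_addr;           /* kernel virtual address of the buffer */
static dma_addr_t dma_handle;    /* bus address given to the FPGA DMA engines */

static int fpga_alloc_buf(void)
{
    /* On ARM64 without hardware I/O coherency, the returned mapping
     * is non-cacheable, per the note above. */
    cpu_addr = dma_alloc_coherent(fpga_dev, BUF_BYTES, &dma_handle, GFP_KERNEL);
    return cpu_addr ? 0 : -ENOMEM;
}

/* mmap() handler: gives the application its user-space view. */
static int fpga_mmap(struct file *filp, struct vm_area_struct *vma)
{
    return dma_mmap_coherent(fpga_dev, vma, cpu_addr, dma_handle,
                             vma->vm_end - vma->vm_start);
}
```

dma_mmap_coherent() installs the same non-cacheable attributes in the user mapping, which keeps it coherent with device DMA but is presumably what conflicts with the cacheable requirement mentioned above.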