PCIe DMA transfer performance issue with custom FPGA board on Jetson TX2


We are encountering a performance issue while using one of our custom FPGA-based PCIe Gen2 x4 boards with a Jetson TX2 module (on the TX2 Dev Kit carrier board).
The custom board is a 64-bit-capable DMA bus master that works perfectly on an x86 machine, with read (device-to-CPU) speeds of up to 1.2 Gbps.

However, when plugged into the Tegra, we observe roughly half that speed: approx. 600 Mbps.
We are using a custom kernel module on JetPack 4.6.2, and the DMA mask is indeed 64 bits (set using pci_set_dma_mask() and pci_set_consistent_dma_mask()).
I cannot share the kernel module source code, but here is how we perform the DMA with our custom SG-list-capable controller:
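For reference, the mask setup in our probe() is the standard pattern (names are illustrative, not our actual code); on recent kernels the two calls can be collapsed into dma_set_mask_and_coherent():

```c
/* In probe(): advertise 64-bit addressing for both streaming
 * and coherent mappings, as described above. */
if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) ||
    pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64))) {
        dev_err(&pdev->dev, "no suitable DMA mask available\n");
        return -ENODEV;
}
```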

  • userspace passes a pre-allocated buffer to the kernel through an ioctl() system call
  • get_user_pages() pins the user memory
  • an SG list is initialized from the pinned user pages
  • the list is mapped with dma_map_sg()
  • the transfer is configured in our DMA controller (using a small dedicated coherent memory region)
  • dma_sync_sg_for_device()
  • the transfer is started → wait for transfer done (through IRQ)
  • dma_sync_sg_for_cpu()
  • dma_unmap_sg()
  • put_page() on the user pages

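For clarity, the steps above correspond roughly to the following sketch (error handling elided; uaddr, len and the surrounding variables are placeholders, not our actual driver code):

```c
/* Pin the user buffer passed in via the ioctl. */
nr_pages = get_user_pages_fast(uaddr, nr, 1 /* write */, pages);

/* Build an SG table from the pinned pages; contiguous pages are
 * coalesced into fewer SG entries. */
sg_alloc_table_from_pages(&sgt, pages, nr_pages,
                          uaddr & ~PAGE_MASK, len, GFP_KERNEL);

/* Map for device-to-CPU (read) transfers. */
nents = dma_map_sg(&pdev->dev, sgt.sgl, sgt.nents, DMA_FROM_DEVICE);

/* ... program the controller with the 'nents' mapped segments ... */
dma_sync_sg_for_device(&pdev->dev, sgt.sgl, sgt.nents, DMA_FROM_DEVICE);
/* ... start the transfer, wait for the completion IRQ ... */
dma_sync_sg_for_cpu(&pdev->dev, sgt.sgl, sgt.nents, DMA_FROM_DEVICE);

dma_unmap_sg(&pdev->dev, sgt.sgl, sgt.nents, DMA_FROM_DEVICE);
for (i = 0; i < nr_pages; i++)
        put_page(pages[i]);
```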
We could not measure any performance impact when removing the dma_sync_sg_for_{cpu,device} calls, nor when passing DMA_ATTR_SKIP_CPU_SYNC to the SG map/unmap calls.

However, if we skip the SG-list generation entirely and instead allocate a fairly large (~30 MB) coherent buffer with dma_alloc_coherent(), using it as the target of the DMA controller, the speed goes up to 1 Gbps.
But when we try to pass this buffer's contents to userspace with copy_to_user(), throughput drops to a mere 0.35 Gbps. If I understand correctly, this is because dma_alloc_coherent() returns uncached memory on this platform.
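For context, the coherent-buffer variant looks roughly like this (again a sketch with placeholder names); since dma_alloc_coherent() typically returns non-cacheable memory on ARM64, every load in the copy_to_user() path goes straight to DRAM, which is consistent with the collapse to ~0.35 Gbps:

```c
/* ~30 MB coherent buffer used directly as the DMA target. */
vaddr = dma_alloc_coherent(&pdev->dev, DMA_BUF_SIZE,
                           &dma_handle, GFP_KERNEL);

/* ... program the controller with dma_handle, start the transfer,
 * wait for the completion IRQ ... */

/* This copy reads from uncached memory, so the loads never hit
 * the CPU caches -- this is where the throughput drops. */
if (copy_to_user(ubuf, vaddr, xfer_len))
        ret = -EFAULT;
```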

We could not make sense of this behavior, but we are not accustomed to the ARM architecture.
We suspect it is the result of the different cache-coherency model on ARM vs. x86, but we could not figure out how to reach our target throughput.

Something we haven't tried so far is disabling IOMMU translation for the PCIe root complex; however, we would like our driver to work with a vanilla L4T kernel. Also, we really don't know how disabling the IOMMU would affect DMA performance (though it seems to come up often on these forums).

Here are our questions:

  • Is there a way to efficiently copy from the uncached memory returned by dma_alloc_coherent() to userspace?
  • Is there a way to achieve proper bandwidth using the classic SG-list DMA mechanism?
  • Can the IOMMU be responsible for any PCIe DMA performance degradation or improvement?

Thank you all in advance for your support,

Sorry for the late response, our team will do the investigation and provide suggestions soon. Thanks


Any news?