PCIe DMA transfer performance issue with custom FPGA board on Jetson TX2

am112 · June 30, 2022, 2:37pm

Hello,

We are encountering a performance issue while using one of our custom FPGA-based PCIe Gen2 x4 boards with a Jetson TX2 module (on the TX2 Dev Kit carrier board).
The custom board is a 64-bit-enabled DMA bus master that works perfectly on an x86 machine - with read (device to CPU) speeds up to 1.2Gbps.
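
As a point of reference for the figures in this post, the raw ceiling of the link itself can be worked out from the PCIe signalling rate. The sketch below is only illustrative arithmetic. The function names are made up for this example, and it covers 8b/10b-encoded generations (Gen1 and Gen2) only.

```c
#include <stdint.h>

/*
 * Line rate after 8b/10b encoding, in Mbit/s.
 * Valid for PCIe Gen1 (2500 MT/s per lane) and Gen2 (5000 MT/s per lane);
 * Gen3 and later use 128b/130b encoding, which this helper does not model.
 * Hypothetical helper name, not part of any kernel or vendor API.
 */
static uint32_t pcie_8b10b_mbit_per_s(uint32_t mega_transfers_per_s,
                                      uint32_t lanes)
{
    /* 8 payload bits are carried for every 10 bits on the wire. */
    return mega_transfers_per_s * lanes * 8u / 10u;
}

/* Convert Mbit/s to MByte/s (integer division). */
static uint32_t mbit_to_mbyte_per_s(uint32_t mbit_per_s)
{
    return mbit_per_s / 8u;
}
```

For a Gen2 x4 link, `pcie_8b10b_mbit_per_s(5000, 4)` gives 16000 Mbit/s, i.e. 2000 MB/s, before TLP/DLLP protocol overhead, so the achievable payload throughput is lower than that.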

However, when plugged into the Tegra, we observe that the speed is halved: approx. 600Mbps.
We are using a custom kernel module with JetPack version 4.6.2, and the DMA mask is indeed 64 bits (set using pci_set_dma_mask and pci_set_consistent_dma_mask).
I cannot share the kernel module source code, but here is how we perform the DMA using our custom SG-list-enabled controller:

1. Userspace passes a pre-allocated buffer to the kernel through an ioctl() system call
2. Call get_user_pages() to pin the user memory
3. Init an SG-list with the pinned user pages
4. Map the list using dma_map_sg()
5. Configure the transfer in our DMA controller (using a small dedicated coherent memory region)
6. dma_sync_sg_for_device()
7. Start the transfer → wait for transfer done (through IRQ)
8. dma_sync_sg_for_cpu()
9. dma_unmap_sg()
10. put_page() on user pages
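
For the pinning and SG-list steps above, the number of entries depends on how the user buffer straddles page boundaries: an unaligned buffer touches one more page than its length suggests, and its first and last entries are partial. The following is a minimal, userspace-testable sketch of that bookkeeping, not our driver code. It assumes 4 KiB pages (a kernel module would use PAGE_SIZE and PAGE_SHIFT instead), and all names in it are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. Kernel code would use PAGE_SHIFT / PAGE_SIZE. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PAGE_SIZE  ((uintptr_t)1 << SKETCH_PAGE_SHIFT)

/* Number of pages touched by the byte range [addr, addr + len). */
static size_t pages_spanned(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first = addr >> SKETCH_PAGE_SHIFT;
    uintptr_t last = (addr + len - 1) >> SKETCH_PAGE_SHIFT;
    return (size_t)(last - first + 1);
}

/*
 * Byte length of the i-th per-page segment of [addr, addr + len).
 * The first and last segments may be shorter than a page.
 * Returns 0 when i is past the last page of the range.
 */
static size_t segment_len(uintptr_t addr, size_t len, size_t i)
{
    if (i >= pages_spanned(addr, len))
        return 0;
    uintptr_t end = addr + len;
    uintptr_t page_base = ((addr >> SKETCH_PAGE_SHIFT) + i) << SKETCH_PAGE_SHIFT;
    uintptr_t seg_start = page_base > addr ? page_base : addr;
    uintptr_t seg_end = page_base + SKETCH_PAGE_SIZE < end
                            ? page_base + SKETCH_PAGE_SIZE
                            : end;
    return (size_t)(seg_end - seg_start);
}
```

For example, a 4096-byte buffer starting at offset 0x800 within a page spans two pages and yields two 2048-byte segments, which is why the page array and the SG table should be sized from the page span rather than from `len / page size`.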

We saw no performance change when removing the dma_sync_sg_for_{cpu,device} calls, or when passing DMA_ATTR_SKIP_CPU_SYNC to the SG map and unmap calls.

However, if we completely skip the SG-list generation and allocate a (quite big, ~30MB) coherent buffer using dma_alloc_coherent(), and use this as the target of the DMA controller, the speed goes up to 1Gbps.
But, when trying to pass this buffer contents to the userspace by using copy_to_user(), it drops down to a mere 0.35Gbps. If I understand correctly, this is because dma_alloc_coherent() will return uncached memory.

We could not make sense of this behavior, but we are not accustomed to the ARM architecture.
We suspect this could be the result of a different cache-coherency system on ARM vs. x86, but could not figure out a way to achieve our desired throughput.

Something we haven’t tried so far is to remove IOMMU support for the PCIe root complex - however we would like our driver to work with a vanilla L4T kernel. Plus we really don’t know how disabling the IOMMU could alter the DMA performance (but it is something that comes up often on the forums, it seems).

Here are our questions:

1. Is there a way to efficiently copy from uncached memory returned by dma_alloc_coherent() to userspace?
2. Is there a way to achieve proper bandwidth using the classic SG-list DMA mechanism?
3. Can the IOMMU be responsible for any performance degradation or improvement regarding PCIe?

Thank you all in advance for your support,
Regards,

kayccc · July 6, 2022, 4:44am

Sorry for the late response, our team will do the investigation and provide suggestions soon. Thanks

am112 · July 12, 2022, 9:57am

Hello,

Any news?

Regards
