Our application uses a large memory-mapped region as a circular buffer that a PCIe card fills by DMA and the application reads. The Nsight profiler shows that a memcpy from this buffer runs much slower than a memcpy between malloc’ed buffers in the same application, so we can’t reach the performance level we need. The likely cause is that memory obtained from dma_alloc_coherent() is non-cacheable on this platform, since the same application and driver work fine under Ubuntu on x86_64 hardware.
We are using JetPack 4.2.2 on an Xavier. The TX2 also shows low performance.
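For anyone who wants to reproduce the comparison without Nsight, here is a minimal user-space sketch of the kind of measurement involved. It is not our actual test code: the function names memcpy_mib_per_s and compare_copy_rates are made up for illustration, and the src argument stands in for the pointer to the mmap'ed DMA ring.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy `len` bytes from src to dst `iters` times; return throughput in MiB/s.
 * Returns 0.0 if the elapsed time is too short for the clock to resolve. */
static double memcpy_mib_per_s(void *dst, const void *src, size_t len, int iters)
{
    struct timespec t0, t1;
    volatile unsigned char sink = 0;

    assert(dst && src && len > 0 && iters > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        memcpy(dst, src, len);
        /* Read one byte back so the compiler is less likely to drop the copy. */
        sink ^= ((unsigned char *)dst)[(size_t)i % len];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return 0.0;
    return (double)len * (double)iters / secs / (1024.0 * 1024.0);
}

/* Print the copy rate out of `src` next to a malloc-to-malloc baseline.
 * `src` is where the mmap'ed DMA ring would be passed in (hypothetical usage).
 * Returns 0 on success, -1 if allocation fails. */
static int compare_copy_rates(const void *src, size_t len, int iters)
{
    void *dst = malloc(len);
    void *base = malloc(len);

    if (!dst || !base) {
        free(dst);
        free(base);
        return -1;
    }
    /* Touch every page first so page faults are not part of the timing. */
    memset(dst, 0, len);
    memset(base, 1, len);

    printf("malloc -> malloc: %.1f MiB/s\n",
           memcpy_mib_per_s(dst, base, len, iters));
    printf("src    -> malloc: %.1f MiB/s\n",
           memcpy_mib_per_s(dst, src, len, iters));

    free(dst);
    free(base);
    return 0;
}
```

In the real application, src would be the pointer returned by mmap() on the driver's device node and len the size of the ring; a large gap between the two printed rates would be consistent with the buffer being mapped uncached.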
To port our driver to ARM, we disabled the SMMU as discussed in this forum: we removed iommus = <&smmu TEGRA_SID_PCIE5>; and dma-coherent; from the device tree. We’re running this way because we haven’t yet found the recipe for running with the IOMMU enabled (we were getting “Unhandled context fault” errors).
I’m looking for suggestions on how to proceed.
Would running with the IOMMU enabled help? (I’d like to fix this, in any case, to avoid having to modify the device tree.)
Would it be better to switch to streaming DMA mappings instead of a coherent allocation?
I’ve started looking into the NvBuffer code suggested in another post.
Any suggestions on what else I should look into? Thanks.