At first, I disabled the SMMU to check whether DMA operations were fast.
Then, based on the suggestion at https://devtalk.nvidia.com/default/topic/1044034/jetson-tx2/slow-remote-dma-write-and-read/post/5297153/#5297153, I enabled the SMMU and made the other necessary modifications, as shown below:
drivers/pci/host/pci-tegra.c
+msi->pages = __get_free_pages(GFP_DMA32, 0);
arch/arm64/configs/tegra18_defconfig
+CONFIG_DEFAULT_DMA_COHERENT_POOL_SIZE=33554432
kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
+<&{/pcie-controller@10003000} TEGRA_SID_AFI>,
+#stream-id-cells = <1>;
However, remote DMA then failed: the data was never written to the other device, even with a sleep() added to the application.
The custom driver uses the address returned through the third parameter (dma_handle) of dma_alloc_coherent() as the bus address.
Output:
Sent Data:
Location - 1, val - 1
Location - 2, val - 2
Location - 3, val - 3
Location - 4, val - 4
Location - 5, val - 5
Received Data:
Location - 1, val - 0
Location - 2, val - 0
Location - 3, val - 0
Location - 4, val - 0
Location - 5, val - 0
With the SMMU disabled and a sleep of 1 to 2 seconds added to the application, remote DMA operations succeed.
Output:
Sent Data:
Location - 1, val - 1
Location - 2, val - 2
Location - 3, val - 3
Location - 4, val - 4
Location - 5, val - 5
Received Data:
Location - 1, val - 1
Location - 2, val - 2
Location - 3, val - 3
Location - 4, val - 4
Location - 5, val - 5
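For reference, here is a rough userspace sketch of how the fixed sleep could be replaced by polling with a timeout, followed by the same sent/received comparison as in the outputs above. This is not the code from my application: the names wait_for_dma and count_mismatches are hypothetical, and it assumes that the receive buffer is zeroed before the transfer, that the remote side writes a non-zero value to the last location, and that the last location is the last one to arrive.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: busy-poll the last word of the receive buffer until
 * it becomes non-zero or timeout_s seconds have passed.
 * Returns 0 once data is seen, -1 on timeout or for an empty buffer.
 * Assumes the buffer was zeroed before the remote DMA write started. */
int wait_for_dma(const volatile uint32_t *buf, size_t count, double timeout_s)
{
    time_t start = time(NULL);

    if (count == 0)
        return -1;

    for (;;) {
        /* volatile forces a fresh read of the DMA target on every pass */
        if (buf[count - 1] != 0)
            return 0;
        if (difftime(time(NULL), start) >= timeout_s)
            return -1;
    }
}

/* Hypothetical helper: count the locations where the received data
 * differs from the data that was sent. */
size_t count_mismatches(const uint32_t *sent, const volatile uint32_t *recv,
                        size_t count)
{
    size_t bad = 0;

    for (size_t i = 0; i < count; i++) {
        if (sent[i] != recv[i])
            bad++;
    }
    return bad;
}
```

With something like this, the application would call wait_for_dma() on the mapped receive buffer with a timeout of about 2 seconds instead of sleeping unconditionally, and then report count_mismatches(). In the SMMU-enabled case above it would time out, since all received values stay 0.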