Slow remote DMA write and read

At first, I disabled the SMMU to check whether the DMA operation would be fast.
Based on the suggestion in https://devtalk.nvidia.com/default/topic/1044034/jetson-tx2/slow-remote-dma-write-and-read/post/5297153/#5297153, I re-enabled the SMMU and made the other necessary modifications shown below:

drivers/pci/host/pci-tegra.c
    +msi->pages = __get_free_pages(GFP_DMA32, 0);

arch/arm64/configs/tegra18_defconfig
    +CONFIG_DEFAULT_DMA_COHERENT_POOL_SIZE=33554432

kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
    +<&{/pcie-controller@10003000} TEGRA_SID_AFI>,
    +#stream-id-cells = <1>;

But remote DMA failed: the device was not able to write to the other device, even when a sleep() was added.
The 3rd parameter of dma_alloc_coherent() (the returned DMA handle) is used as the bus address in the custom driver.
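
For reference, a minimal sketch of how dma_alloc_coherent() hands back both a CPU virtual address and a DMA (bus) address; the function name, buffer size, and device pointer here are assumptions for illustration, not taken from the custom driver:

```c
#include <linux/dma-mapping.h>

/* Hypothetical allocation path in a driver (sketch only). */
static int alloc_dma_buffer(struct device *dev, size_t size,
                            void **cpu_addr, dma_addr_t *dma_addr)
{
        /* The 3rd parameter receives the address the *device* must use.
         * With SMMU enabled this is an IOVA, not the CPU physical
         * address, so it is only valid on the device side. */
        *cpu_addr = dma_alloc_coherent(dev, size, dma_addr, GFP_KERNEL);
        if (!*cpu_addr)
                return -ENOMEM;
        return 0;
}
```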

Output:

Sent Data:
Location - 1, val - 1
Location - 2, val - 2
Location - 3, val - 3
Location - 4, val - 4
Location - 5, val - 5

Received Data:
Location - 1, val - 0
Location - 2, val - 0
Location - 3, val - 0
Location - 4, val - 0
Location - 5, val - 0

With SMMU disabled and a sleep of 1 to 2 seconds added to the application, remote DMA operations succeed.

Output:

Sent Data:
Location - 1, val - 1
Location - 2, val - 2
Location - 3, val - 3
Location - 4, val - 4
Location - 5, val - 5

Received Data:
Location - 1, val - 1
Location - 2, val - 2
Location - 3, val - 3
Location - 4, val - 4
Location - 5, val - 5

Also, some of the APIs of the custom application use physical addresses, such as:

status = AllocMemory( hDrv, local.memSize, &local.memPhysAdrs, &local.memBusAdrs );
printf("1. Physical Memory[DATA] : %llx [%llx] (0x%x)\n", local.memPhysAdrs, local.memBusAdrs, local.memSize);

// Map a physical memory block to virtual space
status = MapMemory( hDrv, local.memPhysAdrs, local.memSize, (PVOID*)&local.hSharedMemory, MM_NONCACHED );
printf("Virtual Address[DATA] : %llx\n", local.hSharedMemory);

status = MemDmaWriteRaw(hDrv, DMA_WAIT_COMPLETION, partner.devId, Channel, local.memBusSrc, partner.memBusAdrs, local.memSize, 0);

So, I wonder how to get the physical address without using virt_to_phys() when SMMU is enabled.

But, for whatever reason, if you are using physical addresses directly (via macros like virt_to_phys(), etc.), then having SMMU enabled won't work.
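
To illustrate the point above (a sketch only; the device pointer and size are hypothetical): with SMMU enabled, virt_to_phys() still returns the CPU physical address, but the device must be programmed with the IOVA produced by the DMA API:

```c
#include <linux/dma-mapping.h>
#include <linux/io.h>

static void show_addresses(struct device *dev, size_t size)
{
        dma_addr_t iova;
        void *vaddr = dma_alloc_coherent(dev, size, &iova, GFP_KERNEL);
        phys_addr_t phys;

        if (!vaddr)
                return;

        /* CPU physical address -- NOT what the device should use
         * when it sits behind an enabled SMMU. */
        phys = virt_to_phys(vaddr);

        /* iova is the bus address the device must be given. */
        dev_info(dev, "phys = %pa, iova = %pad\n", &phys, &iova);

        dma_free_coherent(dev, size, vaddr, iova);
}
```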

Therefore, do you recommend disabling SMMU and using the bus address as the physical address in the above APIs?
Or
Is it better to keep SMMU enabled and use an API to get the physical address?

With SMMU enabled, I don’t think there are any standard APIs available to get the physical address (as this is not in line with the kernel's PCIe device-driver writing approach). So, yes, if you want to work with physical addresses, please keep SMMU disabled.

Hi Vidyas,

Thank you for the explanation.

I suppose it is not a cache-coherency issue, as dma_alloc_coherent() returns non-cacheable memory.
Is there any other factor that affects cache coherence?

Also, I noticed the same slow remote write/read behavior even when DMA is disabled.

I wonder if it is related to performance.

I don’t suspect this to be a Tegra-specific issue, as we have a gazillion cards working fine both with and without SMMU enabled. Can you look more into whether your FPGA is doing a delayed write or something similar?

Hi,

Thank you for your response.

From one of the discussion threads, I learned that mmap() may end up mapping the memory as cacheable even though dma_alloc_coherent() is used.

To avoid this, I used pgprot_noncached() as follows:

// User space
mappedAdr = mmap(0, len+poff1, PROT_READ|PROT_WRITE, MAP_SHARED, handle, st-poff1);

// Kernel space
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                    vma->vm_end - vma->vm_start,
                    vma->vm_page_prot))
        return -EAGAIN;

But it resulted in:

Unhandled fault: alignment fault (0x92000061) at 0x0000007f9c453000

Please note that SMMU is disabled.
Also, there was no issue when the same modification was made and run on a PC.

Could you please comment on this?
Is it recommended to replace remap_pfn_range() with dma_mmap_coherent()?

Apologies for the delay in replying.
I am wondering why “vma->vm_pgoff” is used as the third argument to the remap_pfn_range() API. Shouldn’t the physical address (as a page frame number) be passed here?
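
For comparison, here is a hedged sketch of the dma_mmap_coherent() approach asked about earlier. It lets the DMA core derive the correct PFN and page protection for the coherent buffer itself, instead of passing vma->vm_pgoff to remap_pfn_range() manually; the struct name and stored fields are assumptions for illustration:

```c
#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Hypothetical per-device state holding the coherent buffer. */
struct my_dev {
        struct device *dev;
        void          *cpu_addr;  /* from dma_alloc_coherent() */
        dma_addr_t     dma_addr;  /* bus/IOVA handle           */
        size_t         size;
};

static int my_mmap(struct my_dev *md, struct vm_area_struct *vma)
{
        /* dma_mmap_coherent() maps the buffer allocated with
         * dma_alloc_coherent() into user space, handling the PFN
         * lookup and cacheability attributes internally. */
        return dma_mmap_coherent(md->dev, vma, md->cpu_addr,
                                 md->dma_addr, md->size);
}
```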