Slow CPU access to memory-mapped DMA buffer

Our application uses a large memory-mapped region as a circular buffer that is filled by DMA from a PCIe card and read by the application. The Nsight profiler shows that a memcpy from this buffer runs much slower than a memcpy between malloc’ed buffers in the application. As a result, we can’t reach the performance level we need. The likely cause is that memory obtained from dma_alloc_coherent() is non-cacheable on this platform, since the same application and driver work fine under Ubuntu on x86_64 hardware.
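For anyone who wants to reproduce the comparison outside the profiler, here is a minimal sketch of a timing helper. The name copy_throughput_mb_s and its parameters are made up for illustration and are not taken from the application discussed here. It times repeated memcpy calls with CLOCK_MONOTONIC and reports MB/s, so the same call can be pointed first at two malloc’ed buffers and then at the mmap’ed DMA buffer as the source.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict -std modes */
#endif
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: copy `len` bytes from `src` to `dst` `iters` times
 * and return the observed throughput in MB/s (1 MB = 1e6 bytes).
 * Returns -1.0 on invalid arguments or if the elapsed time could not be
 * measured.
 */
double copy_throughput_mb_s(void *dst, const void *src, size_t len, int iters)
{
    struct timespec t0, t1;

    if (dst == NULL || src == NULL || len == 0 || iters <= 0)
        return -1.0;
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;

    for (int i = 0; i < iters; i++) {
        memcpy(dst, src, len);
        /* Compiler barrier (GCC/Clang) so the copies are not optimized away. */
        __asm__ __volatile__("" : : "r"(dst) : "memory");
    }

    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return -1.0;

    return ((double)len * (double)iters) / secs / 1e6;
}
```

Use a length well above the last-level cache size when comparing, otherwise the malloc-to-malloc number mostly measures the cache rather than memory.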

We are using JetPack 4.2.2 on an Xavier. The TX2 also shows low performance.

In order to port our driver to ARM, we disabled the SMMU as discussed in this forum. (We removed iommus = <&smmu TEGRA_SID_PCIE5>; and dma-coherent; from the device tree.) We’re running this way since we haven’t yet found the recipe for running with the IOMMU enabled. (We were getting Unhandled context faults.)

I’m looking for suggestions on how to proceed.

Would running with the IOMMU enabled help? (I’d like to fix this, in any case, to avoid having to modify the device tree.)

Would it be better to switch to streaming buffer operation?

I’ve started looking into the NvBuffer code suggested in another post.

Any suggestions on what else I should look into? Thanks.

Is the buffer
a) allocated in the driver and then exposed to user space, or
b) allocated in user space and then mapped so that the PCIe device can write data into it?
In either case, which procedure and APIs are used to do this? That might give us a clue.

Also, disabling the SMMU will certainly hurt performance. If you get context faults with the SMMU enabled, something is certainly wrong in the driver (i.e., the driver is not following the PCIe device driver writing guidelines). You may have to fix that as well.

The buffer is allocated in the driver. An ioctl passes down a dma_memory_handle_t containing the size, the driver does the allocation (via dma_zalloc_coherent), and the buffer address and its virt_to_phys() address are passed back up in the dma_memory_handle_t.

The file_operations.mmap function performs:

vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
remap_pfn_range(vma, vma->vm_start, context->dmas[idx].paddr >> PAGE_SHIFT, vma->vm_end - vma->vm_start, vma->vm_page_prot);

I’d really like to stop disabling the SMMU. Can you give me a pointer to the PCIe device driver writing guidelines? I inherited this project and don’t have the original author’s materials. I looked into this before. See
https://devtalk.nvidia.com/default/topic/1049733/jetson-tx2/ubuntu-pcie-driver-port-to-l4t-gets-unhandled-context-fault/post/5372519/#5372519 and https://devtalk.nvidia.com/default/topic/1060880/jetson-agx-xavier/pcie-dma-driver-compatibility-with-xavier-smmu-iommu-/post/5372539/#5372539. We got the SMMU disable to work and hadn’t followed up yet.

I can make the driver code available privately if that would help. Note: This is the same driver we currently use successfully on Ubuntu on x86_64.

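virt_to_phys() is not guaranteed to work with the SMMU enabled. It returns a CPU physical address, whereas the device has to be given the bus address returned by the DMA API. It may work fine on x86 but doesn’t have to on other platforms (certainly not on Tegra). I think this needs to be fixed in the driver.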
You can refer to https://www.kernel.org/doc/Documentation/DMA-API.txt on how to use DMA APIs.
I didn’t fully understand the following:
“An ioctl is used to pass down a dma_memory_handle_t with the size, do the alloc (via dma_zalloc_coherent), and pass the address and virt_to_phys() addresses back up in the dma_memory_handle_t.”

We only get a valid dma_handle after dma_zalloc_coherent() returns, right? So what is being passed down in the ioctl before dma_zalloc_coherent() is called?

You can share your driver with me privately, and I can take a look at it.

Regarding the discussion of the slow memory access:
Let me try to fix my description of the buffer allocation. The buffer is allocated in the driver as part of this flow: After the application opens the device, the buffer size is passed down via an ioctl where the buffer is allocated (with dma_zalloc_coherent). The driver initializes a structure with various address information which is then copied back to userspace. In userspace, mmap then invokes the driver again to use pgprot_noncached() and remap_pfn_range() to map the buffer into userspace.
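To make the userspace half of that flow concrete, here is a rough sketch of the open and mmap steps. The struct dma_map type and the dma_map_open/dma_map_close helpers are hypothetical names for illustration, not our actual code. The sizing ioctl is only marked with a comment, because its request number and the dma_memory_handle_t layout are specific to our driver. Offset 0 and a read-only mapping are also assumptions.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one mapped DMA buffer. */
struct dma_map {
    int    fd;    /* open descriptor for the device node, -1 if closed */
    void  *addr;  /* userspace address of the mapping, NULL if unmapped */
    size_t len;   /* length of the mapping in bytes */
};

/*
 * Open the device node at `path` and map `len` bytes of its buffer
 * read-only. Returns 0 on success, -1 on failure.
 */
int dma_map_open(struct dma_map *m, const char *path, size_t len)
{
    m->fd = -1;
    m->addr = NULL;
    m->len = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /*
     * A driver-specific ioctl would go here to pass the requested size
     * down and trigger the allocation. It is omitted because the request
     * code and handle structure are not part of this sketch.
     */

    /* This call ends up in the driver's file_operations.mmap handler. */
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    m->fd = fd;
    m->addr = p;
    m->len = len;
    return 0;
}

/* Unmap the buffer and close the device. Safe to call on a closed map. */
void dma_map_close(struct dma_map *m)
{
    if (m->addr != NULL)
        munmap(m->addr, m->len);
    if (m->fd >= 0)
        close(m->fd);
    m->fd = -1;
    m->addr = NULL;
    m->len = 0;
}
```

Whether the resulting mapping is cached, write-combined, or uncached for the CPU is decided entirely by what the driver’s mmap handler puts in vm_page_prot, not by anything in this userspace code.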

Regarding not disabling the SMMU:
I’ve read through DMA-API.txt and looked back through the forum. In https://devtalk.nvidia.com/default/topic/1044034/jetson-tx2/slow-remote-dma-write-and-read/1 you said, “with SMMU enabled, any allocation/mapping to be used by PCIe endpoint device are shown as cached regions to CPU and coherency is maintained at the hardware level.”

Does this conflict with https://devtalk.nvidia.com/default/topic/1045326/jetson-agx-xavier/does-xavier-support-coherent-dma-/post/5303979/#5303979?

You also said, “If you use dma_alloc_coherent() API, it returns both bus address (which can be given to endpoint to dump data into system memory) and CPU virtual address to let CPU access the same memory.”

I’ll remove the virt_to_phys() from the driver and test the use of bus address in our logic.

I’ve zipped up our driver and will send that separately.

Hello vidyas,

I have not been able to resolve the problem of running with the SMMU enabled. Removing virt_to_phys() and operating with the dma_handle returned by dma_alloc_coherent() is not causing faults, but I’m not receiving good data. I’m still digging through our buffer handling code for this.

Did you receive the driver code I attached to a private message?

I saw your comment in https://devtalk.nvidia.com/default/topic/1063662/jetson-tx2/iommu-unhandled-context-fault-/post/5386189/#5386189 and have been looking into the upstreamed drivers you mentioned.

The slow copy speed has been resolved. A change had to be made in the vm_page_prot setting. There is still a strong desire to resolve running with SMMU enabled.

Did you resolve running with the SMMU?