DMA buffer cache coherence with SMMU

We are testing a simple FPGA interfaced with an Orin AGX. The FPGA writes data over DMA, and a userspace process polls the buffer at 1 kHz to see whether new data has arrived. The DMA engine in the FPGA mostly writes 4-byte words. Unfortunately, the words are written somewhat out of order: the address sequence looks like 1 2 3 0 5 6 4 8 9 7, with the DMA engine backing up a little every few words.

We have a lightweight kernel module that mostly just lets a userspace driver interact with this device. The kernel module allocates and owns the DMA area, but the userspace driver actually initiates DMA operations (via writes to device registers) and examines the DMA buffer for data.

When this driver was originally tested on an Intel host, the DMA buffer was set up as completely uncached. We tried to replicate that setup on the Orin and saw some odd results. Here’s what happens:

  1. The kernel module allocates some memory using dma_alloc_coherent. (I also tried dma_alloc_noncoherent and dma_alloc_attrs).

  2. Userspace calls mmap on a character device, and the kernel module maps the DMA buffer into userspace (a sketch of this handler follows the list). We’ve tried a few approaches:

2.1. vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot) followed by dma_mmap_coherent
2.2. vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot) followed by remap_pfn_range

  3. Userspace writes 0xFF over the whole DMA buffer.

  4. Userspace triggers a DMA write by the device and looks at the DMA buffer.
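
For concreteness, here is a minimal sketch of the mmap handler from step 2.1. The structure and names (my_dev, my_mmap, dma_vaddr, dma_handle) are hypothetical stand-ins for our actual driver:

```c
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

struct my_dev {
	struct device *dev;    /* underlying platform/PCI device */
	void *dma_vaddr;       /* CPU address from dma_alloc_coherent() */
	dma_addr_t dma_handle; /* bus/IOVA address programmed into the FPGA */
};

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct my_dev *mydev = filp->private_data; /* set in open() */

	/* Variant 2.1: force an uncached userspace mapping, then let the
	 * DMA API insert the PTEs for the coherent buffer. Variant 2.2
	 * is the same except it ends with remap_pfn_range() instead. */
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	return dma_mmap_coherent(mydev->dev, vma, mydev->dma_vaddr,
				 mydev->dma_handle,
				 vma->vm_end - vma->vm_start);
}
```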

In this setup, we saw what looked like zero-padding out to 64-byte boundaries in the DMA buffer. In other words, where we’d expect 24 bytes of data followed by 0xFF, we instead saw 24 bytes of data followed by 40 zero bytes. (64 bytes happens to be the CPU cache-line size, which made us suspect some cache-line-granularity interaction.)

I did some reading which suggested that the SMMU largely removes any need for uncached memory and explicit cache management. And indeed, removing the pgprot_noncached() call and just using dma_alloc_coherent and dma_mmap_coherent did seem to get rid of the zero padding we were seeing.
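
Concretely, the working version is just the earlier sketch minus the pgprot_noncached() line (same hypothetical names):

```c
static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct my_dev *mydev = filp->private_data;

	/* No pgprot_noncached(): dma_mmap_coherent() applies whatever
	 * page attributes the DMA layer chose for the coherent buffer. */
	return dma_mmap_coherent(mydev->dev, vma, mydev->dma_vaddr,
				 mydev->dma_handle,
				 vma->vm_end - vma->vm_start);
}
```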

But that’s basically the diametric opposite of what we had in the Intel host setup, and I still don’t quite understand what level of coherence I should expect when looking at the DMA buffer from userspace. For example, is any reordering possible?
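
To make the reordering question concrete, here is the kind of userspace polling loop we have in mind. It assumes, purely hypothetically, a protocol where the FPGA writes a sequence word after the payload; since coherence alone doesn’t stop the compiler or CPU from reordering loads, the sequence read is paired with an acquire fence:

```c
#include <stdatomic.h>
#include <stdint.h>

struct record {
	uint32_t payload[6]; /* 24 bytes of data */
	uint32_t seq;        /* hypothetical: written last by the FPGA */
};

/* Returns 1 and copies out the payload once record 'expect' has arrived;
 * 'r' points into the buffer returned by mmap(). */
static int poll_record(const struct record *r, uint32_t expect,
		       uint32_t out[6])
{
	/* Acquire load: the payload reads below cannot be hoisted above
	 * this load by either the compiler or the CPU. */
	if (__atomic_load_n(&r->seq, __ATOMIC_ACQUIRE) != expect)
		return 0; /* no new data yet */

	for (int i = 0; i < 6; i++)
		out[i] = r->payload[i];
	return 1;
}
```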

I was reading the Arm SMMU docs (Documentation – Arm Developer), but they say a lot of things are “IMPLEMENTATION DEFINED”.

I would appreciate any additional documentation or explanation you can provide.

Yes, page-table programming that sets the coherency attribute is sufficient to ensure coherency between the CPU and the device.

That’s why no explicit DMA cache-sync APIs are required when you use the coherent flavor of allocation and mapping described above.
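
For contrast, here is a sketch of the non-coherent (streaming-style) flavor, where explicit sync calls are required around each device access; the names are placeholders:

```c
#include <linux/dma-mapping.h>

/* Allocate a buffer the CPU may map cacheable; explicit ownership
 * transfers between CPU and device are then required. */
static void *alloc_streaming(struct device *dev, size_t size,
			     dma_addr_t *handle)
{
	return dma_alloc_noncoherent(dev, size, handle,
				     DMA_FROM_DEVICE, GFP_KERNEL);
}

static void check_for_data(struct device *dev, void *vaddr,
			   dma_addr_t handle, size_t size)
{
	/* Invalidate stale cache lines before the CPU inspects data the
	 * device may have written. */
	dma_sync_single_for_cpu(dev, handle, size, DMA_FROM_DEVICE);

	/* ... examine the buffer at vaddr here ... */

	/* Hand ownership back to the device for the next DMA write. */
	dma_sync_single_for_device(dev, handle, size, DMA_FROM_DEVICE);
}
```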

If you compare the page-table attributes between the Intel and Arm setups, you should be able to figure out which bits are set in each page table.

Also, in your device’s DT entry you must set the “dma-coherent” property to indicate that the device is capable of coherent DMA operation.
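
As a sketch, the property is just a boolean flag on the device’s node; the compatible string and addresses below are made up:

```
fpga@90000000 {
	compatible = "acme,fpga-dma";      /* hypothetical */
	reg = <0x0 0x90000000 0x0 0x1000>; /* hypothetical */
	dma-coherent;
};
```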

Thanks for the prompt reply. The DT “dma-coherent” entry you’re referring to: is that the one in the PCIe properties for that root port (RP)?
