We are testing a simple FPGA interfaced with an Orin AGX. The FPGA writes data over DMA, and the buffer needs to be checked periodically (at 1 kHz) to see if new data has arrived. The DMA engine in the FPGA writes mostly 4-byte words. Unfortunately, this data is written somewhat out of order: the addresses it emits follow a sequence like 1 2 3 0 5 6 4 8 9 7, with the DMA engine backing up a little bit every few words.
We have a lightweight kernel module whose main job is to let a userspace driver interact with this device. The kernel module allocates and owns the DMA area, but the userspace driver actually initiates DMA operations (via writes to device registers) and examines the DMA buffer for data.
When this driver was originally tested on an Intel host, the DMA buffer was set up as completely uncached. We tried to replicate that setup on Orin, and were seeing some odd results. Here’s what happens:
1. The kernel module allocates some memory using dma_alloc_coherent. (I also tried dma_alloc_noncoherent and dma_alloc_attrs.)
2. Userspace calls mmap on a character device, and the kernel module maps the DMA buffer into userspace. We’ve tried a few approaches:
2.1. vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot) followed by dma_mmap_coherent
2.2. vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot) followed by remap_pfn_range
3. Userspace writes 0xFF over the whole DMA buffer.
4. Userspace triggers a DMA write by the device, and looks at the DMA buffer.
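For reference, here is a rough userspace sketch of the last two steps above (filling the buffer with 0xFF, then inspecting it after the DMA write). It is not our actual driver code: the function names and the FILL_WORD sentinel are made up for illustration, and it assumes real data never equals 0xFFFFFFFF. Because the device writes words out of order, it only counts the contiguous run of words from the start that no longer hold the fill pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sentinel: the 0xFF fill pattern viewed as one 4-byte word.
 * Assumption: the device never writes 0xFFFFFFFF as real data. */
#define FILL_WORD 0xFFFFFFFFu

/* Overwrite the whole buffer with 0xFF before triggering a DMA write. */
void prefill_buffer(uint32_t *buf, size_t n_words)
{
    assert(buf != NULL);
    memset(buf, 0xFF, n_words * sizeof(uint32_t));
}

/* Count the words, starting at index 0, that no longer hold the sentinel.
 * The scan stops at the first untouched word, so words written ahead of a
 * gap (e.g. 1 2 3 before 0) are not counted until the gap is filled.
 * The volatile qualifier only forces a fresh load on every poll; it is
 * not a memory barrier and says nothing about device-side ordering. */
size_t contiguous_words_arrived(const volatile uint32_t *buf, size_t n_words)
{
    assert(buf != NULL);
    size_t i = 0;
    while (i < n_words && buf[i] != FILL_WORD)
        i++;
    return i;
}
```

With the write order 1 2 3 0, this reports 0 words until word 0 lands and then jumps to 4, which is the behaviour we want from the 1 kHz poll.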
In this setup, we saw what seemed like zero-padding out to 64-byte boundaries in the DMA buffer. In other words, where we’d expect 24 bytes of data followed by 0xFF, we would see 24 bytes of data followed by 40 zero bytes.
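To make that check concrete, here is a small sketch of how the tail after the data could be classified. The names (BOUNDARY_BYTES, tail_length, classify_tail) are hypothetical, and the 64-byte boundary is simply the value we observed, not something taken from documentation. For 24 bytes of data the tail is 64 - 24 = 40 bytes, which should still be 0xFF if nothing else touched it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Observed padding boundary; an assumption based on what we saw. */
#define BOUNDARY_BYTES 64u

enum tail_kind { TAIL_EMPTY, TAIL_FILL, TAIL_ZERO, TAIL_MIXED };

/* Bytes between the end of the data and the next 64-byte boundary. */
size_t tail_length(size_t data_len)
{
    size_t rem = data_len % BOUNDARY_BYTES;
    return rem == 0 ? 0 : BOUNDARY_BYTES - rem;
}

/* Classify the bytes after the data, up to the next boundary:
 * all 0xFF (expected), all 0x00 (the padding we saw), or a mix. */
enum tail_kind classify_tail(const uint8_t *buf, size_t buf_len, size_t data_len)
{
    assert(buf != NULL && data_len <= buf_len);

    size_t end = data_len + tail_length(data_len);
    if (end > buf_len)
        end = buf_len;          /* clamp to the end of the buffer */
    if (end == data_len)
        return TAIL_EMPTY;      /* data already ends on a boundary */

    size_t zeros = 0, fills = 0;
    for (size_t i = data_len; i < end; i++) {
        if (buf[i] == 0x00)
            zeros++;
        else if (buf[i] == 0xFF)
            fills++;
    }

    size_t n = end - data_len;
    if (fills == n)
        return TAIL_FILL;
    if (zeros == n)
        return TAIL_ZERO;
    return TAIL_MIXED;
}
```

In the uncached setup described above, this would return TAIL_ZERO for a 24-byte transfer, where we expected TAIL_FILL.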
I did some reading, which suggested that the SMMU largely removes any need for using uncached memory and cache management. And, indeed, removing the pgprot_noncached bit and just using dma_alloc_coherent and dma_mmap_coherent did seem to get rid of the 0 padding we were seeing.
But that’s basically the diametric opposite of what we had in the Intel host setup – and I still don’t quite understand what level of coherence I should be expecting when looking at the DMA buffer from userspace. For example, is any reordering possible?
I was reading the SMMU docs here (Documentation – Arm Developer), but they say a lot of things are “IMPLEMENTATION DEFINED”.
I would appreciate any additional documentation or explanation you can provide.