Bug Report: L4T 5.10 Kernel: DMA_API_DEBUG=y produces stack traces

I have enclosed the full kernel log which contains the stack traces.

  1. DMA-API: arm-smmu 12000000.iommu: device driver tries to sync DMA memory it has not allocated [device address=0x00000007fdcad000] [size=8 bytes]

  2. DMA-API: exceeded 7 overlapping mappings of cacheline 0x0000000001800000

I have recompiled the L4T 5.10 kernel both with and without the RT patches applied; the stack traces appear in both cases. The first stack trace is alarming because 8 bytes of garbage are (possibly) being written to a device. You can reproduce these errors by adding the following to the 5.10 kernel config:

  1. CONFIG_DMA_API_DEBUG=y
  2. CONFIG_DMA_API_DEBUG_SG=y
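
To confirm that a given build actually has both options enabled, something like the following could be used. This is only a minimal sketch: the helper names (`enabled_options`, `missing_options`, `read_running_config`) are made up for illustration, and reading `/proc/config.gz` assumes the kernel was built with CONFIG_IKCONFIG_PROC.

```python
import gzip
import re

# The two options named in this report.
REQUIRED = ("CONFIG_DMA_API_DEBUG", "CONFIG_DMA_API_DEBUG_SG")


def enabled_options(config_text):
    """Return the set of options set to 'y' in kernel config text."""
    enabled = set()
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=y$", line.strip())
        if match:
            enabled.add(match.group(1))
    return enabled


def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to 'y', in order."""
    enabled = enabled_options(config_text)
    return [opt for opt in required if opt not in enabled]


def read_running_config(path="/proc/config.gz"):
    """Read the running kernel's config (assumes CONFIG_IKCONFIG_PROC)."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```

On the target, `missing_options(read_running_config())` should return an empty list once both options are built in.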

L4T kernel commit 150754e3de53c81a0be6aab613700bd15414790c “iommu: arm-smmu: io-pagetable: Add dma_sync API” appears to be related to stack trace #1. The relevant kernel source can be obtained as follows:

git clone https://nv-tegra.nvidia.com/r/linux-5.10
git checkout origin/l4t/l4t-r35.1.ga-5.10

I’m concerned these DMA errors may indicate some instability in the system that may be exacerbated by the RT patchset.

l4t-dmesg.txt (77.0 KB)

I cannot debug this, but wanted to add some comments…

Is this a dev kit? Or custom board?

You might also want to post a copy of “/proc/config.gz” to show the entire config. Also, if you changed something to “=y” in the config, it is recommended that you change “CONFIG_LOCALVERSION” so that it is no longer “-tegra” (perhaps something like “-testing”), and then build and install all modules fresh. If that is done, the stack trace probably means more (although no garbage should ever be written).

Also, I don’t see garbage, but I do see a spurious interrupt. A spurious interrupt is different from garbage data, and it is important to know whether there really is garbage data, and if so, where and how you see it.

This is an Orin AGX dev kit. I’ve included the config file.

I needed to compile the kernel and install new modules because I applied the RT patchset; this includes the out-of-tree overlay drivers. I apologize; I should have changed the LOCALVERSION. Note that the stack dumps occur with and without the RT patches applied.

During the initialization of the Display Controller Engine (DCE), a call to dma_alloc_coherent is performed. While the initial memory is being allocated, NVIDIA's changes to the arm-smmu driver attempt to sync all the iommu and pagetable entries.

The warning states that the driver is attempting to sync 8 bytes of DMA memory that it has not allocated. Writing 8 bytes to an unallocated region could be harmless if no one owns that region, or a disaster if something else is using it.
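
To make the nature of the check concrete, here is a toy model of the bookkeeping behind this kind of warning. It is a simplified sketch, not the kernel's implementation: the class `DmaDebugModel` and its methods are hypothetical, and the only idea it captures is that a sync is flagged when its range does not fall inside a mapping previously recorded for the same device.

```python
class DmaDebugModel:
    """Toy model: record mappings and check that each sync lies within one."""

    def __init__(self):
        self._mappings = []  # list of (device, start, size) tuples

    def map(self, device, start, size):
        """Record that `device` owns [start, start + size)."""
        self._mappings.append((device, start, size))

    def unmap(self, device, start, size):
        """Forget a previously recorded mapping."""
        self._mappings.remove((device, start, size))

    def check_sync(self, device, start, size):
        """Return None if the range is covered, else a warning string."""
        for dev, m_start, m_size in self._mappings:
            covered = m_start <= start and start + size <= m_start + m_size
            if dev == device and covered:
                return None
        return (f"{device}: device driver tries to sync DMA memory it has not "
                f"allocated [device address=0x{start:016x}] [size={size} bytes]")
```

In this model, a sync issued on behalf of a device that never recorded the address produces the warning even if some other device did map that memory, which would be one way for a mismatch like the one above to arise. Whether that is what happens here is not established.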

No other driver produces this error, yet many perform calls to dma_alloc_coherent. The issue may lie not with the downstream NVIDIA changes but with this driver allocating memory out of order.

5.10.104.rt63-config.txt (219.3 KB)

NVIDIA will have to answer this one. There are times when a write must cover an entire line (aligned to some boundary), and it might write parts of the line that were not allocated, but it sounds suspicious to write 8 bytes that were not allocated.
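
As a rough illustration of the line-alignment point, the sketch below computes the aligned span that a small region would occupy. The 64-byte line size is an assumption made for the example, not a value taken from the log, and the function names are hypothetical.

```python
def cacheline_span(address, size, line_size=64):
    """Return (start, end) of the line-aligned span covering the region.

    line_size must be a power of two; 64 is assumed here.
    """
    mask = ~(line_size - 1)
    start = address & mask
    end = (address + size + line_size - 1) & mask
    return start, end


def bytes_outside_region(address, size, line_size=64):
    """Count bytes in the aligned span that are not part of the region."""
    start, end = cacheline_span(address, size, line_size)
    return (end - start) - size
```

With the address from the first warning, `cacheline_span(0x7fdcad000, 8)` gives `(0x7fdcad000, 0x7fdcad040)`, so under the 64-byte assumption a whole-line operation would touch 56 bytes beyond the 8 that were named.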

Hi,
We can observe the issue. It is under investigation. We will update once there are further findings.

Thank you for investigating this issue.

I have found several additional issues (and fixes) that I intend to report. Is this forum the best place to post additional bug reports, or is there a more direct method of communication with the L4T kernel developers?

Again, thank you for the time. These fixes will allow me to run and debug real-time applications on the AGX Orin.