[ 5517.386309] tegra30_mc_handle_irq: 238542 callbacks suppressed
[ 5517.386330] tegra-mc 2c00000.memory-controller: pcie5w: write @0x0000000000000000

Hi,

First, I modified the device tree:

image

Next, I modified the kernel configuration, adding:
CONFIG_ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT=n

Then I reflashed the system, and it booted successfully.

Finally, I added the following boot arguments:

  APPEND ${cbootargs} root=/dev/nvme0n1p1 rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200 console=ttyAMA0,115200 firmware_class.path=/etc/firmware fbcon=map:0 video=efifb:off console=tty0 pci=noaer pcie_aspm=off arm-smmu.disable=1 coherent_pool=512M default_hugepagesz=1G hugepagesz=1G hugepages=4 cma=1024M

I rebooted the device and loaded the XDMA driver.
While reading and writing data, I encountered an unexpected log; the kernel reports:

[ 222.595573] xdma:xdma_mod_init: Xilinx XDMA Reference Driver xdma v2020.2.2
[ 222.595576] xdma:xdma_mod_init: desc_blen_max: 0xfffffff/268435455, timeout: h2c 10 c2h 10 sec.
[ 222.595758] xdma:xdma_device_open: xdma device 0005:01:00.0, 0x00000000d33dfea5.
[ 222.596027] xdma 0005:01:00.0: enabling device (0000 → 0002)
[ 222.596246] xdma:map_single_bar: BAR0 at 0x2b28000000 mapped at 0x0000000013d44135, length=1048576(/1048576)
[ 222.596257] xdma:map_single_bar: BAR1 at 0x2b28100000 mapped at 0x0000000095878243, length=65536(/65536)
[ 222.596262] xdma:map_bars: config bar 1, pos 1.
[ 222.596263] xdma:identify_bars: 2 BARs: config 1, user 0, bypass -1.
[ 222.597628] xdma:xdma_thread_add_work: 0-H2C0-ST 0x00000000a6048c3f assigned to cmpl status thread cmpl_status_th1,1.
[ 222.606735] xdma:xdma_thread_add_work: 0-H2C1-ST 0x00000000d2e674fd assigned to cmpl status thread cmpl_status_th2,1.
[ 222.606934] xdma:xdma_thread_add_work: 0-C2H0-ST 0x00000000715ca1d1 assigned to cmpl status thread cmpl_status_th3,1.
[ 222.607063] xdma:xdma_thread_add_work: 0-C2H1-ST 0x000000002e9fab35 assigned to cmpl status thread cmpl_status_th0,1.
[ 222.607417] xdma:pci_keep_intx_enabled: 0005:01:00.0: clear INTX_DISABLE, 0x406 → 0x6.
[ 222.607465] xdma:probe_one: 0005:01:00.0 xdma0, pdev 0x00000000d33dfea5, xdev 0x0000000029df4b40, 0x00000000ff4404dc, usr 16, ch 2,2.
[ 222.608581] xdma:cdev_xvc_init: xcdev 0x0000000048939698, bar 0, offset 0x40000.
[ 328.279010] tegra-mc 2c00000.memory-controller: pcie5w: write @0x0000000000000000: EMEM address decode error (EMEM decode error)
[ 328.279134] tegra-mc 2c00000.memory-controller: pcie5w: write @0x0000000000000000: EMEM address decode error (EMEM decode error)
[ 328.279267] tegra-mc 2c00000.memory-controller: pcie5w: write @0x0000000000000000: EMEM address decode error (EMEM decode error)
[ 328.279280] tegra-mc 2c00000.memory-controller: pcie5w: write @0x0000000000000000: EMEM address decode error (EMEM decode error)
[ 328.279405] tegra-mc 2c00000.memory-controller: pcie5w: write @0x0000000000000000: EMEM address decode error (EMEM decode error)
[ 328.279535] tegra-mc 2c00000.memory-controller: pcie5w: write @0x0000000000000000: EMEM address decode error (EMEM decode error)
[ 328.279668] tegra-mc 2c00000.memory-controller: pcie5w: write @0x0000000000000000: EMEM address decode error (EMEM decode error)
[ 328.279800] tegra-mc 2c00000.memory-controller: pcie5w: write @0x0000000000000000: EMEM address decode error (EMEM decode error)
[ 328.279811] tegra-mc 2c00000.memory-controller: pcie5w: write @0x0000000000000000: EMEM address decode error (EMEM decode error)
[ 328.279867] tegra-mc 2c00000.memory-controller: pcie5w: write @0x0000000000000000: EMEM address decode error (EMEM decode error)
[ 375.279025] tegra30_mc_handle_irq: 4272 callbacks suppressed

What is causing this problem, and how can I resolve it?

Hi cct8263145,

Are you using the devkit or custom board for AGX Orin?
What’s the Jetpack version in use?

Those tegra-mc ... pcie5w ... EMEM address decode error logs almost always mean the PCIe endpoint is doing DMA to an invalid or unmapped IOVA (often 0x0) or into a protected VPR region, not an SoC MC bug.
Please check whether the following items help in your case:

  1. restore the default SMMU settings for that PCIe node in the device tree (don’t delete iommus / iommu-map)
  2. make sure your PCIe driver only uses DMA buffers from the Linux DMA API and never programs IOVA 0x0 or other unmapped addresses

Hi,

I’m using the official Jetson AGX Orin 64GB Developer Kit running JetPack 6.2.1.

The PCIe endpoint is a custom FPGA card based on XDMA.

Currently, the data path is intentionally designed to bypass the Linux DMA mapping layer:

• A 1GB hugepage is allocated on the host.
• The hugepage provides a physically contiguous memory region.
• From userspace, I obtain the physical address of this hugepage buffer.
• This physical address is sent to the FPGA through a command/control channel.
• On the FPGA side, XDMA operates in bypass mode, directly issuing PCIe Memory Read/Write TLPs using this host physical address.

The purpose of this design is to minimize latency and software overhead, avoid extra copies or DMA mapping operations, and allow the FPGA to directly access system DRAM for maximum PCIe throughput.

However, after running this configuration, the system reports:

tegra-mc … EMEM address decode error

Based on previous discussions, this appears to indicate that the PCIe endpoint is issuing DMA transactions targeting an invalid or untranslated address.

My understanding is that Jetson AGX Orin enables SMMU/IOMMU by default, and PCIe DMA addresses may be interpreted as IOVA rather than physical addresses.

My questions are:

  1. Does bypassing the Linux DMA API and issuing DMA directly to host physical addresses conflict with the default SMMU/IOMMU configuration on Jetson AGX Orin?

  2. Is direct physical-address DMA from userspace officially supported on Orin platforms?

  3. If supported, what is the correct configuration or workflow to safely implement this model?

  4. If not supported, what is the recommended method to achieve near-zero-copy, low-latency, high-throughput DMA between a PCIe FPGA endpoint and system memory?

My goal is to keep CPU usage and latency as low as possible while sustaining maximum PCIe bandwidth.

Any guidance or recommended best practices for FPGA/XDMA integration on Orin platforms would be greatly appreciated.

Best regards,

On Orin you cannot safely do PCIe DMA to “CPU physical addresses” from userspace while SMMU is enabled. The SMMU expects IOVA programmed by the kernel, not raw PAs, so your current XDMA‑bypass design will keep hitting EMEM address decode error.

We do not support a model where userspace grabs a physical address and tells the FPGA to DMA to it directly. The supported way to get low‑latency, high‑bandwidth DMA is:

  • write a small kernel driver for your FPGA
  • let the driver allocate / map a big buffer with the Linux DMA API
  • program the returned DMA address (IOVA) into the FPGA
  • optionally mmap that buffer into userspace if you need CPU access

You can disable SMMU in DT to make raw PAs “work”, but this is not recommended or supported for production because it removes memory protection and any buggy/compromised FPGA can corrupt all system RAM.
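The driver side of the recommended model comes down to a few DMA API calls. A rough kernel-space sketch of the allocation step, assuming a PCI driver that already has its struct pci_dev (the fpga_buf structure and function names here are made up for illustration):

```c
#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* Hypothetical per-device state for the FPGA driver. */
struct fpga_buf {
	void       *cpu_addr;   /* kernel virtual address for CPU access   */
	dma_addr_t  dma_addr;   /* IOVA to program into the FPGA registers */
	size_t      size;
};

static int fpga_alloc_dma_buffer(struct pci_dev *pdev, struct fpga_buf *b,
				 size_t size)
{
	int ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
	if (ret)
		return ret;

	/* The DMA API sets up the SMMU mapping here; b->dma_addr is an
	 * IOVA, not a CPU physical address. */
	b->cpu_addr = dma_alloc_coherent(&pdev->dev, size, &b->dma_addr,
					 GFP_KERNEL);
	if (!b->cpu_addr)
		return -ENOMEM;
	b->size = size;

	/* Program b->dma_addr (NOT a raw PA) into the FPGA's DMA engine,
	 * and optionally expose b->cpu_addr to userspace via mmap(). */
	return 0;
}
```

The key point is that whatever value ends up in the FPGA's address registers must come out of the DMA API, so it is valid on the device side of the SMMU.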

Hi,

Thanks for the clarification.

I would like to further confirm a few technical details regarding the SMMU/IOMMU behavior on Jetson AGX Orin.

On my system, I tried the following configurations in order to bypass the IOMMU so that the FPGA could use host physical addresses directly:

• In the PCIe device-tree node, I temporarily removed the following properties:

  • iommu-map
  • iommu-map-mask
  • iommus

• Kernel configuration:

  • CONFIG_ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT = n

My questions are:

  1. With the above configuration, is the IOMMU expected to be effectively bypassed on Jetson AGX Orin, allowing a PCIe endpoint device to access host physical addresses directly?

  2. If this configuration should work, what could still cause the following error during PCIe DMA access?

tegra-mc ... pcie5w ... EMEM address decode error

Is there a recommended way to eliminate or avoid this error?

  3. If bypassing the IOMMU in this way is not supported or not effective on Orin, what is the correct method to configure IOMMU passthrough for a PCIe endpoint device?

The reason I am exploring this approach is that my FPGA design expects to perform DMA using host physical addresses directly (XDMA bypass mode), in order to minimize software overhead and maximize PCIe throughput.

Any guidance on the recommended architecture for this use case would be greatly appreciated.

Thanks.

On Orin, removing iommu-map* / iommus from the PCIe node plus CONFIG_ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT=n does effectively bypass SMMU translation for that controller, but this is not a supported / recommended configuration on Jetson.

Even in that mode, the memory controller still enforces DRAM range and carveout protection. tegra-mc … pcie5w … EMEM address decode error means your FPGA is still DMA‑ing to an address that is not a valid non‑secure DRAM location (or is inside a protected region like VPR), so the MC correctly rejects it.

NVIDIA does not support a “userspace gets a PA and FPGA DMAs to it directly” model on Orin. The recommended way is to keep SMMU enabled and:

  • write a small kernel driver,
  • allocate/pin a large buffer via the Linux DMA API,
  • program the returned DMA address (IOVA) into the FPGA, and
  • mmap that buffer to userspace if needed.

If you continue to bypass SMMU, you must manually ensure all FPGA DMA targets are within valid DRAM and outside any carveouts, and accept the security/safety risks of giving the endpoint unrestricted access to system memory.

Hi,

Thanks for the explanation.

At the moment I am bypassing the SMMU using the following configuration:

• In the PCIe device-tree node, I temporarily removed the following properties:

  • iommu-map

  • iommu-map-mask

  • iommus

• Kernel configuration:

  • CONFIG_ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT = n

With this setup, my application is able to run normally.
The FPGA can perform DMA to host memory, and both transmit and receive data appear to be correct during testing.

However, I still observe repeated messages such as:

tegra-mc ... pcie5w ... EMEM address decode error

Based on your explanation, this suggests that the memory controller rejected a DMA access because the target address is not considered a valid non-secure DRAM location or falls within a protected region.

What I would like to understand is:

If the memory controller is rejecting those accesses, why do the DMA transfers still appear to work correctly from the application perspective?

Does this mean that only some DMA transactions are rejected (for example accesses outside the valid DRAM range), while the actual data buffer accesses remain valid?

Any clarification on how the memory controller reports and handles such cases would be very helpful.

Thanks.

Yes, only some of your FPGA DMA transactions are being rejected.

The tegra-mc ... pcie5w ... EMEM address decode error lines mean the FPGA sometimes issues a PCIe read/write to an address the MC considers invalid (outside DRAM or inside a protected region). Those specific transactions are dropped, but the rest of the DMA traffic — the accesses that hit your real buffer — still go through, so your app “looks” correct.

In practice this usually comes from things like prefetch/stride past the end of the buffer, or a small offset bug in the address generator. To get rid of the errors you either need to (a) fix the FPGA so no access ever goes outside the valid buffer range, or (b) move to the supported model with SMMU enabled and a kernel driver using the Linux DMA API, so the device only uses a properly mapped IOVA range.

Hi,

Thanks for the explanation.

While reviewing the logs more carefully, I noticed that the reported fault address is always 0x0. For example:

tegra-mc 2c00000.memory-controller: pcie5w: write @0x0000000000000000: EMEM address decode error (EMEM decode error)

However, in our FPGA design we do not intentionally generate any DMA transactions targeting address 0x0.

In our current setup, the FPGA receives the host buffer physical address from userspace (hugepage buffer) and performs DMA only within that buffer range. The buffer base address is non-zero and the DMA engine should not issue accesses to address 0x0.

Because of this, I am trying to understand why the MC reports the fault address as 0x0.

Is it possible that:

  • the reported address 0x0 is a placeholder when the MC cannot decode the actual address, or

  • the error is triggered by some speculative/prefetch access from the PCIe DMA engine?

Or does this strictly mean that a PCIe transaction with address 0x0 was actually observed on the bus?

Any clarification on how the memory controller reports the fault address in this situation would be very helpful.

Thanks.

On Orin this log is not a placeholder – if MC prints

pcie5w: write @0x0000000000000000: EMEM address decode error

it means it really saw a PCIe write with address 0x0 (or effectively 0 after masking). So even if your main data buffer is non‑zero, some path in the XDMA/FPGA (e.g. an uninitialized descriptor, doorbell pointer, or a prefetch that runs past the buffer) is occasionally using 0 as the target. Those writes to 0 are blocked by MC, so your main DMA still looks fine, but the log is telling you there is a corner case in your address generation you need to fix.