PCIe bus error (console information log)

The console shows the following PCIe bus error log.
What is the impact on the PCIe devices?

[ 102.779116] pcieport 0001:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 102.779118] pcieport 0001:00:00.0: device [10de:229e] error status/mask=00001000/0000e000
[ 102.779121] pcieport 0001:00:00.0: [12] Timeout
[ 102.779641] pcieport 0001:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 102.779643] pcieport 0001:00:00.0: device [10de:229e] error status/mask=00001000/0000e000
[ 102.779645] pcieport 0001:00:00.0: [12] Timeout
[ 102.780284] pcieport 0001:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 102.780285] pcieport 0001:00:00.0: device [10de:229e] error status/mask=00001000/0000e000
[ 102.780288] pcieport 0001:00:00.0: [12] Timeout
[ 102.786086] pcieport 0001:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 102.786088] pcieport 0001:00:00.0: device [10de:229e] error status/mask=00001000/0000e000
[ 102.786091] pcieport 0001:00:00.0: [12] Timeout
[ 102.934002] pcieport 0001:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 102.934005] pcieport 0001:00:00.0: device [10de:229e] error status/mask=00001000/0000e000
[ 102.934008] pcieport 0001:00:00.0: [12] Timeout
[ 145.776864] pcieport 0001:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 145.776868] pcieport 0001:00:00.0: device [10de:229e] error status/mask=00001000/0000e000
[ 145.776872] pcieport 0001:00:00.0: [12] Timeout
[ 145.786781] pcieport 0001:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 145.786784] pcieport 0001:00:00.0: device [10de:229e] error status/mask=00001000/0000e000
[ 145.786787] pcieport 0001:00:00.0: [12] Timeout
[ 145.874489] pcieport 0001:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 145.874492] pcieport 0001:00:00.0: device [10de:229e] error status/mask=00001000/0000e000
[ 145.874495] pcieport 0001:00:00.0: [12] Timeout
[ 209.767260] pcieport 0001:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 209.767263] pcieport 0001:00:00.0: device [10de:229e] error status/mask=00001000/0000e000
[ 209.767267] pcieport 0001:00:00.0: [12] Timeout
[ 209.816840] pcieport 0001:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 209.816843] pcieport 0001:00:00.0: device [10de:229e] error status/mask=00001000/0000e000
[ 209.816846] pcieport 0001:00:00.0: [12] Timeout
[ 304.786046] pcieport 0001:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 304.786051] pcieport 0001:00:00.0: device [10de:229e] error status/mask=00001000/0000e000

In theory the problem is corrected, but it isn’t really possible to know based on what is presented.

For background, many PCIe devices have an optional “Advanced Error Reporting” (AER) capability, which allows not only detecting and reporting various error types, but often recovering from them (a “Corrected” severity means the hardware recovered on its own, e.g., by replaying a packet). The nature of the problem depends on the specific error. Although the cause is often signal quality, it is also not unusual for it to be a software issue, e.g., a mismatched driver or an argument passed to the driver.
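As a rough aid for reading the "error status/mask=00001000/0000e000" lines, here is a minimal sketch that decodes such a hex value into the bit names of the AER Correctable Error Status register. The function name decode_aer_correctable is made up for illustration; the bit positions are the standard AER ones, and the kernel prints its own shortened names (bit 12 appears as "Timeout" in the log above).

```shell
#!/bin/sh
# Decode a PCIe AER Correctable Error Status (or Mask) value given as hex
# digits without a 0x prefix, e.g. 00001000 or 0000e000.
# decode_aer_correctable is a hypothetical helper name, not an existing tool.
decode_aer_correctable() {
    value=$(( 0x$1 ))
    for entry in \
        "0:Receiver Error" \
        "6:Bad TLP" \
        "7:Bad DLLP" \
        "8:REPLAY_NUM Rollover" \
        "12:Replay Timer Timeout" \
        "13:Advisory Non-Fatal Error" \
        "14:Corrected Internal Error" \
        "15:Header Log Overflow"
    do
        bit=${entry%%:*}    # text before the first colon: the bit number
        name=${entry#*:}    # text after the first colon: the bit name
        if [ $(( ($value >> $bit) & 1 )) -eq 1 ]; then
            printf '[%s] %s\n' "$bit" "$name"
        fi
    done
}
```

Calling decode_aer_correctable 00001000 prints the single set status bit (bit 12, the replay timer timeout), while decode_aer_correctable 0000e000 lists the three bits set in the mask. Bits set in the mask are the error types that are not being reported.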

Note that the particular PCIe device itself defines much of this. Yours is apparently at slot 0001:00:00.0. Normally one would use lspci to find out more information. Some information on this:

  • lspci is a brief view of all known PCIe devices. Jetsons don’t have a lot, but it might list PCIe bridges for example in addition to the device itself.
  • One has to use sudo to find the most verbose format of lspci.
  • To view only the slot your error message is about, and to simultaneously create a log file you can attach to the forum:
    sudo lspci -s '0001:00:00.0' -vvv 2>&1 | tee log_lspci.txt

With that you could see verbose information about the specific device, and then attach a copy to the forum. More information would probably be available then.

We would also need to know the exact model of Jetson. This includes whether there is a custom or third party carrier board involved, or if this is purely a developer’s kit. I suggest adding this information:

  • cat /etc/nv_boot_control.conf
  • head -n 1 /etc/nv_tegra_release
  • Have there been any device tree modifications, and if so what?
  • If this is a PCIe device you installed, add details of what the device is; if not, then state that you don’t have any optional PCIe hardware (including the m.2 slot).
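If it is easier, the two file-based items in the list above can be gathered into one attachable log. This is only a sketch: the function name collect_jetson_info, the default file name log_jetson_info.txt, and the optional root-prefix argument (useful for trying it against copied files) are all assumptions of this example, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Gather the boot control config and the first line of the L4T release file
# into a single log file.
#   $1: output file (default: log_jetson_info.txt, a name chosen for this sketch)
#   $2: root prefix (default: empty, meaning the running system)
collect_jetson_info() {
    out=${1:-log_jetson_info.txt}
    root=${2:-}
    {
        echo "== /etc/nv_boot_control.conf =="
        cat "$root/etc/nv_boot_control.conf"
        echo "== /etc/nv_tegra_release (first line) =="
        head -n 1 "$root/etc/nv_tegra_release"
    } > "$out" 2>&1
}
```

On the Jetson itself, running collect_jetson_info with no arguments writes log_jetson_info.txt in the current directory. The device tree and hardware questions still need a written answer.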

Dear Linuxdev

1: Jetson Orin Nano
2: The company developed its own board based on the Jetson Orin Nano line

3: nv_boot_control.conf
TNSPEC 3767-300-0003-P.1-1-1-jetson-orin-nano-devkit-
COMPATIBLE_SPEC 3767--0003--1--jetson-orin-nano-devkit-
TEGRA_BOOT_STORAGE nvme0n1
TEGRA_CHIPID 0x23
TEGRA_OTA_BOOT_DEVICE /dev/mtdblock0
TEGRA_OTA_GPT_DEVICE /dev/mtdblock0

4: head -n 1 /etc/nv_tegra_release
# R36 (release), REVISION: 4.3, GCID: 38968081, BOARD: generic, EABI: aarch64, DATE: Wed Jan 8 01:49:37 UTC 2025

5: There is no change to the PCIE settings. Only io_expansion is added to control peripheral power.

6: PCIe message
0001:00:00.0 PCI bridge: NVIDIA Corporation Device 229e (rev a1)
0001:01:00.0 Network controller: Realtek Semiconductor Co., Ltd. Device c852 (rev 01)
0004:00:00.0 PCI bridge: NVIDIA Corporation Device 229c (rev a1)
0004:01:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. Device 2269 (rev 03)
0008:00:00.0 PCI bridge: NVIDIA Corporation Device 229c (rev a1)
0008:01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

sudo lspci -s '0001:00:00.0' -vvv 2>&1 | tee log_lspci.txt
log_lspci_s.txt (5.1 KB)

Since the errors occur randomly, there is no PCIe problem right now. We have to wait for it to happen again before capturing the relevant logs.

The information you’ve added is good to have:

  • You’ve booted to an external device (nvme0n1p1) using a mainline kernel (L4T R36.x uses mainline).
  • The carrier board is flashed as a dev kit.
    • Can you verify that the hardware itself is in fact truly a developer’s kit (probably it is, but this needs to be asked)? Sometimes third party carrier boards end up flashed with the NVIDIA developer’s kit software, which works, but during debugging it is important to know whether the carrier board is in fact what the software was designed to work with.
  • The device at slot ‘0001:00:00.0’ is one of NVIDIA’s own devices. It is a PCIe bridge. This means that the bridge and the device attached to it need to be considered together. For that it would be useful to have a tree view of lspci:
    • lspci -tv
    • With logging for a file you can attach to the forum:
      lspci -tv 2>&1 | tee log_pci_tree.txt

You can provide the tree view of lspci now, but we will need the verbose lspci on that specific slot after some errors have occurred. Assuming the dmesg logs show the same PCIe slot (the bridge) of ‘0001:00:00.0’, then whenever you find the next error:
sudo lspci -s '0001:00:00.0' -vvv 2>&1 | tee log_pcie_error.txt
(then attach log_pcie_error.txt)
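In case the error shows up on a slot other than the bridge, the following sketch pulls the reporting slots out of the kernel log rather than assuming ‘0001:00:00.0’. The helper names aer_error_slots and capture_aer_slots and the per-slot log file names are hypothetical; the pattern only assumes the "<driver> <slot>: PCIe Bus Error" line format seen in the log above.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print each distinct slot
# (domain:bus:device.function) that reported a "PCIe Bus Error".
aer_error_slots() {
    sed -n 's/.* \([0-9a-fA-F]\{4\}:[0-9a-fA-F]\{2\}:[0-9a-fA-F]\{2\}\.[0-7]\): PCIe Bus Error.*/\1/p' | sort -u
}

# Hypothetical wrapper: for every slot with an error in the current kernel log,
# save the verbose lspci output to its own file.
# Run as root (e.g. from a root shell) for the full -vvv detail.
capture_aer_slots() {
    for slot in $(dmesg | aer_error_slots); do
        lspci -s "$slot" -vvv 2>&1 | tee "log_pcie_error_$slot.txt"
    done
}
```

Run capture_aer_slots once after the next error appears, then attach the resulting log files.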

If you post the tree view of lspci now, then we can figure out what slots the bridge might be serving. When the error occurs on the bridge it is possible that we might be interested in knowing what device that bridge serves and getting a verbose lspci on the device being served even if that device is not itself showing an error. PCIe devices do often have sub-devices though, and so the tree view slot naming might need an explanation when describing what the slot is that the bridge serves. We can get that knowledge out ahead of time and then see if there are downstream errors as well as bridge errors.

If it turns out that the device being served by the bridge is the NVMe, then we might ask more questions about the NVMe, but don’t bother for now.
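As a rough first pass before the tree view is available, the plain lspci listing can be filtered by PCI domain. This sketch assumes that each root port on this platform sits in its own domain, which the posted listing suggests but does not prove; the function name devices_in_same_domain is made up for illustration, and the tree view remains the authoritative check.

```shell
#!/bin/sh
# Read a plain lspci listing on stdin (slots shown with their domain, as in
# "lspci -D") and print the lines whose slot shares the domain of the given
# slot, e.g. 0001:00:00.0.
# Assumption: one root port per PCI domain, so a shared domain implies the
# device sits behind that bridge.
devices_in_same_domain() {
    domain=${1%%:*}    # keep only the domain part before the first colon
    awk -v d="$domain" 'index($1, d ":") == 1'
}
```

Usage would be lspci -D | devices_in_same_domain 0001:00:00.0. Applied to the listing posted earlier, this returns the bridge itself plus the Realtek network controller at 0001:01:00.0, which hints that the bridge serves that device rather than the NVMe.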
