PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0010(Receiver ID)

Hi, I am getting a PCIe bus error. I am using a custom carrier board with a 500GB SSD and an Orin NX SOM.

Whenever I access the Orin NX over SSH and check dmesg, the PCIe bus error shows up in the log.

It is not possible to give you an answer from just this one-line comment.

A bus error could come from hardware or from any number of other causes. If you want to find the root cause, dump the PCIe LA (logic analyzer) trace using a PCIe analyzer.

Also, I suggest these logs (shown with how to create a log file from each command):

  • sudo lspci -vvv 2>&1 | tee log_lspci_verbose.txt
  • lspci -t -v 2>&1 | tee log_lspci_tree.txt
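If you want everything in one place, the two commands above (plus a dmesg capture) can be wrapped into one small script; the directory name and the `collect` helper are my own additions, not part of the suggestion:

```shell
#!/bin/sh
# Collect the suggested logs into one timestamped directory.
# The lspci commands are skipped gracefully if lspci is not installed.
logdir="pcie-logs-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$logdir"

collect() {  # collect <logfile> <command...>: tee stdout+stderr into the log
  outfile="$logdir/$1"; shift
  "$@" 2>&1 | tee "$outfile" > /dev/null
}

if command -v lspci > /dev/null 2>&1; then
  collect log_lspci_verbose.txt sudo lspci -vvv
  collect log_lspci_tree.txt    lspci -t -v
fi
collect log_dmesg.txt dmesg
echo "logs written to $logdir"
```

Even if a command fails (e.g., dmesg without permission), the error text lands in the log file, which is itself useful for debugging.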

Here are the .txt log files from the commands provided:
lspci -vvv 2>&1 | tee log_lspci_verbose.txt (17.7 KB)
lspci -t -v 2>&1 | tee log_lspci_tree.txt (1.3 KB)

There are actually a lot of Intel devices showing up on the PCI bus, and a Realtek gigabit ethernet adapter. Can you tell us exactly what is plugged in to the PCIe bus? Sometimes one device can interfere with another (I have no reason to believe this is the case, but we should know what is connected), even if it is as simple as consuming more power than the bus works well with, or noise from one affecting the other even if just when some particular data pattern is present (the NVMe is running at its full speed so this is unlikely…noise tends to cause it to drop back to a slower protocol).

The NVMe is capable of PCIe v4 speeds, and is running at that speed (a good sign for signal quality and the physical layer). The gigabit controller is capable of PCIe v1 speed, and is running at that speed. The NVMe is the current suspect, I’ll mostly ignore other devices.
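One quick way to make this comparison for every device at once is to filter the saved lspci log for the capability versus status lines. The sample-file creation below is only there so the snippet runs standalone; on the real system, point it at the log_lspci_verbose.txt captured earlier:

```shell
#!/bin/sh
# Compare advertised link capability (LnkCap) against negotiated status
# (LnkSta) for each device in a saved "lspci -vvv" log. If the log is
# absent, a small sample is created so this runs as-is.
log=log_lspci_verbose.txt
if [ ! -f "$log" ]; then
  printf '%s\n' \
    '0004:01:00.0 Non-Volatile memory controller: Samsung Electronics' \
    '		LnkCap:	Port #0, Speed 16GT/s, Width x4' \
    '		LnkSta:	Speed 16GT/s (ok), Width x4 (ok)' > "$log"
fi
# Device headers start with a domain:bus:dev.fn address; keep those plus
# the LnkCap/LnkSta lines so any speed or width downgrade stands out.
grep -E '^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]|LnkCap:|LnkSta:' "$log"
```

A device whose LnkSta speed is below its LnkCap speed has dropped back to a slower protocol, which is the usual signature of signal-quality trouble.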

Also, would it be possible to get a log of the error instead of a screenshot for dmesg? An example of getting the entire dmesg:
dmesg 2>&1 | tee log_dmesg.txt
(even if the image were not out of focus it is hard to copy and paste for values; there may have been a kernel OOPS or error mentioned earlier as well, and any kernel error to a specific driver’s stack frame is very useful since I’m thinking the PHY is not an issue)

It would actually be better to get a full serial console boot log instead of dmesg, although dmesg is useful (we can live with dmesg if needed). What dmesg fails to show is the setup of PCIe prior to reaching the Linux kernel. Setup state prior to handing over to Linux tends to matter, and serial console will include everything dmesg shows.

The screenshot does make it obvious that the NVMe is where the error is from, but one kind of needs to see what comes before this to have context. We know this is a symptom, but we don’t know yet why or when the issue starts (that information is not yet available to us).

For reference, this is the NVMe subset of the logs:

0004:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a (prog-if 02 [NVM Express])
	Subsystem: Samsung Electronics Co Ltd Device a801
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 56
	Region 0: Memory at 2428000000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR+
			 10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [b0] MSI-X: Enable+ Count=130 Masked-
		Vector table: BAR=0 offset=00003000
		PBA: BAR=0 offset=00002000
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [178 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
		LaneErrStat: 0
	Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [1bc v1] Lane Margining at the Receiver <?>
	Capabilities: [214 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [21c v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Capabilities: [3a0 v1] Data Link Feature <?>
	Kernel driver in use: nvme

To use lspci to show just this device slot, include the option “-s 0004:01:00.0”:
lspci -s 0004:01:00.0

No bits are set in the advanced error reporting bits (AER first error pointer).

I think finding out what the issue is requires seeing the actual kernel log (at least from dmesg).

Here is the full dmesg log
log_dmesg.txt (130.8 KB)

That isn’t actually the full dmesg log; the start of kernel boot is missing. A full serial console boot log, up until the error hits, would be far better because it also captures what happens before the kernel starts (a full dmesg log would at least have to include the kernel start as well as the error; that particular log is just the same error repeated many times).

Btw, if you see the word “quiet” anywhere in “/boot/extlinux/extlinux.conf”, then I would recommend removing it prior to any serial console boot log.
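If “quiet” is present, removing it can be scripted. The snippet below works on a sample copy so it is safe to run as-is; on the device you would run the same sed line against /boot/extlinux/extlinux.conf with sudo:

```shell
#!/bin/sh
# Demonstrate stripping the "quiet" kernel argument on a sample copy of
# extlinux.conf (on the device, edit /boot/extlinux/extlinux.conf with sudo).
conf=extlinux.conf.sample
printf 'LABEL primary\n  APPEND ${cbootargs} quiet\n' > "$conf"
sed -i 's/ quiet//' "$conf"   # drop the word "quiet" and its leading space
grep -q quiet "$conf" && echo "quiet still present" || echo "quiet removed"
```

Back up the real file first (e.g., `sudo cp /boot/extlinux/extlinux.conf /boot/extlinux/extlinux.conf.bak`) since a broken extlinux.conf can prevent boot.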

I have flashed the Orin NX with a new SSD. Please find the attached full dmesg log.
bootuplogs_dmesg.txt (70.2 KB)

Only the NVMe SSD is plugged in to the PCIe bus; no other devices are connected.

There is no word “quiet” anywhere in /boot/extlinux/extlinux.conf.

Here is the extlinux.conf text file:

DEFAULT primary

MENU TITLE L4T boot options

LABEL primary
MENU LABEL primary kernel
LINUX /boot/Image
FDT /boot/dtb/kernel_tegra234-p3767-0000-p3509-a02.dtb
INITRD /boot/initrd
APPEND ${cbootargs}

# When testing a custom kernel, it is recommended that you create a backup of
# the original kernel and add a new entry to this file so that the device can
# fallback to the original kernel. To do this:
#
# 1, Make a backup of the original kernel
#    sudo cp /boot/Image /boot/Image.backup
#
# 2, Copy your custom kernel into /boot/Image
#
# 3, Uncomment below menu setting lines for the original kernel
#
# 4, Reboot

LABEL backup
MENU LABEL backup kernel
LINUX /boot/Image.backup
FDT /boot/dtb/kernel_tegra234-p3767-0000-p3509-a02.dtb
INITRD /boot/initrd
APPEND ${cbootargs}

What do you see from “lsblk -f”? I want to be sure which partitions are actually used (the NVMe is PARTUUID=f780356d-e6d8-4811-81ee-6171f451907c, but “/boot” is probably on eMMC if this is an eMMC model and not a dev kit; custom boards are supposed to use eMMC modules). I see no mention of the .dtb file in the log, and to see this we likely need a full serial console boot log instead of just a dmesg (note that a DTS name is not necessarily the file name). The lsblk -f output will provide some information related to this.

Part of the confusion can be because there is a “/boot” on both eMMC and NVMe. The initrd file itself, and the kernel, and the device tree, might not be on the device you think it is on (or even the extlinux.conf). Is it possible to get a full serial console boot log?

The backup image steps look correct.

Incidentally, I did not see any PCIe error message on this last dmesg.

We have a simple design: Orin NX 16GB SOM, carrier board, and NVMe SSD. There is no internal eMMC; the only storage is the external NVMe SSD.

I have attached the full serial console boot log:
orin-serial-console.log (85.8 KB)

Here is what I found:

lsblk -f
loop0 vfat L4T-README 1234-ABCD
zram0 [SWAP]
zram1 [SWAP]
zram2 [SWAP]
zram3 [SWAP]
zram4 [SWAP]
zram5 [SWAP]
zram6 [SWAP]
zram7 [SWAP]
├─nvme0n1p1 ext4 113ae373-9fc1-4cac-8b6e-d887884e8891 835.6G 5% /
├─nvme0n1p11 vfat 92AC-0B15

The PCIe error is there at the end of the dmesg log:

IPv6: ADDRCONF(NETDEV_CHANGE): rndis0: link becomes ready
[ 599.758679] tegra-xudc 3550000.xudc: EP 13 (type: bulk, dir: in) enabled
[ 599.758697] tegra-xudc 3550000.xudc: EP 8 (type: bulk, dir: out) enabled
[ 599.758943] IPv6: ADDRCONF(NETDEV_CHANGE): usb0: link becomes ready
[ 599.759076] tegra-xudc 3550000.xudc: ep 13 disabled
[ 599.759104] tegra-xudc 3550000.xudc: ep 8 disabled
[ 599.778235] tegra-xudc 3550000.xudc: EP 13 (type: bulk, dir: in) enabled
[ 599.778258] tegra-xudc 3550000.xudc: EP 8 (type: bulk, dir: out) enabled
[ 600.958353] l4tbr0: port 2(usb0) entered blocking state
[ 600.958364] l4tbr0: port 2(usb0) entered forwarding state
[ 600.958411] l4tbr0: port 1(rndis0) entered blocking state
[ 600.958414] l4tbr0: port 1(rndis0) entered forwarding state
[ 601.377127] IPv6: ADDRCONF(NETDEV_CHANGE): l4tbr0: link becomes ready
[ 607.531222] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x730190 result 0x56:
[ 825.842103] pcieport 0004:00:00.0: AER: Corrected error received: 0004:01:00.0
[ 825.842149] nvme 0004:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[ 825.851810] nvme 0004:01:00.0: device [144d:a80a] error status/mask=00000040/0000e000
[ 825.860164] nvme 0004:01:00.0: [ 6] BadTLP

This is more interesting. A data link layer issue, without the physical layer having errors, could be from the NVMe itself. Basically this is a checksum type error, but the checksum is correctable. There was no error in processing the signal from PCIe over the physical connector. The NVMe might be bad, but it might also give this sort of error if power supply is not stable.
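The error status/mask words from the log line above can be decoded by hand; here is a sketch using shell arithmetic, with the values taken from this thread’s log:

```shell
#!/bin/sh
# Decode "error status/mask=00000040/0000e000" from the AER log line above.
# In the AER Correctable Error Status register, bit 6 is Bad TLP: a received
# TLP failed its LCRC check -- a data link layer checksum error that the
# link corrects by replaying the TLP.
status=$(( 0x00000040 ))
mask=$(( 0x0000e000 ))

[ $(( status & (1 << 6) )) -ne 0 ] && \
  echo "bit 6 set: Bad TLP (correctable LCRC error)"

# The mask word covers bits 13-15, so Bad TLP itself is unmasked and is
# reported to the root port (which is why it shows up in dmesg at all).
[ $(( status & mask )) -eq 0 ] && \
  echo "Bad TLP is not masked; it is reported"
```

This matches the kernel’s own decode, which printed “[ 6] BadTLP” for the same status word.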

I will point out that correctable errors may not be shown on all systems; that depends on the OS and on log verbosity settings. Not all systems are configured to show this error even when it does occur, so the drive is functioning in what I would call an inefficient state. I have no idea what this relates to, but it might be related:

Failed to create /rm/vdd_cpu
Failed to create /rm/vdd_cpu
debugfs initialized

On the other hand, this makes me quite curious:

[   17.279233] IRQ 115: no longer affine to CPU7

Has someone assigned IRQs to specific CPUs? If you are operating in a power model (nvpmodel) that does not use that CPU core, then this won’t really matter, but if someone has tuned for performance, this becomes interesting (invalid affinity assignments could show problems in odd ways).

What do you see from:
cat /proc/cmdline

This is what I see:

cat /proc/cmdline
root=PARTUUID=f780356d-e6d8-4811-81ee-6171f451907c rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200 console=ttyAMA0,115200 firmware_class.path=/etc/firmware fbcon=map:0 net.ifnames=0

I have 4-5 Orin NX 16GB SOMs and different SSD variants: Samsung 980 PRO 1TB and 2TB NVMe SSDs, and a Western Digital 500GB SSD.

I am facing the same issue with different SOMs and SSDs.

There can be more than one console in cmdline, but this would normally be restricted to one serial console and one local console on the attached monitor. ttyTCU0 is the one I would expect for the built-in serial console port. What is ttyAMA0? Do you have additional serial UART hardware? I’m wondering if somehow the device tree or other setup was modified for this. Technically, this shouldn’t affect PCIe, but if something is different on these boards, and if modifications were made to adapt to it, then perhaps it is related.

I don’t really have a way to find out what it is in the data link layer that is going wrong. We know that the physical layer is working. The handling of the data and its checksum opens a rather wide list of possibilities that I can’t even guess at without more clues. Thus I am hoping one of these other seemingly unrelated issues might offer a clue, e.g., vdd_cpu or the extra serial UART setup.

There is no additional serial UART hardware.

It looks like my Orin (AGX model) also has /dev/ttyAMA0. It also has both ttyAMA0 and ttyTCU0 as group tty. This is a change I did not know about, but it is odd to see this. NVIDIA may need to reply as to why both of these are consoles. Maybe @WayneWWW can answer if there are two serial consoles for some specific reason (my particular system runs L4T R35.x). However, this would not be related to the PCIe issue. My reason for bringing this up is that it had appeared that either boot parameters or device tree had been edited.

A big problem is that this is not the physical layer in error; it is somewhat easier to track signal issues or device tree issues related to the PHY. You have a CRC error though, and it is in the data link layer. This depends on the design of the device on the PCIe bus, and it may not be possible to figure out why without knowing details about the particular hardware design, or perhaps without a lot of Linux kernel driver edits to narrow in on what is going on.

I’m tempted to say that you will need to send the fully verbose log to kernel.org, but before doing that, what is the output of (a summary in one place):

  • head -n 1 /etc/nv_tegra_release
  • cat /proc/cmdline
  • uname -a
  • lsmod
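A small sketch that gathers all four outputs into one file (the filename is my own choice, not part of the request); errors such as a missing /etc/nv_tegra_release are captured in the log rather than aborting:

```shell
#!/bin/sh
# Gather the four requested outputs into a single log file, with a
# header marking each section.
{
  echo '== nv_tegra_release =='; head -n 1 /etc/nv_tegra_release 2>&1
  echo '== cmdline ==';          cat /proc/cmdline               2>&1
  echo '== uname ==';            uname -a                        2>&1
  echo '== lsmod ==';            lsmod                           2>&1
} > log_summary.txt
echo "wrote log_summary.txt"
```

The section headers make it easy to attach one file to the thread instead of four separate pastes.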

Here is what I found:

head -n 1 /etc/nv_tegra_release

R35 (release), REVISION: 2.1, GCID: 32413640, BOARD: t186ref, EABI: aarch64, DATE: Tue Jan 24 23:38:33 UTC 2023

cat /proc/cmdline
root=PARTUUID=f780356d-e6d8-4811-81ee-6171f451907c rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200 console=ttyAMA0,115200 firmware_class.path=/etc/firmware fbcon=map:0 net.ifnames=0

uname -a
Linux orin112 5.10.104-tegra #1 SMP PREEMPT Tue Jan 24 15:09:44 PST 2023 aarch64 aarch64 aarch64 GNU/Linux

lsmod
Module Size Used by
nvidia_modeset 1093632 3
fuse 118784 5
lzo_rle 16384 16
lzo_compress 16384 1 lzo_rle
zram 32768 4
ramoops 28672 0
reed_solomon 20480 1 ramoops
loop 36864 1
snd_soc_tegra210_iqc 16384 0
snd_soc_tegra210_ope 32768 1
snd_soc_tegra186_dspk 20480 2
snd_soc_tegra186_asrc 36864 1
snd_soc_tegra186_arad 24576 2 snd_soc_tegra186_asrc
snd_soc_tegra210_mvc 20480 2
snd_soc_tegra210_afc 20480 6
snd_soc_tegra210_admaif 118784 1
snd_soc_tegra210_adx 28672 4
snd_soc_tegra210_mixer 45056 1
snd_soc_tegra210_dmic 20480 4
snd_soc_tegra_pcm 16384 1 snd_soc_tegra210_admaif
snd_soc_tegra210_amx 32768 4
snd_soc_tegra210_i2s 24576 6
snd_soc_tegra210_sfc 57344 4
aes_ce_blk 36864 0
crypto_simd 24576 1 aes_ce_blk
cryptd 28672 1 crypto_simd
aes_ce_cipher 20480 1 aes_ce_blk
ghash_ce 28672 0
sha2_ce 20480 0
sha256_arm64 28672 1 sha2_ce
sha1_ce 20480 0
snd_soc_spdif_tx 16384 0
snd_soc_tegra_machine_driver 16384 0
snd_hda_codec_hdmi 57344 1
snd_soc_tegra210_ahub 1228800 3 snd_soc_tegra210_ope,snd_soc_tegra210_sfc
snd_soc_tegra210_adsp 753664 1
r8168 471040 0
tegra_bpmp_thermal 16384 0
userspace_alert 16384 0
snd_soc_tegra_utils 28672 3 snd_soc_tegra210_admaif,snd_soc_tegra_machine_driver,snd_soc_tegra210_adsp
snd_hda_tegra 16384 0
snd_soc_simple_card_utils 24576 1 snd_soc_tegra_utils
tegra210_adma 28672 2 snd_soc_tegra210_admaif,snd_soc_tegra210_adsp
nvadsp 110592 1 snd_soc_tegra210_adsp
snd_hda_codec 118784 2 snd_hda_codec_hdmi,snd_hda_tegra
snd_hda_core 81920 3 snd_hda_codec_hdmi,snd_hda_codec,snd_hda_tegra
nv_imx219 20480 0
r8169 81920 0
spi_tegra114 32768 0
realtek 24576 1
nvidia 1339392 7 nvidia_modeset
binfmt_misc 24576 1
ina3221 24576 0
pwm_fan 24576 0
nvgpu 2494464 20
nvmap 192512 62 nvgpu
ip_tables 36864 0
x_tables 49152 1 ip_tables

Is there a possibility of getting help from the device vendor? You would:

  • Explain that it is running in Linux.
  • Explain that there is a data link layer checksum error (not a physical layer signal error).
  • Attach the lsmod, uname -r, etc., information for that manufacturer.
  • Request information on what might cause a checksum error in the data link layer when the physical layer is running without error.