Hi, I am getting a PCIe Bus Error. I am using a custom carrier board with a 500 GB SSD and an Orin NX SOM.
Whenever I access the Orin NX over ssh and check dmesg, a PCIe bus error shows up in the log.
It is not possible to give you any answer from just this one-line comment.
A bus error could come from hardware or any number of causes. If you want to find the cause, capture a PCIe trace with a PCIe analyzer.
I also suggest these logs (the commands below also write their output to a log file):
sudo lspci -vvv 2>&1 | tee log_lspci_verbose.txt
lspci -t -v 2>&1 | tee log_lspci_tree.txt
Here are the .txt log files from the suggested commands:
lspci -vvv 2>&1 | tee log_lspci_verbose.txt (17.7 KB)
lspci -t -v 2>&1 | tee log_lspci_tree.txt (1.3 KB)
There are actually a lot of Intel devices showing up on the PCI bus, and a Realtek gigabit ethernet adapter. Can you tell us exactly what is plugged in to the PCIe bus? Sometimes one device can interfere with another (I have no reason to believe this is the case, but we should know what is connected), even if it is as simple as consuming more power than the bus works well with, or noise from one affecting the other even if just when some particular data pattern is present (the NVMe is running at its full speed so this is unlikely…noise tends to cause it to drop back to a slower protocol).
The NVMe is capable of PCIe v4 speeds, and is running at that speed (a good sign for signal quality and the physical layer). The gigabit controller is capable of PCIe v1 speed, and is running at that speed. The NVMe is the current suspect, I’ll mostly ignore other devices.
Also, would it be possible to get a log of the error instead of a screenshot for dmesg? An example of getting the entire dmesg:
dmesg 2>&1 | tee log_dmesg.txt
(Even if the image were not out of focus, it is hard to copy and paste values from a screenshot. There may also have been a kernel OOPS or error mentioned earlier, and any kernel error pointing at a specific driver's stack frame is very useful, since I'm thinking the PHY is not the issue.)
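If it helps, this is a rough filter for spotting an earlier OOPS or stack trace in the same log (just a convenience; the keywords are common markers and will not catch everything):
dmesg | grep -iE 'oops|call trace|bug:' 2>&1 | tee log_dmesg_errors.txt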
It would actually be better to get a full serial console boot log instead of dmesg, although dmesg is useful (we can live with dmesg if needed). What dmesg fails to show is the setup of PCIe prior to reaching the Linux kernel. Setup state prior to handing over to Linux tends to matter, and the serial console will include everything dmesg shows.
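For the serial console capture itself, one option is minicom on the host PC (a sketch; this assumes the debug UART shows up on the host as /dev/ttyUSB0, which may differ on your setup):
minicom -b 115200 -D /dev/ttyUSB0 -C serial_boot.log
Start the capture before powering on the Jetson so the earliest boot stages are included.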
The screenshot does make it obvious that the NVMe is where the error is from, but one kind of needs to see what comes before this to have context. We know this is a symptom, but we don’t know yet why or when the issue starts (that information is not yet available to us).
For reference, this is the NVMe subset of the logs:
0004:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device a801
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 56
Region 0: Memory at 2428000000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (ok), Width x4 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR+
10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable+ Count=130 Masked-
Vector table: BAR=0 offset=00003000
PBA: BAR=0 offset=00002000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [178 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [1bc v1] Lane Margining at the Receiver <?>
Capabilities: [214 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [21c v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [3a0 v1] Data Link Feature <?>
Kernel driver in use: nvme
If one were to use lspci to show just this device slot, include the option "-s 0004:01:00.0":
lspci -s 0004:01:00.0
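For example, to log just the NVMe's verbose output in the same style as the earlier commands (sudo so all capability registers are readable; the file name is only a suggestion):
sudo lspci -s 0004:01:00.0 -vvv 2>&1 | tee log_nvme_lspci.txt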
No bits are set in the advanced error reporting bits (AER first error pointer).
I think knowing what the issue is mandates seeing the actual log of the kernel (at least from dmesg).
Here is the full dmesg log
log_dmesg.txt (130.8 KB)
That isn't actually the full dmesg log. The start of kernel boot is missing. A full serial console boot log up until the error hits would be far better, because it also captures the time before the kernel starts (a full dmesg log would have to include the kernel start as well as the error; that particular log is just the same error repeated many times).
Btw, if you see the word "quiet" anywhere in "/boot/extlinux/extlinux.conf", then I would recommend removing it prior to capturing any serial console boot log.
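A quick way to check for that word (no output means it is not present; this assumes the stock file location):
grep -n quiet /boot/extlinux/extlinux.conf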
I have flashed the Orin NX with a new SSD. Please find the attached full dmesg log.
bootuplogs_dmesg.txt (70.2 KB)
Only the NVMe SSD is plugged into the PCIe bus; no other devices are connected.
There is no word "quiet" anywhere in /boot/extlinux/extlinux.conf.
Here is the extlinux.conf text file:
TIMEOUT 30
DEFAULT primary
MENU TITLE L4T boot options
LABEL primary
MENU LABEL primary kernel
LINUX /boot/Image
FDT /boot/dtb/kernel_tegra234-p3767-0000-p3509-a02.dtb
INITRD /boot/initrd
APPEND ${cbootargs}
When testing a custom kernel, it is recommended that you create a backup of
the original kernel and add a new entry to this file so that the device can
fallback to the original kernel. To do this:
1, Make a backup of the original kernel
sudo cp /boot/Image /boot/Image.backup
2, Copy your custom kernel into /boot/Image
3, Uncomment below menu setting lines for the original kernel
4, Reboot
LABEL backup
MENU LABEL backup kernel
LINUX /boot/Image.backup
FDT /boot/dtb/kernel_tegra234-p3767-0000-p3509-a02.dtb
INITRD /boot/initrd
APPEND ${cbootargs}
What do you see from "lsblk -f"? I want to be sure which partitions are actually used (the NVMe is PARTUUID=f780356d-e6d8-4811-81ee-6171f451907c, but "/boot" is probably on eMMC if it is an eMMC model and not a dev kit…custom boards are supposed to use eMMC modules). I see no mention of the .dtb file in the log, and to see this we likely need a full serial console boot log instead of just a dmesg (note that a DTS name is not necessarily the file name). The lsblk -f output will provide some information related to this.
Part of the confusion can be because there is a "/boot" on both eMMC and NVMe. The initrd file itself, the kernel, and the device tree might not be on the device you think they are on (or even the extlinux.conf). Is it possible to get a full serial console boot log?
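In case it is useful, these show which devices "/" and "/boot" are actually mounted from (a sketch; findmnt -T resolves the filesystem containing the given path):
lsblk -f 2>&1 | tee log_lsblk.txt
findmnt -T /boot 2>&1 | tee log_findmnt_boot.txt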
The backup image steps look correct.
Incidentally, I did not see any PCIe error message on this last dmesg.
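If you want to catch the error the moment it occurs, something like this can be left running in a terminal (assuming the util-linux dmesg, which supports follow mode; the keywords are just the ones from your screenshot):
dmesg -w | grep --line-buffered -iE 'AER|PCIe Bus Error' | tee log_pcie_watch.txt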
We have a simple design: an Orin NX 16GB SOM, a carrier board, and an NVMe SSD. There is no internal eMMC, only the external NVMe SSD for storage.
I have attached the full serial console boot log:
orin-serial-console.log (85.8 KB)
Here is what I found:
lsblk -f
NAME FSTYPE LABEL UUID FSAVAIL FSUSE% MOUNTPOINT
loop0 vfat L4T-README 1234-ABCD
zram0 [SWAP]
zram1 [SWAP]
zram2 [SWAP]
zram3 [SWAP]
zram4 [SWAP]
zram5 [SWAP]
zram6 [SWAP]
zram7 [SWAP]
nvme0n1
├─nvme0n1p1 ext4 113ae373-9fc1-4cac-8b6e-d887884e8891 835.6G 5% /
├─nvme0n1p2
├─nvme0n1p3
├─nvme0n1p4
├─nvme0n1p5
├─nvme0n1p6
├─nvme0n1p7
├─nvme0n1p8
├─nvme0n1p9
├─nvme0n1p10
├─nvme0n1p11 vfat 92AC-0B15
├─nvme0n1p12
├─nvme0n1p13
└─nvme0n1p14
The PCIe error is there at the end of the dmesg log:
IPv6: ADDRCONF(NETDEV_CHANGE): rndis0: link becomes ready
[ 599.758679] tegra-xudc 3550000.xudc: EP 13 (type: bulk, dir: in) enabled
[ 599.758697] tegra-xudc 3550000.xudc: EP 8 (type: bulk, dir: out) enabled
[ 599.758943] IPv6: ADDRCONF(NETDEV_CHANGE): usb0: link becomes ready
[ 599.759076] tegra-xudc 3550000.xudc: ep 13 disabled
[ 599.759104] tegra-xudc 3550000.xudc: ep 8 disabled
[ 599.778235] tegra-xudc 3550000.xudc: EP 13 (type: bulk, dir: in) enabled
[ 599.778258] tegra-xudc 3550000.xudc: EP 8 (type: bulk, dir: out) enabled
[ 600.958353] l4tbr0: port 2(usb0) entered blocking state
[ 600.958364] l4tbr0: port 2(usb0) entered forwarding state
[ 600.958411] l4tbr0: port 1(rndis0) entered blocking state
[ 600.958414] l4tbr0: port 1(rndis0) entered forwarding state
[ 601.377127] IPv6: ADDRCONF(NETDEV_CHANGE): l4tbr0: link becomes ready
[ 607.531222] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x730190 result 0x56:
[ 825.842103] pcieport 0004:00:00.0: AER: Corrected error received: 0004:01:00.0
[ 825.842149] nvme 0004:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[ 825.851810] nvme 0004:01:00.0: device [144d:a80a] error status/mask=00000040/0000e000
[ 825.860164] nvme 0004:01:00.0: [ 6] BadTLP
This is more interesting. A data link layer issue, without the physical layer having errors, could be from the NVMe itself. Basically this is a checksum-type error, but it is correctable. There was no error in processing the signal from PCIe over the physical connector. The NVMe might be bad, but it might also give this sort of error if its power supply is not stable.
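One way to see whether the corrected-error count keeps climbing over time (assuming your kernel exposes the AER statistics in sysfs, which a 5.10 kernel should):
cat /sys/bus/pci/devices/0004:01:00.0/aer_dev_correctable 2>&1 | tee log_aer_correctable.txt
The BadTLP counter in that file should increment each time a corrected error like the one above is reported.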
I will point out that correctable errors may not be shown on all systems; that depends on the OS and on settings for log verbosity. Not every setup is configured to report this error even when it does occur, so the drive is functioning in what I would call an inefficient state. I have no idea whether the following is connected, but it might be related:
Failed to create /rm/vdd_cpu
Failed to create /rm/vdd_cpu
debugfs initialized
On the other hand, this makes me quite curious:
[ 17.279233] IRQ 115: no longer affine to CPU7
Has someone assigned IRQs to specific CPUs? If you are operating in a power model (nvpmodel) that does not use that CPU core, then this won't really matter, but if someone has tuned for performance, this becomes interesting (invalid affinity assignments could show problems in odd ways).
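To check this, the current affinity of that IRQ and the active power model can be inspected (a sketch; IRQ 115 is taken from the log line above):
cat /proc/irq/115/smp_affinity_list
grep -E '^\s*115:' /proc/interrupts
sudo nvpmodel -q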
What do you see from:
cat /proc/cmdline
This is what I see:
cat /proc/cmdline
root=PARTUUID=f780356d-e6d8-4811-81ee-6171f451907c rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200 console=ttyAMA0,115200 firmware_class.path=/etc/firmware fbcon=map:0 net.ifnames=0
I have 4-5 Orin NX 16GB SOMs and several SSD variants: a 1TB Samsung 980 PRO NVMe SSD, a 2TB Samsung 980 PRO, and a 500GB Western Digital SSD.
I am facing the same issue with the different SOMs and SSDs.
There can be more than one console in cmdline, but this would normally be restricted to one serial console and one local console on the attached monitor. ttyTCU0 would be the one I expect for the built-in serial console port. What is ttyAMA0? Do you have additional serial UART hardware? I'm wondering whether the device tree or other setup was modified for this. Technically, this shouldn't affect PCIe, but if something is different on these boards, and if modifications were made to adapt to it, then perhaps it is related.
I don't really have a way to find out what is going wrong in the data link layer. We know that the physical layer is working. The handling of the data and its checksum leaves a rather wide list of possibilities I can't even guess at without more clues. Thus I am hoping one of these other seemingly unrelated issues might offer a clue, e.g., vdd_cpu or the extra serial UART setup.
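A quick way to see which consoles the kernel actually registered, and where the two tty names come from (informational only; it changes nothing):
cat /proc/consoles
dmesg | grep -iE 'ttyTCU|ttyAMA' 2>&1 | tee log_consoles.txt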
There is no additional serial UART hardware.
It looks like my Orin (AGX model) also has /dev/ttyAMA0. It also has both ttyAMA0 and ttyTCU0 in group tty. This is a change I did not know about, and it is odd to see. NVIDIA may need to reply as to why both of these are consoles. Maybe @WayneWWW can answer whether there are two serial consoles for some specific reason (my particular system runs L4T R35.x). However, this would not be related to the PCIe issue. My reason for bringing this up is that it had appeared that either the boot parameters or the device tree had been edited.
A big problem is that this is not the physical layer in error. It is somewhat easier to track signal issues or device tree issues related to the PHY. You have a CRC error though, and it is in a data layer. This depends on the design of the device on the PCIe bus and it may not be possible to figure out why without knowing either details about the particular hardware design, or perhaps with a lot of Linux kernel driver edits to narrow in on what is going on.
I’m tempted to say that you will need to send the fully verbose log to kernel.org, but before doing that, what is the output of (a summary in one place):
head -n 1 /etc/nv_tegra_release
cat /proc/cmdline
uname -a
lsmod
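If it is easier, all four can be captured into one file in the same style as the earlier commands (just a convenience; the file name is arbitrary):
{ head -n 1 /etc/nv_tegra_release; cat /proc/cmdline; uname -a; lsmod; } 2>&1 | tee log_summary.txt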
Here is what I found:
head -n 1 /etc/nv_tegra_release
cat /proc/cmdline
root=PARTUUID=f780356d-e6d8-4811-81ee-6171f451907c rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200 console=ttyAMA0,115200 firmware_class.path=/etc/firmware fbcon=map:0 net.ifnames=0
uname -a
Linux orin112 5.10.104-tegra #1 SMP PREEMPT Tue Jan 24 15:09:44 PST 2023 aarch64 aarch64 aarch64 GNU/Linux
lsmod
Module Size Used by
nvidia_modeset 1093632 3
fuse 118784 5
lzo_rle 16384 16
lzo_compress 16384 1 lzo_rle
zram 32768 4
ramoops 28672 0
reed_solomon 20480 1 ramoops
loop 36864 1
snd_soc_tegra210_iqc 16384 0
snd_soc_tegra210_ope 32768 1
snd_soc_tegra186_dspk 20480 2
snd_soc_tegra186_asrc 36864 1
snd_soc_tegra186_arad 24576 2 snd_soc_tegra186_asrc
snd_soc_tegra210_mvc 20480 2
snd_soc_tegra210_afc 20480 6
snd_soc_tegra210_admaif 118784 1
snd_soc_tegra210_adx 28672 4
snd_soc_tegra210_mixer 45056 1
snd_soc_tegra210_dmic 20480 4
snd_soc_tegra_pcm 16384 1 snd_soc_tegra210_admaif
snd_soc_tegra210_amx 32768 4
snd_soc_tegra210_i2s 24576 6
snd_soc_tegra210_sfc 57344 4
aes_ce_blk 36864 0
crypto_simd 24576 1 aes_ce_blk
cryptd 28672 1 crypto_simd
aes_ce_cipher 20480 1 aes_ce_blk
ghash_ce 28672 0
sha2_ce 20480 0
sha256_arm64 28672 1 sha2_ce
sha1_ce 20480 0
snd_soc_spdif_tx 16384 0
snd_soc_tegra_machine_driver 16384 0
snd_hda_codec_hdmi 57344 1
snd_soc_tegra210_ahub 1228800 3 snd_soc_tegra210_ope,snd_soc_tegra210_sfc
snd_soc_tegra210_adsp 753664 1
r8168 471040 0
tegra_bpmp_thermal 16384 0
userspace_alert 16384 0
snd_soc_tegra_utils 28672 3 snd_soc_tegra210_admaif,snd_soc_tegra_machine_driver,snd_soc_tegra210_adsp
snd_hda_tegra 16384 0
snd_soc_simple_card_utils 24576 1 snd_soc_tegra_utils
tegra210_adma 28672 2 snd_soc_tegra210_admaif,snd_soc_tegra210_adsp
nvadsp 110592 1 snd_soc_tegra210_adsp
snd_hda_codec 118784 2 snd_hda_codec_hdmi,snd_hda_tegra
snd_hda_core 81920 3 snd_hda_codec_hdmi,snd_hda_codec,snd_hda_tegra
nv_imx219 20480 0
r8169 81920 0
spi_tegra114 32768 0
realtek 24576 1
nvidia 1339392 7 nvidia_modeset
binfmt_misc 24576 1
ina3221 24576 0
pwm_fan 24576 0
nvgpu 2494464 20
nvmap 192512 62 nvgpu
ip_tables 36864 0
x_tables 49152 1 ip_tables
Is there a possibility of getting help from the device vendor? You would send the lsmod, uname -r, etc., information to that manufacturer.