PCIe Bus Error

Hi,

I designed my own carrier board for the Xavier module. When I insert a x4-lane card into the custom board’s PCIe slot, I keep getting the following error and the desktop is not displayed. The error is printed over and over.

PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0008(Receiver … device [10de:0fae] error status/mask=00000000/00002000
[ 0] Receiver Error

When I searched online, I found suggestions to change the “grub” file in the /etc/default/ folder as follows.
[image: screenshot of the suggested change to /etc/default/grub]
However, there is no such file on the Xavier module. How can I change the PCIe boot arguments on Xavier? Is there an equivalent for Xavier? Which file should I change, and what should the change be?

Also, does the change described there (“PCIe Bus Error: severity=Corrected” on Jetson Nano) apply to Xavier as well?

This is most likely because ASPM L1 is enabled. You can try disabling ASPM using the following methods. Many off-the-shelf PCIe devices have issues with ASPM states (even though they advertise support for them).

  • Disabling ASPM can be achieved in the following ways for a platform
  • Disabling from the beginning
    • Appending ‘pcie_aspm=off’ to the kernel command line
    • Removing “CONFIG_PCIEASPM_POWERSAVE=y” and setting “CONFIG_PCIEASPM_PERFORMANCE=y” in the kernel configuration
  • Disabling after system boots to console
    • Executing the below command once the system boots to console
      • echo "performance" > /sys/module/pcie_aspm/parameters/policy
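
As a rough sketch of the "after boot" method, the helpers below read the active ASPM policy and write a new one. The function names and the optional path argument are my own, added for illustration and so the helpers can be tried on an ordinary file; the sysfs path is the one quoted above. On a real system the write needs root.

```shell
#!/bin/sh
# Default sysfs node for the ASPM policy (path taken from the post above).
ASPM_POLICY_FILE="${ASPM_POLICY_FILE:-/sys/module/pcie_aspm/parameters/policy}"

# Print the active policy. The kernel lists the available policies and puts
# the active one in square brackets, e.g. "default performance [powersave]".
# $1 is an optional file to read instead of the sysfs node (hypothetical
# convenience for testing).
aspm_current_policy() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "${1:-$ASPM_POLICY_FILE}"
}

# Select policy $1. $2 is an optional file to write instead of the sysfs node.
aspm_set_policy() {
    printf '%s\n' "$1" > "${2:-$ASPM_POLICY_FILE}"
}

# Example (as root):
#   aspm_set_policy performance
#   aspm_current_policy
```

This setting does not survive a reboot, so it is mainly useful for checking whether ASPM is the cause before changing the kernel command line or the kernel configuration.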

GRUB only works on a motherboard with a BIOS. Embedded systems do not have a BIOS. The Jetson’s boot content and many partitions are essentially a custom version of GRUB and the BIOS in software. On many Jetsons the various boot stages lead to U-Boot, and it is U-Boot which takes the place of GRUB. In the case of the AGX Xavier there is a boot stage known as “CBoot”. Instead of loading U-Boot, CBoot has some functionality which mirrors U-Boot, and the Linux kernel is loaded directly from CBoot.

Anything regarding updates via GRUB will be incorrect, except that if the update is to change kernel command line parameters, the command line parameters themselves will be correct. You have to find a way to use the CBoot (or U-Boot) mechanism to add that same parameter.

Assuming that the above information really has only the goal to get “pci=nomsi” as a kernel command line parameter, you are in luck. Some of the other content is probably unrelated to what you are doing, but the addition of kernel command line parameter “pci=nomsi” looks to be the real goal of that patch or research.

Your current kernel command line can be viewed on a running system via “cat /proc/cmdline”. If you add your parameter, then you’d see “pci=nomsi” somewhere in that command line.
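
If you want to script that check, here is a small sketch. The function name and the optional file argument are made up for illustration; it only tests whether the exact token appears on the command line.

```shell
#!/bin/sh
# Return 0 if the exact token $1 (e.g. "pci=nomsi") is on the kernel command
# line. $2 is an optional file to read instead of /proc/cmdline (hypothetical
# convenience for testing).
has_kernel_param() {
    # Put each space-separated argument on its own line, then look for a
    # whole-line, fixed-string match.
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Example:
#   has_kernel_param pci=nomsi && echo "pci=nomsi is active"
```

Matching the whole token avoids a false positive when one parameter is a prefix of another.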

Before I show the change, be aware that you can use serial console and pick among multiple boot entries. If something goes wrong, then picking the original boot entry (without any modification) would get you back in the system without any effort. I’ll show you how to add an entry rather than how to just throw away the original entry with your modification.

File “/boot/extlinux/extlinux.conf” is somewhat similar to a GRUB setup. Within it you will see an excerpt similar to this (but I’m using an NX instead of an AGX, so yours will differ):

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      INITRD /boot/initrd
      APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyTCU0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0

Note that the “APPEND” parameter appends kernel command line arguments (space delimited). If you were to modify this entry, then it would be something like this:

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      INITRD /boot/initrd
      APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyTCU0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 pci=nomsi

…and note I added “pci=nomsi” at the end of the APPEND line.

To instead create an alternate boot entry which can be picked from the serial console:

LABEL testing
      MENU LABEL testing
      LINUX /boot/Image
      INITRD /boot/initrd
      APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyTCU0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 pci=nomsi

If you had this second entry, then the first entry would remain available even if the second entry fails to boot. During boot you’d have a short moment to select either the “primary kernel” or “testing” entry via serial console (the default still goes to the original entry). To tell extlinux.conf to boot your new entry by default (but still leave the original entry), change:
DEFAULT primary
…to instead be:
DEFAULT testing
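
If you would rather generate the extra entry than type it, the sketch below prints the existing file followed by a copy of the “primary” entry, relabelled “testing”, with one parameter added to its APPEND line. The function name is made up for illustration, and it assumes the file is laid out like the excerpts above (a “LABEL primary” line followed by indented MENU LABEL, LINUX, INITRD and APPEND lines). It only writes to standard output, so nothing is changed until you copy the result into place yourself.

```shell
#!/bin/sh
# add_test_entry CONF PARAM
# Print CONF unchanged, then a copy of its "primary" entry relabelled
# "testing" with PARAM appended to the APPEND line.
add_test_entry() {
    cat "$1"
    printf '\n'
    awk -v param="$2" '
        # Track whether we are inside the entry whose label is "primary".
        /^[ \t]*LABEL[ \t]/      { in_primary = ($2 == "primary") }
        !in_primary              { next }
        /^[ \t]*LABEL[ \t]/      { print "LABEL testing"; next }
        /^[ \t]*MENU LABEL[ \t]/ { print "      MENU LABEL testing"; next }
        /^[ \t]*APPEND[ \t]/     { print $0 " " param; next }
        # Skip comments and blank lines that trail the primary entry.
        /^[ \t]*#/ || /^[ \t]*$/ { next }
        { print }
    ' "$1"
}

# Example (review the output before copying it over the original as root):
#   add_test_entry /boot/extlinux/extlinux.conf pci=nomsi > /tmp/extlinux.conf.new
```

The DEFAULT line is left alone on purpose, so the original entry still boots unless you pick “testing” from the serial console. The example path is where the file lives on the L4T releases I have seen; adjust it if yours differs.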

Almost forgot: Nano and Xavier, if using PCIe, are more or less the same so far as many kernel command line options go. Not a guarantee, but much is the same between all of the Jetsons and even a desktop PC.

Hi,
Thank you for the information. Really useful information.
As you suggested, I added “pci=nomsi” in the “/boot/extlinux/extlinux.conf” file and it worked. When I insert a x4-lane card into the custom board’s PCIe slot, it no longer gives the error. However, this change seems like a temporary solution. How appropriate is it to turn off interrupts? Is there a more permanent way to fix this error?

There is also another problem with PCIe. I have another x4-lane PCIe card. When I insert that card, the module freezes at a certain point and does not boot. Every time I power it off and on with this card, it prints two different logs on the screen and then the module freezes. The output printed on the screen is below. What do you think could be the problem here?


A serial console full boot log would be far better. See:
https://elinux.org/Jetson/General_debug
(which logs content even prior to Linux itself loading, and is searchable…often the error is less important than what goes on prior to and leading up to the error)

Turning off interrupts is likely to be equivalent to removing the drivers. If the IRQ triggers a driver to run, and the driver has a failure, I wouldn’t blame it on the IRQ. The driver itself might even be correct, but some argument needs to be passed to it for a specific hardware situation. Depends on the card itself so I can’t really say anything useful on that. The next step is to provide a full serial console boot log.