Disable ASPM via kernel command line

I have a Jetson Nano module connected to an FPGA PCIe endpoint. During boot the Jetson sometimes gets stuck due to a correctable, ASPM-related PCIe error (see "PCIe Bus Error: severity=Corrected" on Jetson Nano). Disabling ASPM during boot seems to be the recommended solution. Since the Jetson Nano does not use a BIOS, the best way to disable ASPM appears to be via a kernel command line parameter.
Please advise on how this can be achieved.

head -n 1 /etc/nv_tegra_release 
# R32 (release), REVISION: 7.1, GCID: 29818004, BOARD: t210ref, EABI: aarch64,

Disabling ASPM? It’s mentioned at "PCIe Bus Error: severity=Corrected" on Jetson Nano - #3 by vidyas

Yes, I would like to disable from beginning as described in the post -

  • Disabling from the beginning
    • Appending ‘pcie_aspm=off’ to the kernel command line
    • Removing “CONFIG_PCIEASPM_POWERSAVE=y” and setting “CONFIG_PCIEASPM_PERFORMANCE=y” in the kernel configuration

I’m new to Linux, so I’m not sure how to append ‘pcie_aspm=off’ to the kernel command line. Could you share a command sequence or procedure?
Also, where is the “CONFIG_PCIEASPM_POWERSAVE=y” setting? How do I set “CONFIG_PCIEASPM_PERFORMANCE=y”? Are these stored in some file?

The kernel command line could be changed through the device tree, but by far the simplest way is to edit the file “/boot/extlinux/extlinux.conf”. Find the “APPEND” key/value pair in “extlinux.conf”. Note that even if the line wraps in your editor, it is a single long line with parameters separated by spaces. Simply add a space at the end of the line followed by “pcie_aspm=off”.
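
If hand-editing feels risky, the same change can be sketched in Python. This is only an illustration: the function name append_kernel_param and the sample text are made up for the example, and the sample is not the real default “extlinux.conf” content. The function works on text only, so you can inspect the result before writing anything back to “/boot/extlinux/extlinux.conf” (keep a backup of the original file).

```python
def append_kernel_param(conf_text: str, param: str = "pcie_aspm=off") -> str:
    """Return conf_text with param added to every APPEND line that lacks it."""
    out = []
    for line in conf_text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        ending = line[len(body):]          # preserve the original line ending
        fields = body.split()
        # Only touch real APPEND lines; commented-out lines start with "#".
        if fields and fields[0].upper() == "APPEND" and param not in fields[1:]:
            body = body.rstrip() + " " + param
        out.append(body + ending)
    return "".join(out)


# Illustrative sample only, not the actual file shipped with R32.7.1.
sample = "LABEL primary\n      LINUX /boot/Image\n      APPEND ${cbootargs} quiet\n"
print(append_kernel_param(sample), end="")
```

Running the function a second time leaves the text unchanged, so it should be safe to apply repeatedly.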

You don’t really need to customize the kernel, but if you do, then check the “kernel customization” part of the R32.7.1 documentation.
https://docs.nvidia.com/jetson/archives/l4t-archived/l4t-3261/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/kernel_custom.html#

You would have to compile a new kernel which starts with the default configuration, then change those values after the default (unset “CONFIG_PCIEASPM_POWERSAVE” and set “CONFIG_PCIEASPM_PERFORMANCE=y”), then build and install both the kernel and the modules (after changing a “y” feature to “n” it is usually best to reinstall modules as well). You’ll want a new “CONFIG_LOCALVERSION”, and you should probably keep the old kernel Image and add the new one as something like “Image-noaspm” (the matching “CONFIG_LOCALVERSION” would then be “-noaspm”).
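
On the question of where these settings are stored: they live in the kernel build configuration (the “.config” file in the kernel build tree). A running kernel may also expose its configuration as “/proc/config.gz”, but only if it was built with CONFIG_IKCONFIG_PROC, which I am assuming here rather than stating as fact for every release. A minimal sketch to list the ASPM-related options, with a hypothetical helper name aspm_options:

```python
import gzip


def aspm_options(config_text: str) -> dict:
    """Collect CONFIG_PCIEASPM* entries from kernel config text."""
    opts = {}
    unset_suffix = " is not set"
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_PCIEASPM") and "=" in line:
            name, value = line.split("=", 1)
            opts[name] = value
        elif line.startswith("# CONFIG_PCIEASPM") and line.endswith(unset_suffix):
            # Disabled options appear as "# CONFIG_FOO is not set".
            opts[line[2:-len(unset_suffix)]] = "n"
    return opts


try:
    with gzip.open("/proc/config.gz", "rt") as f:
        print(aspm_options(f.read()))
except OSError:
    print("/proc/config.gz is not available on this kernel")
```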

You don’t need to work on the kernel if the command line is in place. You can verify the kernel command line has what you want via “cat /proc/cmdline”.
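
A small sketch of that check in Python, for anyone who wants to script it. The helper names are hypothetical. The second path, “/sys/module/pcie_aspm/parameters/policy”, is where kernels built with ASPM support normally expose the active policy; treat its presence on your image as an assumption.

```python
def cmdline_has(cmdline: str, param: str) -> bool:
    """True if param appears as a whole word on the kernel command line."""
    return param in cmdline.split()


def read_text(path: str):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


cmdline = read_text("/proc/cmdline")
if cmdline is not None:
    print("pcie_aspm=off present:", cmdline_has(cmdline, "pcie_aspm=off"))

policy = read_text("/sys/module/pcie_aspm/parameters/policy")
if policy is not None:
    print("ASPM policy:", policy)
```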

Thank you for sharing that!

The errors don’t completely go away with the fix, but the frequency seems to change based on power sequence.

If the error is during boot, then it might be the FPGA not being ready when the PCIe bus is scanned. As an experiment, have you tried making sure the FPGA gets power and is booted before the Jetson itself begins to boot?

The error is seen during bootup. The FPGA is configured and ready before the Jetson is powered up. I see different behavior depending on the power sequence.

For the following sequence, the error message is printed a few times before the Jetson boot continues and completes:
FPGA power cycle → FPGA Endpoint ready → Jetson power on → error seen → boot complete → login screen

From this ‘good’ state, if Jetson is power cycled (without power cycling FPGA Endpoint), then the error keeps being printed and Jetson is stuck in boot state. Would this point to some difference in enumeration/training phase when only Jetson is power cycled?

I couldn’t say what this means, but you will need to post a serial console boot log (which includes logging prior to Linux starting) for both cases:

  • Initial cold boot with FPGA already powered, and error but boot continues,
  • and reboot (warm boot) with error which blocks.

Without a serial console boot log of both cases I don’t think there is a possibility of narrowing the issue. Be sure to emphasize each time a log is posted as to whether the FPGA was already powered on prior to power or reboot of the Jetson.

serial_coldboot_complete.txt (31.7 KB)

ok,
Initial cold boot with FPGA already powered, and error but boot continues
In short, steps are:

  1. FPGA power off
  2. FPGA power on
  3. FPGA configured and Endpoint ready
  4. Jetson Power on
  5. Power down Nano while FPGA is untouched (no power cycle for FPGA).
    The log for steps 1-5 is contained in ‘serial_coldboot_complete.txt’ file.

serial_coldboot_stuck.txt (309.0 KB)

After 1-5, next steps are:
6. Nano is powered on and gets stuck
The log for step 6, is contained in file ‘serial_coldboot_stuck.txt’

I do not know the actual error, but it is listed as a bus error, which means it is at the level of the PHY and not the driver or software which actually works with the FPGA. If you run the command “lspci”, it will show a list of devices on the PCI bus, each line starting with a number which identifies the slot of the device. For example, you might see something like this (the starting number format is what matters):
68:00.0 VGA compatible controller: NVIDIA Corporation Device 1e02 (rev a1)

In the above I will use “68:00.0” as an example. If you hit the case where you get an error but can still log in, run “lspci” to find your FPGA, then rerun the command limited to just that slot and fully verbose (substitute your actual device slot for my example), and post the output. The command below also logs its output to a file:
sudo lspci -s 68:00.0 -vvv 2>&1 | tee log_lspci.txt
(then post the content of log_lspci.txt)

That log should offer an increase in the detail of what the PCIe driver thinks is going wrong. Note that the issue could be software or a signal or other hardware issue.

Here’s the log -

01:00.0 Serial controller: Xilinx Corporation Device 8024 (prog-if 01 [16450])
        Subsystem: Xilinx Corporation Device 0007
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 408
        Region 0: Memory at 14000000 (32-bit, non-prefetchable) [size=64M]
        Region 1: Memory at 18000000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: [80] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000ffeff000  Data: 0001
        Capabilities: [c0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range B, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: xdma
        Kernel modules: xdma

I found that adding the following kernel command line argument suppresses printing of the error message, allowing bootup to complete and reach the login prompt:

pci=noaer

Your first AER pointer is NULL (00), so the PHY level is not the issue for anything related to this PCIe device. Also, the signal is probably good, as indicated by both the capability to run at 5GT/s and actual operation achieving this. Had there been errors, then probably it would have backed off to 2.5GT/s. However, I don’t think the AER would know about issues prior to switching from boot loader stages to the Linux kernel, but since the signal seems to be good I have no reason to believe boot loader stages would have any PHY issues.
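
For comparing what a link is capable of (LnkCap) against what it actually negotiated (LnkSta), here is a rough sketch that pulls both out of saved “lspci -vvv” text. The function names are hypothetical and the regular expression only targets the line format seen in the logs in this thread. Note that in the endpoint log above LnkSta reports Width x1 against a x4 LnkCap; that may simply reflect how many lanes are wired on the carrier board, so a mismatch is a prompt to look closer, not proof of a fault.

```python
import re

# Matches e.g. "LnkCap: Port #0, Speed 5GT/s, Width x4" and
# "LnkSta: Speed 2.5GT/s, Width x1"; "LnkSta2:" is deliberately not matched.
LINK_RE = re.compile(r"(LnkCap|LnkSta):\s.*?Speed\s+([\d.]+)GT/s,\s+Width\s+x(\d+)")


def link_summary(lspci_text: str) -> dict:
    """Map 'LnkCap'/'LnkSta' to (speed in GT/s, lane count), first match wins."""
    found = {}
    for m in LINK_RE.finditer(lspci_text):
        found.setdefault(m.group(1), (float(m.group(2)), int(m.group(3))))
    return found


def link_degraded(summary: dict) -> bool:
    """True if negotiated speed or width is below the advertised capability."""
    cap, sta = summary["LnkCap"], summary["LnkSta"]
    return sta[0] < cap[0] or sta[1] < cap[1]


endpoint = (
    "LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM not supported\n"
    "LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk+\n"
)
summary = link_summary(endpoint)
print(summary, "degraded:", link_degraded(summary))
```

Feed it one device at a time (for example the content of log_lspci.txt), since only the first LnkCap and LnkSta in the text are kept.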

About all I can do is conclude that the issue has to do with the end point driver and not the PHY driver (not the xdma serialization…but I don’t know enough about this FPGA to say if xdma also handles more than serialization). Basically there are two stages: Talking to the FPGA over PCIe, plus what the FPGA does internally. It is the software running internally, and not that related to communications, which I suspect is an issue. That issue could be as simple as timing during boot, or it could be an actual software bug, and I have no way of narrowing that down. You are probably correct to look at ASPM, but I again have no way to verify this.

The issue is seen even when FPGA’s device driver is not loaded during bootup. I deleted all device driver files from Nano and still see error so I’m guessing at this point it’s not related to device driver. Let me try to check if there is any issue with timing of fundamental reset etc being out of spec.
Meanwhile I will continue to use pci=noaer argument to enable development and debug.

I don’t see any log information for PCI prior to the Linux kernel booting, but booting has changed for newer UEFI versus older custom boot, so it may be there is PCI going on prior to the Linux kernel, but not logged. Question for NVIDIA: Is there a way to increase any kind of PCI boot logging in UEFI stages?

I could be looking at the wrong thing, but with the error message only showing up after Linux loads, and with no AER pointer other than 0x00, my conclusion is that the issue is not at the PHY level. Assuming a removed driver still shows errors, I would have to wonder why it did not show errors also during earlier boot stages if it was not dependent upon the driver. Thus the question about increasing PCI log verbosity with the newer UEFI boot content.

Note: The verbose lspci does not show the error. Perhaps we are looking at the wrong PCIe slot for the verbose lspci?

If you look at the logs, the error is reported on port 00:01.0:

[  895.428609] pcieport 0000:00:01.0:   device [10de:0fae] error status/mask=00000001/00002000
[  895.428612] pcieport 0000:00:01.0:    [ 0] Receiver Error         (First)
[  895.428832] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0008(Receiver ID)

The device at this address is an NVIDIA device (the root port itself):

00:01.0 PCI bridge: NVIDIA Corporation Device 0fae (rev a1) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 84
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 0000f000-00000fff
	Memory behind bridge: 14000000-19ffffff
	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Subsystem: NVIDIA Corporation Device 0000
	Capabilities: [48] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/2 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [60] HyperTransport: MSI Mapping Enable- Fixed-
		Mapping Address Base: 00000000fee00000
	Capabilities: [80] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0
			ExtTag+ RBE+
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr+ FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt+ ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
			Slot #0, PowerLimit 0.000W; Interlock- NoCompl-
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
			Control: AttnInd Off, PwrInd On, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
			Changed: MRL- PresDet+ LinkState+
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
		RootCap: CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 0e, GenCap+ CGenEn- ChkCap+ ChkEn-
	Kernel driver in use: pcieport

Looks like the first verbose lspci happened when there was no error, but this particular most recent lspci does show a non-NULL first AER error pointer. And, as you mention, the particular device seems to be verified. I should have also looked closer at this serial log:

[    1.169093] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0008(Receiver ID)
[    1.179392] pcieport 0000:00:01.0:   device [10de:0fae] error status/mask=00000001/00002000
[    1.187756] pcieport 0000:00:01.0:    [ 0] Receiver Error         (First)

…which explicitly states that it is the PHY. Apparently the issue is actual signal quality since there was also a verbose lspci without error. I doubt ASPM would change signal quality unless there was some change in signal quality when power comes back up (i.e., there is some very remote possibility that power delivery is inconsistent and causes signal quality changes, but I doubt that is the case since there are times the power and signal seem to be sufficient).
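
To make the “error status/mask=00000001/00002000” field in that log less opaque, here is a small decoding sketch. The bit positions are those of the AER Correctable Error Status register as I understand the PCIe specification; the table and function names are my own, so double-check them against the spec before relying on them.

```python
# Bit positions in the AER Correctable Error Status / Mask registers.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "Replay Number Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode_correctable(status_mask: str):
    """Split 'status/mask' hex text into (reported names, masked names)."""
    status_hex, mask_hex = status_mask.split("/")
    status, mask = int(status_hex, 16), int(mask_hex, 16)
    reported = [n for bit, n in CORRECTABLE_BITS.items() if (status >> bit) & 1]
    masked = [n for bit, n in CORRECTABLE_BITS.items() if (mask >> bit) & 1]
    return reported, masked


reported, masked = decode_correctable("00000001/00002000")
print(reported, masked)
# prints: ['Receiver Error'] ['Advisory Non-Fatal Error']
```

This agrees with the kernel's own text in the log ("[ 0] Receiver Error") and with the "NonFatalErr+" entry on the CEMsk line of the root port's lspci output.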

I am curious about this: If ASPM is disabled, and you reboot and have an error, is it possible that after many more reboots there is at least one boot when the FPGA works again? I am curious if a number of reboots might get lucky and work at least once. Not that it would solve the issue, but it would tend to verify signal quality (you could use a PCIe bus analyzer to find out, but that’s exceedingly expensive and rare).

Please correct me if my understanding is wrong -
Does the + or - after the flags in the lspci report actually indicate presence or absence of errors?
Looking at lspci log for endpoint and for root port, it seems that errors are seen in both directions RC->Endpoint and Endpoint->RC, correct?
Is there any documentation that describes the flags listed in lspci log?

This is a complicated topic, and some good details are found here:
http://trac.gateworks.com/wiki/PCI

For AER, the first error pointer refers to the actual uncorrectable error register bits. If it is NULL, then no uncorrectable bits are set. The URL above describes the specifics of the remaining fields, but the gist is that they indicate what is enabled, or capable of being enabled, for error bits. Just remember that the NULL ("00") first pointer is the indication of whether or not any available error bits were set (reporting may not be enabled for all registers).
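
On the + and - question specifically: in lspci output a trailing + means the corresponding register bit is set and a trailing - means it is clear. Whether a set bit means an error depends on the register. In status registers (DevSta, UESta, CESta) a + records that an event was logged, while in capability, control, and mask registers it only says that something is supported, enabled, or masked. A minimal sketch for pulling the set flags out of one line, with hypothetical helper names:

```python
def parse_flags(line: str) -> dict:
    """Map each 'Name+' / 'Name-' token after the first colon to True / False."""
    _, _, rest = line.partition(":")
    flags = {}
    for token in rest.split():
        if len(token) > 1 and token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return flags


def set_flags(line: str) -> list:
    """Names of the flags that are set (+) on this line."""
    return [name for name, is_set in parse_flags(line).items() if is_set]


# Lines taken from the root port (00:01.0) output posted above.
print(set_flags("DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq- AuxPwr- TransPend-"))
print(set_flags("CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-"))
```

For the root port log, the first line yields CorrErr and UncorrErr (events were logged in the device status), while the CESta line yields nothing set at the moment lspci was run.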