Self-encrypting SSD boot crash

We use a self-encrypting SSD (SED) as external storage, connected to the M.2 slot. The SSD does not contain any data, but UEFI crashes during boot. With logging enabled for the UEFI binary, the crash does not happen at one specific error or moment. When the PCIe controller is disabled in UEFI (PcieControllerDxe), UEFI no longer crashes, but the boot often crashes in the Linux kernel instead, and when it doesn’t crash, the SSD isn’t available (see last log).

Are there specific configurations that have to be enabled for SED support? Our SSD uses the TCG Opal 2.0 protocol.

In the Linux kernel, the boot often crashes during PCIe initialization (these are the last log lines):

[    3.906169] Bluetooth: RFCOMM socket layer initialized
[    3.910456] Bluetooth: RFCOMM ver 1.11
[    3.913987] Bluetooth: HIDP (Human Interface Emulation) ver 1.2
[    3.919985] Bluetooth: HIDP socket layer initialized
[    3.926208] 9pnet: Installing 9P2000 support
[    3.929525] Key type dns_resolver registered
[    3.935146] registered taskstats version 1
[    3.937979] Loading compiled-in X.509 certificates
[    3.942875] Key type ._fscrypt registered
[    3.946620] Key type .fscrypt registered
[    3.950814] Key type fscrypt-provisioning registered
[    3.957898] tegra194-pcie 14180000.pcie: Adding to iommu group 7
[    3.968588] tegra194-pcie 14180000.pcie: host bridge /pcie@14180000 ranges:
[    3.969054] tegra194-pcie 14180000.pcie:       IO 0x0038100000..0x00381fffff -> 0x0038100000
[    3.977236] tegra194-pcie 14180000.pcie:      MEM 0x1800000000..0x1b3fffffff -> 0x1800000000
[    3.985653] tegra194-pcie 14180000.pcie:      MEM 0x1b40000000..0x1bffffffff -> 0x0040000000

Sometimes, there is no crash during the boot, but the SSD isn’t recognised. The following errors are shown in the logs:

[    4.082859] tegra194-pcie 14100000.pcie: Adding to iommu group 8
[    4.085097] tegra194-pcie 14100000.pcie: host bridge /pcie@14100000 ranges:
[    4.085364] tegra194-pcie 14100000.pcie:       IO 0x0030100000..0x00301fffff -> 0x0030100000
[    4.085621] tegra194-pcie 14100000.pcie:      MEM 0x1200000000..0x122fffffff -> 0x1200000000
[    4.085869] tegra194-pcie 14100000.pcie:      MEM 0x1230000000..0x123fffffff -> 0x0040000000
[    4.103456] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.103707] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[    4.107843] pcieport 0000:00:00.0:   device [10de:1ad0] error status/mask=00000001/0000e000
[    4.116239] pcieport 0000:00:00.0:    [ 0] RxErr                  (First)
[    4.123158] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.131031] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[    4.140749] pcieport 0000:00:00.0:   device [10de:1ad0] error status/mask=00000001/0000e000
[    4.148949] pcieport 0000:00:00.0:    [ 0] RxErr                  (First)
[    4.155540] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.163748] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.169948] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.177758] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[    4.187465] pcieport 0000:00:00.0:   device [10de:1ad0] error status/mask=00000001/0000e000
[    4.196129] pcieport 0000:00:00.0:    [ 0] RxErr                  (First)
[    4.202883] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.210945] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[    4.220633] pcieport 0000:00:00.0:   device [10de:1ad0] error status/mask=00000001/0000e000
[    4.229016] pcieport 0000:00:00.0:    [ 0] RxErr                  (First)
[    4.235754] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.242845] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.249053] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.256656] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[    4.266125] pcieport 0000:00:00.0:   device [10de:1ad0] error status/mask=00000001/0000e000
[    4.274442] pcieport 0000:00:00.0:    [ 0] RxErr                  (First)
[    4.281448] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.289477] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.295515] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.303936] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.310163] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.318028] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[    4.327904] pcieport 0000:00:00.0:   device [10de:1ad0] error status/mask=00000001/0000e000
[    4.336213] pcieport 0000:00:00.0:    [ 0] RxErr                  (First)
[    4.342811] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.350142] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[    4.360008] pcieport 0000:00:00.0:   device [10de:1ad0] error status/mask=00000001/0000e000
[    4.368440] pcieport 0000:00:00.0:    [ 0] RxErr                  (First)
[    4.374889] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.382535] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[    4.391894] pcieport 0000:00:00.0:   device [10de:1ad0] error status/mask=00000001/0000e000
[    4.400458] pcieport 0000:00:00.0:    [ 0] RxErr                  (First)
[    4.407265] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.415458] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.421623] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.428908] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.435220] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.442262] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.448555] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.456609] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.463108] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.470429] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.476751] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.483745] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.490026] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.498344] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.504375] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.511903] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.517941] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.525474] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.531505] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.539572] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.545855] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.553386] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.559696] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.566700] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.572980] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.581301] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.587594] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.594853] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.600923] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.608414] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.614472] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.621998] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.628017] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.636330] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.642624] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.650431] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.656715] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.664242] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.670280] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.678334] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.684892] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.692684] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.699239] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.706247] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.712811] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.720854] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.727147] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.734171] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.740472] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.747977] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.754290] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    4.761541] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.767835] pcieport 0000:00:00.0: AER: Multiple Corrected error received: 0000:00:00.0
[    4.775660] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    4.782188] pcieport 0000:00:00.0: AER: Uncorrected (Fatal) error received: 0000:00:00.0
[    4.789998] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[    4.800773] pcieport 0000:00:00.0:   device [10de:1ad0] error status/mask=00000020/00400000
[    4.809072] pcieport 0000:00:00.0:    [ 5] SDES                   (First)
[    5.203386] tegra194-pcie 14100000.pcie: Phy link never came up
[    5.203843] tegra194-pcie 14100000.pcie: PCI host bridge to bus 0001:00
[    5.204001] pci_bus 0001:00: root bus resource [bus 00-ff]
[    5.204142] pci_bus 0001:00: root bus resource [io  0x100000-0x1fffff] (bus address [0x30100000-0x301fffff])
[    5.204341] pci_bus 0001:00: root bus resource [mem 0x1200000000-0x122fffffff pref]
[    5.204497] pci_bus 0001:00: root bus resource [mem 0x1230000000-0x123fffffff] (bus address [0x40000000-0x4fffffff])
[    5.204789] pci 0001:00:00.0: [10de:1ad2] type 01 class 0x060400
[    5.205105] pci 0001:00:00.0: PME# supported from D0 D3hot D3cold
[    5.215409] pci 0001:00:00.0: PCI bridge to [bus 01-ff]
[    5.215714] pci 0001:00:00.0: Max Payload Size set to  256/ 256 (was  256), Max Read Rq  512
[    5.216637] pcieport 0001:00:00.0: Adding to iommu group 8
[    5.217217] pcieport 0001:00:00.0: PME: Signaling with IRQ 26
[    5.217880] pcieport 0001:00:00.0: AER: enabled with IRQ 26
[    5.218593] pci_bus 0001:01: busn_res: [bus 01-ff] is released
[    5.219168] pci 0001:00:00.0: Removing from iommu group 8
[    5.219494] pci_bus 0001:00: busn_res: [bus 00-ff] is released
[    5.220875] tegra194-pcie 14140000.pcie: Adding to iommu group 9
[    5.226949] tegra194-pcie 14140000.pcie: host bridge /pcie@14140000 ranges:
[    5.231903] tegra194-pcie 14140000.pcie:       IO 0x0034100000..0x00341fffff -> 0x0034100000
[    5.240326] tegra194-pcie 14140000.pcie:      MEM 0x1280000000..0x12afffffff -> 0x1280000000
[    5.248964] tegra194-pcie 14140000.pcie:      MEM 0x12b0000000..0x12bfffffff -> 0x0040000000
[    5.363486] tegra194-pcie 14140000.pcie: Link up
[    5.365693] tegra194-pcie 14140000.pcie: PCI host bridge to bus 0003:00
[    5.365890] pci_bus 0003:00: root bus resource [bus 00-ff]
[    5.366036] pci_bus 0003:00: root bus resource [io  0x200000-0x2fffff] (bus address [0x34100000-0x341fffff])
[    5.366284] pci_bus 0003:00: root bus resource [mem 0x1280000000-0x12afffffff pref]
[    5.366533] pci_bus 0003:00: root bus resource [mem 0x12b0000000-0x12bfffffff] (bus address [0x40000000-0x4fffffff])
[    5.366892] pci 0003:00:00.0: [10de:1ad2] type 01 class 0x060400
[    5.367293] pci 0003:00:00.0: PME# supported from D0 D3hot D3cold
[    5.375602] pci 0003:01:00.0: [8086:15f2] type 00 class 0x020000
[    5.375951] pci 0003:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff]
[    5.376323] pci 0003:01:00.0: reg 0x1c: [mem 0x00000000-0x00003fff]
[    5.377536] pci 0003:01:00.0: PME# supported from D0 D3hot D3cold
[    5.385764] pci 0003:00:00.0: BAR 14: assigned [mem 0x12b0000000-0x12b01fffff]
[    5.385999] pci 0003:01:00.0: BAR 0: assigned [mem 0x12b0000000-0x12b00fffff]
[    5.386252] pci 0003:01:00.0: BAR 3: assigned [mem 0x12b0100000-0x12b0103fff]
[    5.386539] pci 0003:00:00.0: PCI bridge to [bus 01-ff]
[    5.386701] pci 0003:00:00.0:   bridge window [mem 0x12b0000000-0x12b01fffff]
[    5.388511] pci 0003:00:00.0: Max Payload Size set to  256/ 256 (was  256), Max Read Rq  512
[    5.396759] pci 0003:01:00.0: Max Payload Size set to  256/ 512 (was  128), Max Read Rq  512
[    5.406280] pcieport 0003:00:00.0: Adding to iommu group 9
[    5.411410] pcieport 0003:00:00.0: PME: Signaling with IRQ 28
[    5.417162] pcieport 0003:00:00.0: AER: enabled with IRQ 28
[    5.423035] igc 0003:01:00.0: Adding to iommu group 9
[    5.427929] igc 0003:01:00.0: enabling device (0000 -> 0002)
[    5.494897] igc 0003:01:00.0 (unnamed net_device) (uninitialized): PHC added
[    5.559891] igc 0003:01:00.0: 4.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x1 link)
[    5.560115] igc 0003:01:00.0 eth0: MAC: 00:0f:11:57:15:3a
[    5.561206] tegra194-pcie 141a0000.pcie: Adding to iommu group 10
[    5.562227] tegra194-pcie 141a0000.pcie: Failed to get slot regulators: -517
[    5.585003] host1x 13e10000.host1x: Adding to iommu group 11
[    5.602009] host1x 13e10000.host1x: initialized
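
For anyone triaging a log like the one above, a rough tally of the AER messages can make the pattern easier to see (many corrected errors, then one fatal error, then a link that never comes up). This is only a sketch: the function name aer_summary and the file name boot.log are made up for illustration, and the match strings are simply the message texts that appear in this log.

```shell
#!/bin/sh
# Tally PCIe AER messages from a kernel log read on stdin.
# The patterns below are the literal message texts seen in the log above.
aer_summary() {
    awk '
        /AER: Multiple Corrected error received/    { corrected++; next }
        /AER: Corrected error received/             { corrected++; next }
        /AER: Uncorrected \(Fatal\) error received/ { fatal++; next }
        /Phy link never came up/                    { linkdown++; next }
        END {
            # Unset counters print as 0 with %d.
            printf "corrected=%d fatal=%d link_never_up=%d\n", corrected, fatal, linkdown
        }
    '
}
```

Possible usage, assuming the boot log was saved first: dmesg > boot.log, then aer_summary < boot.log.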

hello user148820,

Sorry, we do not have an SED (self-encrypting drive) available for local testing.
Does this device use the PCIe interface?

Yes, it is a PCIe Gen 4 M.2 SSD (we use the Micron 7450). It should use the “nvme” driver on Linux, as it is an NVMe drive.
Other NVMe SSD drives work (like the Samsung 970 Pro), but the Micron 7450 SSD causes the crash during boot.

I can’t help directly, but I will point out that AER (Advanced Error Reporting) is a PCIe mechanism, and is probably unrelated to the actual drive being communicated with. Sometimes something like power consumption will have an effect on PCIe, so if you have another PCIe device for that slot, especially one that uses less power, you might test that and see if PCIe functions. Also, test the M.2 device on another system, and see if it reports any AER issue.

Are you testing this on a devkit or on a custom board?

This is tested on a custom board. But we’ve noticed that the problem is not the SED drive, as we tested a different drive which did work. So we are going to test for cross-powering to check whether the issue lies in the hardware.

Please check the PMIC configuration for the custom board

I have also tested the Micron 7300 Pro M.2 SSD on the NVIDIA Jetson AGX Xavier devkit, but the PCIe link stays down and the SSD is not recognised there either (lspci doesn’t show any devices). So this is not a problem specific to our custom board.

As other SSDs work properly (like the Samsung 980 Pro and Samsung 970 Pro), it seems to be an issue with specific Micron SSDs. It is also not caused by the self-encryption, because the Samsung 970 Pro and Samsung 980 Pro are SED drives as well (and support the Opal 2.0 protocol).

Any idea what else it could be?

I’ve placed some debugging in UEFI, and it seems to crash at the same spot as in this post: Jetson UEFI firmware hangs on custom carrier board - #51 by WayneWWW.
The advice there was to disable that PCIe lane in UEFI. This fixes the PCIe crash in UEFI (of course), but the Jetson still often crashes during Linux kernel boot. Also, when it does boot successfully, the NVMe SSD still does not show up with “lsblk” or “lspci”.

It doesn’t seem to be an NVMe issue, as the driver doesn’t seem to be loaded yet (the system already crashes at the PCIe controller). @WayneWWW any idea how this can be solved?

Hi,

I know nothing about the device in use here.

It is also not possible to debug your issue with this information alone. Please try to reproduce the issue on an NVIDIA devkit with the same drive, or simplify your use case to a common NVMe drive and see whether it still reproduces.

Not every M.2 device is PCIe, which seems odd, but to see what should be there, do you have another Linux computer you could plug this into, so you can capture the lspci content there for reference?

Incidentally, the plain “lspci” command (without options) lists a slot ID for each device. At the left side of that output you’ll see something in the form of “01:00.0”. You can limit a query to that specific ID, e.g.:
lspci -s '01:00.0'

Then you can add fully verbose query (this requires sudo):
sudo lspci -s '01:00.0' -vvv

You can even log this:
sudo lspci -s '01:00.0' -vvv 2>&1 | tee log_lspci.txt

If you are in some way able to get the PCI information from another Linux computer, then it might verify a few things (especially the fully verbose version).

Yes, I have that information here logged from a different PC where the drive works properly:

01:00.0 Non-Volatile memory controller: Micron Technology Inc Device 51c3 (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Micron Technology Inc Device 3e00
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 16
	NUMA node: 0
	BIST result: 00
	Region 0: Memory at df440000 (64-bit, non-prefetchable) [size=256K]
	Expansion ROM at df400000 [disabled] [size=256K]
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable- Count=1/1 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [b0] MSI-X: Enable+ Count=256 Masked-
		Vector table: BAR=0 offset=00020000
		PBA: BAR=0 offset=00023000
	Capabilities: [c0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 10.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <256ns, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (downgraded), Width x1 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range B, TimeoutDis+, NROPrPrP-, LTR+
			 10BitTagComp+, 10BitTagReq-, OBFF Via message, ExtFmt+, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [1b8 v1] Latency Tolerance Reporting
		Max snoop latency: 3145728ns
		Max no snoop latency: 3145728ns
	Capabilities: [300 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
		LaneErrStat: 0
	Capabilities: [920 v1] Lane Margining at the Receiver <?>
	Capabilities: [9c0 v1] Physical Layer 16.0 GT/s <?>
	Kernel driver in use: nvme
	Kernel modules: nvme

So this shows it’s an NVMe drive. I tried adding logging to the NVMe driver in the Linux kernel on the NVIDIA Jetson, but noticed that the driver doesn’t seem to be loaded at all (the error occurs in the PCIe controller, not in the NVMe driver).

Just to clarify: what exactly is the situation you want to ask about here?

Honestly, I can only debug this issue on a devkit, and that is the only setup I want to debug it on.

If your issue is that an NVMe SSD from one brand is not detected on the devkit, please file a new topic and share the dmesg and lspci -vvv results from the devkit.

Yeah, that’s the issue. I’ve made a different post with dmesg and lspci -vvv from the devkit. It can be found here: Micron SSD not recognized by Jetson AGX Xavier

This isn’t anything definitive, but some observations based on the verbose lspci on the other computer…

This device is capable of PCIe Gen 4 (16GT/s), but on that system, signal quality (or the capabilities of the socket) limits it to PCIe Gen 3 speeds (8GT/s). That’s still pretty good.

It has advanced error reporting (AER), so it might recover from or report some errors even if it fails.
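
The capable-versus-negotiated comparison can be pulled straight out of the verbose listing. In the log above, LnkCap reports 16GT/s at x4, while LnkSta reports 8GT/s at x1, so both speed and width are downgraded on that PC. A minimal sketch, assuming the lspci -vvv text is piped in on stdin; the function name link_summary is made up for illustration:

```shell
# Print the capable (LnkCap) and negotiated (LnkSta) link speed and width
# from "lspci -vvv" text read on stdin. The trailing colon in the patterns
# keeps LnkCap2/LnkSta2 lines from matching.
link_summary() {
    sed -n \
        -e 's/.*LnkCap:.*Speed \([0-9.]*GT\/s\).*Width \(x[0-9]*\).*/capable: \1 \2/p' \
        -e 's/.*LnkSta:.*Speed \([0-9.]*GT\/s\).*Width \(x[0-9]*\).*/negotiated: \1 \2/p'
}
```

Possible usage on the reference PC: sudo lspci -s '01:00.0' -vvv | link_summary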

The content within your device uses the “nvme” driver. You might want to post this from the Jetson:
zcat /proc/config.gz | grep NVME
(you should see, at a minimum, an enabled CONFIG_NVME_CORE and CONFIG_BLK_DEV_NVME)
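
If you would rather get an explicit pass/fail than read the grep output by eye, something like the following might do. It is only a sketch: the function name check_nvme_config is made up, and it treats both =y (built in) and =m (module) as enabled.

```shell
# Check kernel config text (read on stdin) for the two NVMe options
# named above. Returns non-zero if either one is not set to y or m.
check_nvme_config() {
    cfg=$(cat)
    missing=0
    for opt in CONFIG_NVME_CORE CONFIG_BLK_DEV_NVME; do
        if printf '%s\n' "$cfg" | grep -Eq "^${opt}=(y|m)$"; then
            echo "$opt: enabled"
        else
            echo "$opt: MISSING"
            missing=1
        fi
    done
    return "$missing"
}
```

Possible usage on the Jetson: zcat /proc/config.gz | check_nvme_config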

However, the NVMe drivers won’t matter if PCIe itself fails. The reason I asked for this from a second computer was to verify the specs, functionality, and drivers there. If you have those drivers on the Jetson (it sounds like you do), what remains is PCIe, and I am thinking this is a PCIe issue. The fully verbose “sudo lspci -vvv” is needed now from the Jetson. You don’t have to limit it to one device, since the slot number will differ and it might be useful to see the bridge, but it is best as an attached file. On the Jetson you can run this for the fully verbose listing and get a log file which you can attach to the forums:
sudo lspci -vvv 2>&1 | tee log_lspci.txt

Thanks! I’ve found the issue. Apparently it was the PCIe CLKREQ pin that caused the problems. We had measured the clock before and it was correctly running at 100 MHz, but maybe the initial state of the pin caused the problem, or the fact that the low-power driver enable was disabled. We’ve fixed it by changing the pinmux configuration for this pin. The odd thing is that it worked properly with the other SSDs, so the Micron SSDs apparently have stricter requirements for this CLKREQ pin.
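
For anyone hitting something similar: one way to check whether a change like this helped is to compare the per-controller link state in the kernel log before and after. A rough sketch, assuming the log is piped in on stdin; the function name pcie_link_state is made up, and the two message texts are the ones the tegra194-pcie driver printed in the logs earlier in this thread.

```shell
# List each tegra194-pcie controller with "up" or "down", based on the
# "Link up" and "Phy link never came up" messages in a kernel log on stdin.
pcie_link_state() {
    sed -n \
        -e 's/.*tegra194-pcie \([0-9a-f]*\.pcie\): Link up.*/\1 up/p' \
        -e 's/.*tegra194-pcie \([0-9a-f]*\.pcie\): Phy link never came up.*/\1 down/p'
}
```

Possible usage: dmesg | pcie_link_state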
