Lost NVMe SSD on L4T 31.1.0 upgrade

After using JetPack 4.1.1 to upgrade from L4T 31.0.2 to L4T 31.1.0, the NVMe SSD installed in the M.2 Key M slot is no longer available/visible. The disk itself is a Western Digital 500 GB NVMe SSD (Amazon link).

What’s the procedure to hunt this down?

Oops…
I think I'll wait to update until we get an answer on this. I'm using a 250 GB Samsung 970 EVO with all my work on it, so I don't want to lose it :)

Have you checked if “/proc/config.gz” shows:

CONFIG_BLK_DEV_NVME=y

…maybe it is just an option left out.
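
One way to check without extracting the file (a quick sketch; zcat and grep are on stock L4T):

# List the NVMe-related kernel config options:
zcat /proc/config.gz | grep NVME
# With the driver built in, you should see at least:
# CONFIG_BLK_DEV_NVME=y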

Thank you for the suggestion.
config.gz shows:

CONFIG_NVME_CORE=y
CONFIG_BLK_DEV_NVME=y

That does not appear to be the issue.

Interesting. I tried booting several times after powering down. The SSD did not show up.

After powering down, I inserted a USB thumb drive and booted the machine; the SSD showed up, but the USB thumb drive did not. After powering down and booting again, the SSD showed up, but the thumb drive still did not. After unplugging and replugging the USB drive, it finally appeared. Now both the SSD and the USB thumb drive are available.

It would be interesting to have boot logs for cases where both devices are connected and (a) the thumb drive doesn't show up, and (b) the SSD doesn't show up. Putting them side by side might reveal a pattern. Also, perhaps the log from unplugging and replugging one of the missing devices.
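
Something like this would capture the logs for comparison (a sketch; the file names are just examples):

# On a boot where a device is missing:
dmesg > ~/boot-missing.txt
# After unplugging and replugging the missing device:
dmesg > ~/after-replug.txt
# Compare the two side by side:
diff ~/boot-missing.txt ~/after-replug.txt | less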

What does “lsusb” show when either device is missing? If lsusb shows both devices, then you know the PHY and controller see the device, but a driver isn’t loading. If lsusb does not show both devices, then perhaps something is preventing the controller itself from working. “lsusb -t” might also be interesting if lsusb shows both devices since it would mention what speed they operate at.
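
For reference, the queries would look something like this:

lsusb      # one line per device the controller enumerated
lsusb -t   # tree view, including the negotiated speed of each device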

Is this simply a case of a missing /etc/fstab entry?

What does “sudo lsblk” show?
Can you mount the device manually?
Something like “sudo mount /dev/nvme0n1p1 /mnt” (or whatever the partition number is on your M.2 drive).

If you can mount it manually, then adding it to /etc/fstab should make it come back each time you boot. /etc/fstab is unfortunately blown away by each jetpack update.
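
As a sketch, assuming a hypothetical ext4 partition at /dev/nvme0n1p1 mounted at /mnt/ssd (adjust both to your setup):

# Mount manually to confirm the drive works:
sudo mkdir -p /mnt/ssd
sudo mount /dev/nvme0n1p1 /mnt/ssd

# Matching /etc/fstab line so it comes back on each boot:
/dev/nvme0n1p1   /mnt/ssd   ext4   defaults   0   2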

@snarky After flashing, lsblk:

NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0          7:0    0    16M  1 loop 
mmcblk0      179:0    0  29.1G  0 disk 
├─mmcblk0p1  179:1    0    28G  0 part /
├─mmcblk0p2  179:2    0   500K  0 part 
├─mmcblk0p3  179:3    0   500K  0 part 
├─mmcblk0p4  179:4    0     4M  0 part 
├─mmcblk0p5  179:5    0     4M  0 part 
├─mmcblk0p6  179:6    0   512K  0 part 
├─mmcblk0p7  179:7    0   512K  0 part 
├─mmcblk0p8  179:8    0   384K  0 part 
├─mmcblk0p9  179:9    0   384K  0 part 
├─mmcblk0p10 179:10   0     2M  0 part 
├─mmcblk0p11 179:11   0     2M  0 part 
├─mmcblk0p12 179:12   0   128K  0 part 
├─mmcblk0p13 179:13   0     1M  0 part 
├─mmcblk0p14 179:14   0     1M  0 part 
├─mmcblk0p15 179:15   0     1M  0 part 
├─mmcblk0p16 179:16   0     1M  0 part 
├─mmcblk0p17 179:17   0   256K  0 part 
├─mmcblk0p18 179:18   0   256K  0 part 
├─mmcblk0p19 179:19   0   512K  0 part 
├─mmcblk0p20 179:20   0   512K  0 part 
├─mmcblk0p21 179:21   0     4M  0 part 
├─mmcblk0p22 179:22   0     4M  0 part 
├─mmcblk0p23 179:23   0   512K  0 part 
├─mmcblk0p24 179:24   0   512K  0 part 
├─mmcblk0p25 179:25   0     6M  0 part 
├─mmcblk0p26 179:26   0     6M  0 part 
├─mmcblk0p27 179:27   0   128M  0 part 
├─mmcblk0p28 179:28   0   128M  0 part 
├─mmcblk0p29 179:29   0    64M  0 part 
├─mmcblk0p30 179:30   0    64M  0 part 
├─mmcblk0p31 179:31   0   512K  0 part 
├─mmcblk0p32 259:0    0   512K  0 part 
├─mmcblk0p33 259:1    0     1M  0 part 
├─mmcblk0p34 259:2    0     8M  0 part 
├─mmcblk0p35 259:3    0     8M  0 part 
└─mmcblk0p36 259:4    0 708.6M  0 part 
mmcblk0boot0 179:32   0     8M  1 disk 
mmcblk0boot1 179:64   0     8M  1 disk 
mmcblk0rpmb  179:96   0     4M  0 disk

After cold booting several times:

NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0          7:0    0    16M  1 loop 
mmcblk0      179:0    0  29.1G  0 disk 
├─mmcblk0p1  179:1    0    28G  0 part /
├─mmcblk0p2  179:2    0   500K  0 part 
├─mmcblk0p3  179:3    0   500K  0 part 
├─mmcblk0p4  179:4    0     4M  0 part 
├─mmcblk0p5  179:5    0     4M  0 part 
├─mmcblk0p6  179:6    0   512K  0 part 
├─mmcblk0p7  179:7    0   512K  0 part 
├─mmcblk0p8  179:8    0   384K  0 part 
├─mmcblk0p9  179:9    0   384K  0 part 
├─mmcblk0p10 179:10   0     2M  0 part 
├─mmcblk0p11 179:11   0     2M  0 part 
├─mmcblk0p12 179:12   0   128K  0 part 
├─mmcblk0p13 179:13   0     1M  0 part 
├─mmcblk0p14 179:14   0     1M  0 part 
├─mmcblk0p15 179:15   0     1M  0 part 
├─mmcblk0p16 179:16   0     1M  0 part 
├─mmcblk0p17 179:17   0   256K  0 part 
├─mmcblk0p18 179:18   0   256K  0 part 
├─mmcblk0p19 179:19   0   512K  0 part 
├─mmcblk0p20 179:20   0   512K  0 part 
├─mmcblk0p21 179:21   0     4M  0 part 
├─mmcblk0p22 179:22   0     4M  0 part 
├─mmcblk0p23 179:23   0   512K  0 part 
├─mmcblk0p24 179:24   0   512K  0 part 
├─mmcblk0p25 179:25   0     6M  0 part 
├─mmcblk0p26 179:26   0     6M  0 part 
├─mmcblk0p27 179:27   0   128M  0 part 
├─mmcblk0p28 179:28   0   128M  0 part 
├─mmcblk0p29 179:29   0    64M  0 part 
├─mmcblk0p30 179:30   0    64M  0 part 
├─mmcblk0p31 179:31   0   512K  0 part 
├─mmcblk0p32 259:0    0   512K  0 part 
├─mmcblk0p33 259:1    0     1M  0 part 
├─mmcblk0p34 259:2    0     8M  0 part 
├─mmcblk0p35 259:3    0     8M  0 part 
└─mmcblk0p36 259:4    0 708.6M  0 part 
mmcblk0boot0 179:32   0     8M  1 disk 
mmcblk0boot1 179:64   0     8M  1 disk 
mmcblk0rpmb  179:96   0     4M  0 disk 
nvme0n1      259:5    0 465.8G  0 disk 
└─nvme0n1p1  259:6    0 465.8G  0 part

It shows up as:

nvme0n1      259:5    0 465.8G  0 disk 
└─nvme0n1p1  259:6    0 465.8G  0 part

After flashing, there is no /dev/nvme*.
Once the device is found:

/dev/nvme0
/dev/nvme0n1
/dev/nvme0n1p1
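
A quick check for the nodes after each boot is something like:

ls /dev/nvme* 2>/dev/null || echo "no NVMe device nodes"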

lspci:

After flash:

0001:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad2 (rev a1)
0001:01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9171 (rev 13)

After cold booting and eventually working:

0000:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0000:01:00.0 Non-Volatile memory controller: Sandisk Corp Device 5002
0001:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad2 (rev a1)
0001:01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9171 (rev 13)

In all cases, /etc/fstab is the same:

# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# NVIDIA CORPORATION and its licensors retain all intellectual property
# and proprietary rights in and to this software, related documentation
# and any modifications thereto.  Any use, reproduction, disclosure or
# distribution of this software and related documentation without an express
# license agreement from NVIDIA CORPORATION is strictly prohibited.
#
# /etc/fstab: static file system information.
#
# These are the filesystems that are always mounted on boot, you can
# override any of these by copying the appropriate line from this file into
# /etc/fstab and tweaking it as you see fit.  See fstab(5).
#
# <file system> <mount point>             <type>          <options>                               <dump> <pass>
/dev/root            /                     ext4           defaults                                     0 1

After flashing, a simple restart does not appear to fix the issue. Cold booting gives mixed results: sometimes the drive appears after one cold boot, other times it takes several. As stated previously, one time it worked only after inserting a USB thumb drive, but that may just be coincidence. Since the thumb drive itself did not show up at that point, it seems to point to a different issue.

@linuxdev Attached is the dmesg after flashing. The NVMe SSD is not detected.
flashdmesg.txt (74.5 KB)

@linuxdev And after cold booting and eventually getting it to show up. Simply restarting does not solve the issue.
I’ll leave the USB exploration to another time. It could be that I was not diligent enough in my observations at the time.
coldbootdmesg.txt (71.4 KB)

It seems that an fstab entry would not help, since the NVMe controller itself sometimes shows up and other times does not. Similar for the USB side. My gut feeling is that it is a software issue with some sort of resource conflict, although it could be a signal quality issue.
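
Comparing the PCIe/NVMe messages in the boot log between a good boot and a bad boot might show where enumeration diverges, e.g.:

dmesg | grep -iE 'pcie|nvme'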

In a case where the NVMe does show up, e.g., via this lspci example:

0000:01:00.0 Non-Volatile memory controller: Sandisk Corp Device 5002

…the slot would be “0000:01:00.0”, and you could make a verbose query specific to that slot via:

sudo lspci -vvv -s 0000:01:00.0

The verbose listing while the unit is recognized and operating might provide clues. For example, if it is capable of PCIe rev. 2 or 3 and has not fallen back to rev. 1 speed, then I’d think there is no signal issue (it wouldn’t have been able to reach rev. 2 or 3 if the signal was marginal). There isn’t much more I could answer on that, but post that verbose lspci for the case where it works.
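
To pull just the link-speed lines out of that verbose output, something like:

# LnkCap is what the device can do; LnkSta is what it actually negotiated
# (2.5GT/s = gen 1, 5GT/s = gen 2, 8GT/s = gen 3):
sudo lspci -vvv -s 0000:01:00.0 | grep -E 'LnkCap|LnkSta'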

The technical term for this is “fucked up.”

@snarky Yepper, that's why we report issues, especially in the preview releases: first so NVIDIA knows about them, and second so they can fix them.

@linuxdev As requested:

nvidia@jetson-0422818069391:~$ sudo lspci -vvv -s 0000:01:00.0
[sudo] password for nvidia: 
0000:01:00.0 Non-Volatile memory controller: Sandisk Corp Device 5002 (prog-if 02 [NVM Express])
	Subsystem: Sandisk Corp Device 5002
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 32
	Region 0: Memory at 38200000 (64-bit, non-prefetchable) 
	Region 4: Memory at 38204000 (64-bit, non-prefetchable) 
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [b0] MSI-X: Enable+ Count=65 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=4 offset=00000000
	Capabilities: [c0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L0s <256ns, L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range B, TimeoutDis+, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [150 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [1b8 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [300 v1] #19
	Capabilities: [900 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
			  PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=40us
	Kernel driver in use: nvme

Kangalow,

Sorry for the late reply. Could you summarize the current status?

It looks like your NVMe works and then fails from time to time. How about the USB drive?
Is it always involved in the test?

@WayneWWW Here is the scenario:
An NVMe SSD is installed in the M.2 Key M slot. The SSD is identified and works correctly after flashing with JetPack 4.1.

After flashing with JetPack 4.1.1, the SSD is not identified and does not appear. Checking lsblk, lspci, and /dev/nvme* does not show the device (logs above). Warm restarting the Xavier after flashing results in the same issue.

Multiple cold reboots eventually solve the issue. On one occasion, the drive appeared after the first cold reboot; several other times it took multiple cold reboots. Once the device comes online, it appears to work normally and continues to behave as expected between restarts.

The SSD was originally installed and formatted using L4T 31.0.2. After flashing JetPack 4.1 with the SSD installed, the SSD is immediately available. After flashing JetPack 4.1.1, the SSD does not appear, as described above.

For the moment, let’s ignore the USB observation. I was not diligent enough at the time in observing the steps to recreate the issue.

Just a note from lspci: the device is capable of gen. 3 speeds and is actually running at gen. 3. I'd say the signal quality is close to perfect and is in no way the cause of the device becoming invisible. That leaves software as the problem, e.g., perhaps there is a required delay after a power rail comes up and the current delay is too short.

It could still be signal quality. PCIe gen 3 runs active equalization on the link; if that setup/training period doesn't work right, for whatever reason, it would manifest as a signal quality issue.
For example, there may be EMI present during start-up/boot of the module (when training happens) that's not present later.
There may also be all kinds of other issues, including a loose connection or solder joint that only makes contact after the board/module has warmed up a bit. In that case, flashing gives it time to cool down, so it doesn't work for a while afterward.
It's almost impossible to know exactly what the problem is yet, so I wouldn't say "it must be software" just yet, at least not until someone else has reproduced the same problem.

I am wondering whether this issue happens with a specific NVMe device or with all devices.
Internal testing is ongoing.

@WayneWWW If a USB thumb drive is left in the USB hub, it appears normally after the flash. The NVMe SSD does not. I can set up another NVMe SSD under 4.1 with a Samsung 960 EVO to try to reproduce the problem if you cannot reproduce it on your end.

@snarky It could be as you describe, but it seems unlikely to be a hardware issue. The device is identified and works correctly when flashed with JetPack 4.1. With JetPack 4.1.1, on the other hand, the drive is not detected when the Jetson first boots after the flash, nor on subsequent warm restarts. Cold rebooting eventually finds the drive, but it feels much more like a driver issue than a hardware one.

Our team has tried a Samsung NVMe drive and found it working. Please try one on your device as well.