Have ~40 NVMe drives, all same model. Half work w/ TX2, half don't. All recognized by Win10.

Hello all,

Specifically, I have Toshiba KXG50ZNV1T02 (1TB) NVMe M.2 drives.
I’m using an HP workstation running W10 64-bit.
My Linux setup is Ubuntu 16.04 LTS running on an NVIDIA Jetson TX2 (ARMv8) attached to a developer kit carrier board.
I’m using a PCI-E adapter board (StarTech PEX4M2E1) to connect the M.2 drives.

I started with a batch of about 40 drives. About 60% of them are detected successfully in both operating systems.
The remaining 40% are detected only by Windows.

I’ve tried deleting all partitions and formatting as exFAT and as NTFS. I’ve also tried GPT and MBR partition tables, with partitions and without. None of these has made any difference.

After formatting the drives as NTFS and rebooting the workstation PC with an Ubuntu Live USB, I am able to see the drive, run ntfsfix, and mount it. From there, I can reformat as ext4 or any other Linux filesystem, but the drives STILL don’t show up when plugged into the TX2.

I read on this forum about rebuilding the kernel such that CONFIG_PCI_TEGRA=y, instead of CONFIG_PCI_TEGRA=m, which is what it currently reads. But that still doesn’t account for why some drives worked and some didn’t, right?
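
For what it’s worth, this is how I’m checking the current setting on the TX2 (assuming the running kernel exposes /proc/config.gz; otherwise the .config in the kernel source tree shows the same thing):

zcat /proc/config.gz | grep CONFIG_PCI_TEGRA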

Thanks for your time.

Phil

Does either of the Windows or Ubuntu PCs use the same PCIe adapter? With that exact adapter, does a single TX2 see some NVMe drives, but not others?

If the PCIe card is seen via “lspci”, but the drive itself is not seen, then the issue is between the PCIe card and the NVMe. If the card itself cannot be seen, then the issue is between the TX2 and the PCIe card. If some NVMe drives can be seen on an adapter, but others cannot be seen on the same adapter, then it looks like a marginal signal issue (I don’t know whether the issue also follows the PCIe adapter, or whether the adapter can be shown as working in at least some circumstances).
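
As a rough sketch, the two checks would look something like this (device addresses and names will differ per system):

lspci                  # is the PCIe bridge and/or the NVMe controller enumerated at all?
ls -l /dev/nvme*       # has the kernel created the NVMe block device nodes?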

One thing that isn’t clear here is whether the drives are not getting detected at the PCIe level itself (which can be confirmed by dumping ‘sudo lspci -vvvv’ output), or whether the device node is not being populated in Linux (e.g. /dev/nvme0n1).

I’m moving the PCIe card between the workstation and the TX2 with the NVMe drives attached to it.

When the drive is working correctly, I can ls /dev/nvme0n1.
When it’s not working, I cannot.

I connected a different NVMe drive (an ADATA) using the same PCIe adapter card.
The drive was detected and I could mount it in Ubuntu on the TX2.

I ran ‘sudo lspci -vvvv’, but I’m not familiar enough with the output to know what I’m looking at.
I’ll provide the output in the next post.
The ADATA drive was connected successfully via the adapter card when I ran this:

00:01.0 PCI bridge: NVIDIA Corporation Device 10e5 (rev a1) (prog-if 00 [Normal decode])
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 388
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 0000f000-00000fff
	Memory behind bridge: 50100000-501fffff
	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Subsystem: NVIDIA Corporation Device 0000
	Capabilities: [48] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/2 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [60] HyperTransport: MSI Mapping Enable- Fixed-
		Mapping Address Base: 00000000fee00000
	Capabilities: [80] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0
			ExtTag+ RBE+
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
			Slot #0, PowerLimit 0.000W; Interlock- NoCompl-
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
			Control: AttnInd Off, PwrInd On, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
			Changed: MRL- PresDet+ LinkState+
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
		RootCap: CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Kernel driver in use: pcieport

01:00.0 Non-Volatile memory controller: Device 1cc1:8201 (rev 03) (prog-if 02 [NVM Express])
	Subsystem: Device 1cc1:8201
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 388
	Region 0: Memory at 50100000 (64-bit, non-prefetchable) 
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L0s <1us, L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [158 v1] #19
	Capabilities: [178 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [180 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
	Kernel driver in use: nvme

About lspci…you will see a “brief” entry like this if you don’t use the verbose option:

01:00.0 Non-Volatile memory controller: Device 1cc1:8201 (rev 03) (prog-if 02 [NVM Express])

In this case, the PCIe slot is “01:00.0”. You could limit the lspci query to just this via:

lspci -s 01:00.0

Then lspci takes up to three “-v” options to increase verbosity (four won’t hurt, but four also won’t add any verbosity). To get full verbose information you need root permission, so this would be maximum information about that slot:

sudo lspci -s 01:00.0 -vvv

This particular card is capable (“LnkCap”, advertised link capability) of using revision 3 speeds, i.e. 8GT/s. The link is actually running at revision 2 speed (“LnkSta”, link status, 5GT/s), so you have good signal quality, but, either due to signal quality or to port capability, it is only operating at rev. 2 speed (this is not an error, it is automatic tuning; 5GT/s with 4 lanes will far exceed the NVMe’s speed capabilities unless it is in a RAID configuration for performance). Note that the control commands are running at 8GT/s, and control would never need significant data throughput, but it is nice to know the signal quality is good there.
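
If you just want those two fields without reading the whole dump, something like this works (01:00.0 is the NVMe slot from the log above; adjust to match yours):

sudo lspci -s 01:00.0 -vvv | grep -E 'LnkCap|LnkSta'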

The advanced error reporting (“AER”) has no listed errors…the pointer to the first error is NULL (“First Error Pointer”). The PCIe is 100% valid in that lspci, and functioning as it should. Any issue is due to NVMe software or other software which is unrelated to the PCIe bus.

Apparently that log was from the Jetson since the bridge was from NVIDIA. I couldn’t tell you what the NVMe issue is, but if you find a failure and run the verbose lspci again during the failure, and if there is only a “00” (null) for “AERCap: First Error Pointer: 00”, then you’ve proven PCIe is working as expected. The implication is that debugging of NVMe is required, and that PCIe is not in the way of it working.
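
As a sketch, the relevant AER lines can be pulled out during a failure like this (same assumed slot address as above):

sudo lspci -s 01:00.0 -vvv | grep -E 'First Error Pointer|UESta|CESta'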

The log attached in comment #5 has an NVMe drive connected to the system, confirmed by the following line:

01:00.0 Non-Volatile memory controller: Device 1cc1:8201 (rev 03) (prog-if 02 [NVM Express])

and the driver is also binding to the device, confirmed by the following line:

Kernel driver in use: nvme

In that case, seeing /dev/nvme0n1 is expected. Can you please confirm that you do see /dev/nvme0n1? If so, can you please attach the same log for a failure case as well?
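
Something like this should capture both pieces of information for a failure case (the output file names are just examples):

ls -l /dev/nvme*
sudo lspci -vvvv > lspci_failure.txt
dmesg > dmesg_failure.txt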

When a functioning drive is connected, I can see /dev/nvme0n1.

With a “bad” drive attached, “lspci” yields nothing; the terminal just goes to the next line.
The same is true if I run “lspci” with only the adapter board and no drive attached.

Reminder, these “bad” drives work if I plug them into Windows.

This is weird. Are the good and bad devices of the same type? Did you happen to compare the ‘sudo lspci -vvvv’ output (when the devices are connected to an x86 Linux machine) of the good and bad devices? Otherwise, I don’t see any reason why some drives get detected and others don’t.

Just to clarify, can you confirm that lspci does or does not show the PCIe card itself when a bad drive is connected? Is it just the dev special file for the drive itself which disappears?

If either the drive disappears, or both the PCIe card and the drive disappear, then it seems a signal is in a state where the missing device can’t even run at its lowest spec. In that case, if it is the same hardware, then I’d have to expect clock signals or power rails are involved (such as a power rail dropping to the point where the clock is no longer of high enough quality…extra jitter, lower levels, and so on). A good drive with a slight droop in power or loss of clock quality could do this.

Are all of the Jetsons using the same power supply and carrier board?
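
One quick way to answer the question about the card itself (a sketch; 00:01.0 is the root port address from the earlier log and may differ):

lspci -s 00:01.0                # does the NVIDIA root port itself still enumerate?
dmesg | grep -iE 'pcie|nvme'    # any link up/down or nvme probe messages in the kernel log?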