M.2 Key M Device not always recognized

Hi NVIDIA Team
We implemented the M.2 Key M slot exactly as in the Jetson AGX Xavier reference design. We now want to use a PEAK M.2 Key M PCAN module in this slot, and we have the following issue:
After a reboot or power-off, the device is sometimes not recognized at all (lspci does not list the device, and dmesg says “PCIe link is down”). We already tried limiting the max speed of the pcie@14180000 controller to GEN2 and GEN1, without success. Disabling clock request and/or power-down did not change the behaviour either. Any help is appreciated. The module itself uses one PCIe lane, the clock request and PERST# signals, and the power rails.
Thank you.
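
For anyone who wants to quantify how often the card goes missing, here is a minimal Python sketch that classifies a single boot from captured `lspci -n` and `dmesg` text. The function names and the three result labels are made up for illustration, the exact wording of the dmesg message is assumed to contain the phrase quoted above, and the vendor ID has to be read from `lspci -n` on a boot where the card is detected.

```python
def device_listed(lspci_n_text, vendor_id):
    """Return True if any line of `lspci -n` output carries the given vendor ID."""
    for line in lspci_n_text.splitlines():
        fields = line.split()
        # Expected layout: "<address> <class>: <vendor>:<device> ..."
        if len(fields) >= 3 and ":" in fields[2]:
            try:
                if int(fields[2].split(":")[0], 16) == vendor_id:
                    return True
            except ValueError:
                continue
    return False


def classify_boot(lspci_n_text, dmesg_text, vendor_id):
    """Label one boot as 'detected', 'link down' or 'missing' (hypothetical labels)."""
    if device_listed(lspci_n_text, vendor_id):
        return "detected"
    # Assumption: the kernel message contains exactly this phrase.
    if "PCIe link is down" in dmesg_text:
        return "link down"
    return "missing"
```

Running this from a boot-time script and appending the result to a log file would give a failure rate over many reboots, which makes it easier to judge whether a change (delay, lane count, speed limit) has any effect.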

One thought is that if the PCAN module is itself not ready at the time of the PCI query, then the device would not be recognized. Hot plug is not enabled for this PCIe controller, and although I doubt your unit is of a type which takes too long, there are many FPGA examples which require setting up a late PCIe scan. Someone else would have to describe how to do this, but can you confirm whether the PCAN module might have some sort of significant boot time before it becomes available? Even short times, like 1 second longer than a usual PCIe card, might matter.
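
As a rough illustration of what a late scan from user space could look like, here is a Python sketch that uses the generic Linux sysfs rescan file (`/sys/bus/pci/rescan`). The function names, retry count and delay are only assumptions for the example, and a rescan can only find the card if the link has actually trained in the meantime, so this may well not help if the controller was already powered down.

```python
import time
from pathlib import Path


def find_devices(vendor_id, sysfs_root="/sys/bus/pci"):
    """Return the bus addresses of all PCI devices whose vendor ID matches."""
    matches = []
    for dev in sorted(Path(sysfs_root, "devices").iterdir()):
        vendor_file = dev / "vendor"
        if vendor_file.is_file():
            # sysfs reports the ID as text such as "0x10de"
            if int(vendor_file.read_text().strip(), 16) == vendor_id:
                matches.append(dev.name)
    return matches


def rescan_until_found(vendor_id, attempts=5, delay_s=1.0, sysfs_root="/sys/bus/pci"):
    """Trigger a bus rescan up to `attempts` times, stopping once the device appears."""
    for _ in range(attempts):
        found = find_devices(vendor_id, sysfs_root)
        if found:
            return found
        # Writing "1" asks the kernel to re-enumerate all PCI buses (needs root).
        Path(sysfs_root, "rescan").write_text("1")
        time.sleep(delay_s)
    return find_devices(vendor_id, sysfs_root)
```

It would be called as root with the vendor ID of the card, for example `rescan_until_found(vendor_id)`, where `vendor_id` is the value shown by `lspci -n` on a system where the card is visible.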

Hi,
A couple of things here.

  • Have you tried the reboot/poweroff experiment (where you see issues with your PCAN module) with any NVMe device that goes directly into the M.2 Key-M slot? This would at least rule out any issue with the custom M.2 Key-M slot itself.

  • Has your PCAN module been tested on any other system before, and how did it work there?

  • Since you said the PCAN module is using only one lane, can you try setting the ‘num-lanes’ property in the respective PCIe node (i.e. C0 controller pcie@14180000) to ‘1’ and check?

  • In the event of the link not coming up, did you check ‘DLActive’ in LnkSta (the Link Status register) in the ‘lspci -vvvv’ output? BTW, this is to be checked after disabling the power-down of the controller. If DLActive of the root port happens to be ‘1’, then we can confirm that the PCAN module is taking time to get the link up.
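
To make the DLActive check repeatable, here is a small Python sketch that pulls the link speed, width and DLActive flag out of saved `lspci -vvvv` text for the root port. The function name and the returned dictionary layout are made up for illustration; the parsing only assumes the usual `LnkSta:` line format that lspci prints.

```python
import re


def link_status(lspci_text):
    """Extract speed, width and the DLActive flag from the first LnkSta line."""
    for line in lspci_text.splitlines():
        line = line.strip()
        # "LnkSta2:" lines are skipped because they do not start with "LnkSta:"
        if not line.startswith("LnkSta:"):
            continue
        speed = re.search(r"Speed ([\d.]+GT/s)", line)
        width = re.search(r"Width (x\d+)", line)
        dl = re.search(r"DLActive([+-])", line)
        return {
            "speed": speed.group(1) if speed else None,
            "width": width.group(1) if width else None,
            "dl_active": (dl.group(1) == "+") if dl else None,
        }
    return None  # no LnkSta line found in the text
```

A `dl_active` value of `True` on a boot where the endpoint is missing would point towards the module bringing the link up late, while `False` means the data link layer never became active.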

Hi vidyas

  • We checked the functionality of the M.2 Key M slot with an NVMe device without any issues.
  • Almost the same PCAN module in an mPCIe form factor (according to the manufacturer it has the same firmware; the only differences are the mechanical form factor and the power rails) always works on the same system. Our mPCIe slot uses UPHY_TX0/RX0.
    We will also test the M.2 Key M card on an x86 system and come back to you.
  • Setting num-lanes to 1 did not help. We also set the phys and phy-names to only the first lane.
  • This is the output when the device is not recognized:

0000:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 34
Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
I/O behind bridge: 0000f000-00000fff
Memory behind bridge: fff00000-000fffff
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM not supported, Exit Latency L0s <1us, L1 <64us
ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
RootCap: CRSVisible+
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable- Count=8 Masked-
Vector table: BAR=2 offset=00000000
PBA: BAR=2 offset=00010000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [148 v1] #19
Capabilities: [168 v1] #26
Capabilities: [190 v1] #27
Capabilities: [1c0 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1- L1_PM_Substates+
PortCommonModeRestoreTime=60us PortTPowerOnTime=40us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=10us
L1SubCtl2: T_PwrOn=10us
Capabilities: [1d0 v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
Capabilities: [2d0 v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Capabilities: [308 v1] #25
Capabilities: [314 v1] Precision Time Measurement
PTMCap: Requester:+ Responder:+ Root:+
PTMClockGranularity: 16ns
PTMControl: Enabled:- RootSelected:-
PTMEffectiveGranularity: Unknown
Capabilities: [320 v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
Kernel driver in use: pcieport

DLActive seems to be ‘0’. Any other suggestions on what to check?
Thank you.

We can confirm that on an x86 system, we do not see any issues with the PCAN M.2 module. Is it possible to add a delay to the PCIe enumeration at startup? “nvidia,boot-detect-delay” seems to have no effect.

“nvidia,boot-detect-delay” works for Tegra devices up to TX2. It doesn’t work for AGX.
The following patch can be used to introduce the delay. Play around with the value to fine-tune it.

diff --git a/drivers/pci/dwc/pcie-tegra.c b/drivers/pci/dwc/pcie-tegra.c
index 4cd746b8b..bdaf8849a 100644
--- a/drivers/pci/dwc/pcie-tegra.c
+++ b/drivers/pci/dwc/pcie-tegra.c
@@ -2839,6 +2839,9 @@ static int tegra_pcie_dw_host_init(struct pcie_port *pp)

        clk_set_rate(pcie->core_clk, GEN4_CORE_CLK_FREQ);

+       pr_info("---> Adding some delay before applying PERST\n");
+       msleep(500);
+
        /* assert RST */
        val = readl(pcie->appl_base + APPL_PINMUX);
        val &= ~APPL_PINMUX_PEX_RST;

Hi vidyas
Thank you for the patch. We applied it and could see in the dmesg output that the delay was added, but it did not help with the detection issue of the M.2 PCAN module. We tested different values up to 5 seconds without success.
Any other ideas on what we can check?

I’m afraid we can’t do much at this point other than connecting a PCIe protocol analyzer to see what is going wrong.

Hi vidyas
We did further testing. We put the PEAK M.2 Key M module into both of our mPCIe sockets with an adapter from Delock:

With the adapter, the module was always detected (lspci listing the device). The module only uses the following signals:

We also attach our device tree so you can check whether something is wrong: device-tree.txt (257.5 KB)
Could it be a problem if the signal “GPIO29_M2_KEYM_PEWAKE*” is not connected? Or any other signal?
Is there a significant difference between the PCIe controllers?
Thank you.

Kind regards

I don’t think PEX_WAKE can play a role here. BTW, I don’t see REFCLK lanes above. Isn’t the converter forwarding PCIe reference clock lanes?

The PCIe reference clock lanes are pins 55 and 53 (names missing in the picture), and they are connected. Did you see anything in our device tree that is not correct?

Your DT file looks good to me.
BTW, I’m not very clear on what the difference is (in terms of PCIe signals) between the previous connector and this one. Can you please list
a) extra signals that this form factor is routing?
b) missing signals in this form factor?

Hi vidyas
The signals should be exactly the same on both form factors. The mPCIe to M.2 adapter only makes it possible to use the M.2 module in an mPCIe slot.
In summary, the module always works when connected to UPHY0 and also UPHY7, but not with UPHY2.