M.2 Key M Device not always recognized

Hi NVidia Team
We implemented the M.2 Key M slot exactly as in the reference design of the Jetson AGX Xavier. We now wanted to use a M.2 Key M CAN Modul in this slot. We have the following issue:
After a reboot or poweroff, the device is sometimes not recognized at all (lspci not listing the device, dmesg says “PCIe link is down”. We already tried to set the max speed of the pcie@14180000 controller to GEN2 and GEN1, without success. Also disabling clockrequest and/or powerdown did not change the behaviour. Any help is appreciated. The module itself uses one pcie lane, the clockrequest and the PERST# signal and power rails.
Thank you.

One thought is that if the PCAN module is itself not ready at the time of PCI query, then the device would not be recognized. Hot plug is not enabled for this PCIe, and although I doubt your unit would be of a type which takes too long, there are many FPGA examples which require setting up for late PCIe scan. Someone else would have to describe how to do this, but can you confirm if or if not the PCAN module might have some sort of significant boot time before it becomes available? Even short times like 1 second longer than a usual PCIe card might matter.

Hi,
A couple of things here.

  • Have you tried with any NVMe device (which goes into the M.2 Key-M slot directly) the reboot/poweroff experiment (where you see issues with your PCAN module)? This would at least rule out any issue with the custom M.2 Key-E slot itself.

  • Has your PCAN module been tested on any other system before? and how did it work there?

  • Since you said the PCAN module is using only one lane, can you try setting the ‘num-lanes’ property in the respective PCIe node (i.e. C0 controller pcie@14180000) to ‘1’ and check?

  • In the event of the link not coming up, did you check the ‘DLActive’ in LnkSta (Link Status register) in ‘lspci -vvvv’ output? BTW, this is to be checked after disabling the power down of the controller. In case if DLActive of the root port happens to be ‘1’, then, we can confirm that the PCAN module is taking time to get the link up.

Hi vidyas

  • We checked the functionality of the M.2 Key M slot with an NVMe device without any issues.
  • Almost the same PCAN module in a mPCIe form factor (according to the manufacturer with the same firmware, only difference mechanical form factor and power rails) works always on the same system. Our mPCIe slot uses UPHY_TX0/RX0.
    We will also test the M.2 Key M card on an x86 system and come back to you.
  • Setting num-lanes to 1 did not help. We also set the phys and phy-names to only the first lane.
  • This is the output when device is not recognized

0000:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 34
Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
I/O behind bridge: 0000f000-00000fff
Memory behind bridge: fff00000-000fffff
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM not supported, Exit Latency L0s <1us, L1 <64us
ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
RootCap: CRSVisible+
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable- Count=8 Masked-
Vector table: BAR=2 offset=00000000
PBA: BAR=2 offset=00010000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [148 v1] #19
Capabilities: [168 v1] #26
Capabilities: [190 v1] #27
Capabilities: [1c0 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1- L1_PM_Substates+
PortCommonModeRestoreTime=60us PortTPowerOnTime=40us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=10us
L1SubCtl2: T_PwrOn=10us
Capabilities: [1d0 v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?> Capabilities: [2d0 v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Capabilities: [308 v1] #25
Capabilities: [314 v1] Precision Time Measurement
PTMCap: Requester:+ Responder:+ Root:+
PTMClockGranularity: 16ns
PTMControl: Enabled:- RootSelected:-
PTMEffectiveGranularity: Unknown
Capabilities: [320 v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
Kernel driver in use: pcieport

DLActive seems to be ‘0’. Any other suggestions what to check?
Thank you.

We can confirm that on a x86 system, we do not see any issues with the PCAN M.2 Module. Is there a possibility to add a delay to the pcie enumeration at startup? “nvidia,boot-detect-delay” seems to have no effect.

“nvidia,boot-detect-delay” works for Tegras till TX2. It doesn’t work for AGX
The following patch can be used for introducing the delay. Play around with the value to fine-tune it.

diff --git a/drivers/pci/dwc/pcie-tegra.c b/drivers/pci/dwc/pcie-tegra.c
index 4cd746b8b..bdaf8849a 100644
--- a/drivers/pci/dwc/pcie-tegra.c
+++ b/drivers/pci/dwc/pcie-tegra.c
@@ -2839,6 +2839,9 @@ static int tegra_pcie_dw_host_init(struct pcie_port *pp)

        clk_set_rate(pcie->core_clk, GEN4_CORE_CLK_FREQ);

+       pr_info("---> Adding some delay before appling PERST\n");
+       msleep(500);
+
        /* assert RST */
        val = readl(pcie->appl_base + APPL_PINMUX);
        val &= ~APPL_PINMUX_PEX_RST;

Hi vidyas
Thank you for the patch. We applied it and were able to see from the “dmesg”, that the delay was added. But it did not help with the detection issue of the M.2 PCAN Module. We tested with different values up to 5 seconds without success.
Any other ideas what we can check?

I’m afraid we can’t do much at this point other than connecting the PCIe protocol analyzer and see what is going wrong.

I don’t think PEX_WAKE can play a role here. BTW, I don’t see REFCLK lanes above. Isn’t the converter forwarding PCIe reference clock lanes?

The reference clock pcie lanes are pin 55 and 53 (names missing in the picture), they are connected. Did you see anything in our device-tree that is not correct?

Your DT file looks good to me.
BTW, I’m not very clear on what is the difference (in terms of PCIe signals) from the previous connector and this one? Can you please list down
a) extra signals that this form factor is routing?
b) missing signals in this form factor?

Hi vidyas
The Signals should be exactly the same on both form factors. And the mPCIe to M.2 Adapter only allows to use the M.2 Module on a mPCIe slot.
In summary, the module works always connected to UPHY0 and also UPHY7, but not with UPHY2.

Hi vidyas
Can you tell us if there is any possibility to change settings of the pcie controllers like the signal level of the RefCLK?

We saw that the Ref-Clk (100MHz) signal level of the M.2 connector (UPHY2) was with ~175mV lower than for example the one of our mPCIe Connector with ~185 mV (UPHY0, UPHY7). Compared with an x86 desktop computer where the signal is over 300mV.

Is this difference explainable/normal? Are there settings to equalize the M.2 and the mPCIe?
Thank you.

Based on your description of the differences in the REFCLK, I think it could be due to that as well, but again, we can’t be of much help here as we don’t have much insight into the custom board. I hope this board is designed according to the design guidelines published.
Tegra already gives the REFCLK within the spec-defined voltage limits. We need to have a good connector/adapters to take it till the endpoint. I think that part seems to be missing here.