Xavier NX: PCIe error and machine reboot

Hi NVIDIA team,

The PCIe link on our Xavier NX reports errors and then the machine reboots. The UART debug log is as follows:

[   28.351995] igb 0004:07:00.0 eth1: PCIe link lost, device now detached
safereg_poll_timer_cb: poll interval 106 above target 100
safereg_poll_timer_cb: poll interval 446 above target 100
[   30.022243] igb 0004:07:00.0 eth1: malformed Tx packet detected and dropped, LVMMC:0xffffffff
safereg_poll_timer_cb: poll interval 103 above target 100
safereg_poll_timer_cb: poll interval 103 above target 100
[   36.823048] bpmp: mrq 22 took 3996000 us
[   36.825335] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Transmitter ID)
[   36.828440] pcieport 0004:00:00.0:   device [10de:1ad1] error status/mask=00009001/0000e000
[   37.028126] igb 0004:07:00.2 eth3: PCIe link lost, device now detached
[   37.079694] igb 0004:07:00.1 eth2: PCIe link lost, device now detached
[   37.135140] pcieport 0004:00:00.0:    [ 0] Receiver Error         (First)
[   37.137026] pcieport 0004:00:00.0:    [12] Replay Timer Timeout  
[   37.138997] pcieport 0004:00:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0000(Requester ID)
[   37.141196] pcieport 0004:00:00.0:   device [10de:1ad1] error status/mask=00004020/00400000
[   37.143122] pcieport 0004:00:00.0:    [ 5] Surprise Down Error   
[   37.145038] pcieport 0004:00:00.0:    [14] Completion Timeout     (First)
[   37.593029] igb 0004:07:00.3 eth4: PCIe link lost, device now detached
safereg_poll_timer_cb: poll interval 106 above target 100
safereg_poll_timer_cb: poll interval 113 above target 100
safereg_poll_timer_cb: poll interval 106 above target 100
[   39.543876] igb 0004:07:00.2 eth3: malformed Tx packet detected and dropped, LVMMC:0xffffffff
safereg_poll_timer_cb: poll interval 115 above target 100
[   39.708664] igb 0004:07:00.1 eth2: malformed Tx packet detected and dropped, f
safereg_poll_timer_cb: poll interval 292 above target 100
safereg_poll_timer_cb: poll interval 207 above target 100 
safereg_poll_timer_cb: poll interval 119 above target 100 
safereg_poll_timer_cb: poll interval 103 above target 100 
safereg_poll_timer_cb: poll interval 105 above target 100 
safereg_poll_timer_cb: poll interval 120 above target 100 
safereg_poll_timer_cb: poll interval 106 above target 100 
safereg_poll_timer_cb: poll interval 128 above target 100
[   57.983221] INFO: rcu_preempt self-detected stall on CPU
[   57.983232] INFO: rcu_preempt detected stalls on CPUs/tasks:
[   57.983252]  4-...: (1 GPs behind) idle=8cd/140000000000001/0 softirq=5928/5931 fqs=313 
[   57.983255] 
[   57.990164]  4-...: (1 GPs behind) idle=8cd/140000000000001/0 softirq=5928/5931 fqs=315 
[   57.990173]   (t=5266 jiffies g=553 c=552 q=10466)
[   58.248160] INFO: rcu_sched detected stalls on CPUs/tasks:
[   58.248176]  4-...: (1 GPs behind) idle=8cd/140000000000001/0 softirq=5930/5931 fqs=269
safereg_poll_timer_cb: poll interval 112 above target [...]
safereg_poll_timer_cb: poll interval 196 above target 100
[   58.248184]        (detected by 0, t=5253 jiffies, g=111, c=110, q=59)
safereg_poll_timer_cb: poll interval 108 above target 100
safereg_poll_timer_cb: poll interval 106 above target 100 
safereg_poll_timer_cb: poll interval 118 above target 100 
safereg_poll_timer_cb: poll interval 109 above target 100
[   71.088017] pcieport 0004:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Trans[...]uester ID)
safereg_poll_timer_cb: poll interval [...]
safereg_poll_timer_cb: poll interval 226 above target 100
[   71.550316] pcieport 0004:00:00.0:   device [10de:1ad1] error status/mask=00004020/00400000
[   71.858412] pcieport 0004:00:00.0:    [ 5] Surprise Down Error
safereg_poll_timer_cb: poll interval 112 above target 100
[   72.115160] pcieport 0004:00:00.0:    [14] Completion Timeout     (First)
safereg_poll_timer_cb: poll interval 130 above target 100
safereg_poll_timer_cb: poll interval 116 above target 100
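
The "error status/mask" pairs in the log above can be decoded by hand. The following is a minimal Python sketch, assuming the kernel lists only the bits that are set in the status word and clear in the mask word (this is consistent with the bit lines printed in the log). The bit-name tables are partial and cover only common AER bits; the function name `decode_aer` is hypothetical.

```python
# Partial AER bit-name tables (not exhaustive); bit positions follow the
# PCIe Advanced Error Reporting status registers.
CORRECTABLE = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}
UNCORRECTABLE = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    18: "Malformed TLP",
    20: "Unsupported Request",
}


def decode_aer(status_mask, names):
    """Decode a 'status/mask' hex pair as printed in the log.

    Assumption: only bits set in status and not set in mask are reported.
    Returns a list of (bit, name) tuples in ascending bit order.
    """
    status_hex, mask_hex = status_mask.split("/")
    unmasked = int(status_hex, 16) & ~int(mask_hex, 16)
    return [
        (bit, names.get(bit, "Unknown Error Bit"))
        for bit in range(32)
        if (unmasked >> bit) & 1
    ]


if __name__ == "__main__":
    # Values copied from the log above.
    print(decode_aer("00009001/0000e000", CORRECTABLE))
    print(decode_aer("00004020/00400000", UNCORRECTABLE))
```

With the values from the log, the corrected error decodes to bits 0 and 12 (Receiver Error, Replay Timer Timeout) and the uncorrected error to bits 5 and 14 (Surprise Down Error, Completion Timeout), matching the lines the kernel printed.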

Could you tell me what I should do?

Thanks.

Hi,

Could you explain your HW setup?

  1. Is it a devkit, or an NX module + custom CVB?
  2. If it is a devkit, how are you connecting the Intel igb device to the M.2 Key E slot?

I see a “surprise down” error, so I suspect the HW setup here. One thing you can try is limiting the PCIe speed to Gen1: apply the following patch and flash the DTB.

diff --git a/common/tegra194-p3668-common.dtsi b/common/tegra194-p3668-common.dtsi
index 1f923ed268ac..69dedbe5df23 100644
--- a/common/tegra194-p3668-common.dtsi
+++ b/common/tegra194-p3668-common.dtsi
@@ -330,7 +330,7 @@
 		vddio-pex-ctl-supply = <&p3668_spmic_sd3>;
 		nvidia,disable-aspm-states = <0xf>;
 		nvidia,enable-power-down;
-		nvidia,max-speed = <3>;
+		nvidia,max-speed = <1>;
 		num-lanes = <1>;
 		phys = <&p2u_11>;
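
After flashing, it may help to confirm that the link actually trained at Gen1. The following is a rough Python sketch that parses the `LnkSta` line of `lspci -vv` output. The function names are hypothetical, the root port address `0004:00:00.0` is taken from the log in this thread, and the regular expression assumes the usual `Speed <n>GT/s, Width x<n>` layout, which may differ between lspci versions. lspci typically needs root privileges to show link capabilities.

```python
import re
import subprocess

# Map the per-lane transfer rate reported by lspci to a PCIe generation.
GEN_BY_SPEED = {"2.5": 1, "5": 2, "8": 3, "16": 4}


def parse_link_status(lspci_text):
    """Return (generation, lane_width) from 'lspci -vv' text, or None."""
    match = re.search(
        r"LnkSta:\s*Speed\s+([\d.]+)GT/s[^,]*,\s*Width\s+x(\d+)", lspci_text
    )
    if match is None:
        return None
    return GEN_BY_SPEED.get(match.group(1)), int(match.group(2))


def read_link_status(bdf="0004:00:00.0"):
    """Run lspci for one device and parse its negotiated link status."""
    result = subprocess.run(
        ["lspci", "-vv", "-s", bdf],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_link_status(result.stdout)


if __name__ == "__main__":
    # With nvidia,max-speed = <1>, a result of (1, 1) would be expected.
    print(read_link_status())
```

If the link still drops at Gen1, that would point further toward the HW setup rather than signal integrity at higher speeds.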
    

Thanks,
Manikanta

Hi cxl824158933,

Is this still an issue that needs support?
If so, could you reply to the previous questions?