What does the Link Down counter on the QM9700 switch mean? How do I find the cause of the failure?

I connected a CX7 adapter to the QM9700 switch. After running for a long time, an error appeared: the Link Down counter was no longer zero, even though I had sent a command to clear it before testing.

What causes this, and why? Is there a command that lets me find the exact cause of the reported error?

The module is an 800G OSFP. The information I queried is as follows:

Operational Info

State : Active
Physical state : LinkUp
Speed : IB-NDR
Width : 4x
FEC : Standard_RS-FEC - (544,514)
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info

Enabled Link Speed : 0x00000080 (NDR)
Supported Cable Speed : 0x00000080 (NDR)

Troubleshooting Info

Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed

Tool Information

Firmware Version : 31.2012.1068
amBER Version : 2.8
MFT Version : mft 4.27.0-83

Module Info

Identifier : OSFP
Compliance : IB NDR,400G-SR4
Cable Technology : 850 nm VCSEL
Cable Type : Optical Module (separated)
OUI : Other
Vendor Name : xxxx
Vendor Part Number : 2X400GMPO
Vendor Serial Number : 201111111
Rev : A1
Wavelength [nm] : 854
Transfer Distance [m] : 0.0
Attenuation (5g,7g,12g,25g) [dB] : N/A
FW Version : 1.23.0
Digital Diagnostic Monitoring : Yes
Power Class : 15.0 W max
CDR RX : ON,ON,ON,ON
CDR TX : ON,ON,ON,ON
LOS Alarm : N/A
Temperature [C] : 65 [-10…80]
Voltage [mV] : 3281.2 [2970…3630]
Bias Current [mA] : 9.452,9.466,9.466,9.452 [0…15]
Rx Power Current [dBm] : 1.300,0.955,0.973,0.945 [-10.41…5]
Tx Power Current [dBm] : -0.150,0.330,0.515,0.418 [-8.416…5]
SNR Media Lanes [dB] : 0,0,0,0
SNR Host Lanes [dB] : 0,0,0,0
IB Cable Width : 1x,2x,4x,8x
Memory Map Revision : 64
Linear Direct Drive : 0
Cable Breakout : Unspecified
SMF Length : N/A
MAX Power : 60
Cable Rx AMP : 1
Cable Rx Emphasis (Pre) : 0
Cable Rx Post Emphasis : 0
Cable Tx Equalization : 0
Wavelength Tolerance : 14.0nm
Module State : Ready state
DataPath state [per lane] : DPActivated,DPActivated,DPActivated,DPActivated
Rx Output Valid [per lane] : 0,0,0,0
Nominal bit rate : N/A
Rx Power Type : Average power
Manufacturing Date : 25_12_23
Active Set Host Compliance Code : IB NDR
Active Set Media Compliance Code : 400G-SR4
Error Code Response : ConfigUndefined
Module FW Fault : 0
DataPath FW Fault : 0
Tx Fault [per lane] : 0,0,0,0
Tx LOS [per lane] : 0,0,0,0
Tx CDR LOL [per lane] : 0,0,0,0
Rx LOS [per lane] : 0,0,0,0
Rx CDR LOL [per lane] : 0,0,0,0
Tx Adaptive EQ Fault [per lane] : 0,0,0,0

2024-05-28 15:07:11,422 -

Operational Info

State : Active
Physical state : LinkUp
Speed : IB-NDR
Width : 4x
FEC : Standard_RS-FEC - (544,514)
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info

Enabled Link Speed : 0x00000080 (NDR)
Supported Cable Speed : 0x00000080 (NDR)

Troubleshooting Info

Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed

Tool Information

Firmware Version : 31.2012.1068
amBER Version : 2.8
MFT Version : mft 4.27.0-83

Physical Counters and BER Info

Time Since Last Clear [Min] : 4302.8
Symbol Errors : 0
Symbol BER : 15E-255
Effective Physical Errors : 0
Effective Physical BER : 15E-255
Raw Physical Errors Per Lane : 48905455907,7934132259,2920644571,787673008
Raw Physical BER : 5E-7
Link Down Counter : 2
Link Error Recovery Counter : 0

2024-05-28 15:07:11,785 -

Operational Info

State : Active
Physical state : LinkUp
Speed : IB-NDR
Width : 4x
FEC : Standard_RS-FEC - (544,514)
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info

Enabled Link Speed : 0x00000080 (NDR)
Supported Cable Speed : 0x00000080 (NDR)

Troubleshooting Info

Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed

Tool Information

Firmware Version : 31.2012.1068
amBER Version : 2.8
MFT Version : mft 4.27.0-83

Histogram of FEC Errors

Header : Range Occurrences
Bin 0 : [0] 20111015590768
Bin 1 : [1] 58829955432
Bin 2 : [2] 827585568
Bin 3 : [3] 19803041
Bin 4 : [4] 808748
Bin 5 : [5] 45105
Bin 6 : [6] 3254
Bin 7 : [7] 298
Bin 8 : [8] 30
Bin 9 : [9] 1
Bin 10 : [10] 2
Bin 11 : [11] 0
Bin 12 : [12] 0
Bin 13 : [13] 0
Bin 14 : [14] 0
Bin 15 : [15] 0
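
One way to read the histogram above: RS-FEC (544,514) carries 30 parity symbols per codeword, so it can correct up to 15 symbol errors, and each bin appears to count codewords by the number of symbols that needed correction. The following is a minimal Python sketch under that reading, using only the bin counts pasted above; the variable names are illustrative and not part of any NVIDIA tool.

```python
# Bin counts copied from the "Histogram of FEC Errors" output above.
# Assumption: index n = codewords with n corrected symbols (the [n] range column).
bins = [
    20111015590768, 58829955432, 827585568, 19803041,
    808748, 45105, 3254, 298, 30, 1, 2, 0, 0, 0, 0, 0,
]

# RS(544,514): 544 - 514 = 30 parity symbols, so up to 15 symbol errors are correctable.
max_correctable = (544 - 514) // 2

total = sum(bins)
corrected = total - bins[0]  # codewords that needed at least one correction
highest_bin = max(i for i, count in enumerate(bins) if count > 0)

print(f"codewords observed:       {total}")
print(f"share needing correction: {corrected / total:.3%}")
print(f"highest populated bin:    {highest_bin} of {max_correctable}")
print(f"margin to the FEC limit:  {max_correctable - highest_bin} symbols")
```

With these numbers roughly 0.3% of codewords needed correction and the highest populated bin is 10, five symbols below the limit of 15. That is consistent with Effective Physical Errors being 0, so the histogram on its own does not explain the two link-down events.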

Unexplained link flaps are often caused by a dirty or mis-seated cable. From the output you posted, it has been about three days since you last cleared the counters:


Time Since Last Clear [Min] : 4302.8
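
The figure quoted above converts to days directly, and combined with the Link Down Counter from the same output it gives a rough average interval between events. A small Python sketch using only those two posted numbers; the variable names are illustrative:

```python
minutes_since_clear = 4302.8  # "Time Since Last Clear [Min]" from the posted output
link_down_count = 2           # "Link Down Counter" from the same output

days_since_clear = minutes_since_clear / (60 * 24)
hours_per_link_down = minutes_since_clear / 60 / link_down_count

print(f"{days_since_clear:.2f} days since the counters were cleared")
print(f"about one link-down event every {hours_per_link_down:.1f} hours on average")
```

Two events are too few to treat the average as a real rate; it only shows that the flaps are rare compared with the length of the run.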


I would continue to monitor and if the link flaps again for no explainable reason, cleaning and reseating both ends of the cable being used for that link would be the first logical step. If any firmware or software has been upgraded recently, review the compatibility matrix at the beginning of the SW/FW release notes to ensure everything is properly aligned.
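
To make the monitoring step concrete, one option is to save the counters output at intervals and compare the Link Down Counter between captures, so that each flap can be tied to a time window. The sketch below is a minimal Python example that only parses text in the `Name : value` layout shown in this thread. The function names `parse_counters` and `link_down_delta` are hypothetical, and how the text is collected is left out because the exact command line depends on your setup.

```python
import re


def parse_counters(text: str) -> dict:
    """Parse 'Name : value' lines (the layout shown in this thread) into a dict."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?)\s+:\s+(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def link_down_delta(previous: str, current: str) -> int:
    """Return how much 'Link Down Counter' grew between two captured outputs.

    A negative result means the counters were cleared between the captures.
    """
    before = int(parse_counters(previous)["Link Down Counter"])
    after = int(parse_counters(current)["Link Down Counter"])
    return after - before


# Two snapshots for illustration; the first is made up, the second uses
# the values posted in this thread.
earlier = "Time Since Last Clear [Min] : 10.0\nLink Down Counter : 0\n"
later = "Time Since Last Clear [Min] : 4302.8\nLink Down Counter : 2\n"

print(link_down_delta(earlier, later))  # 2
```

Logging the capture time next to each nonzero delta makes it easier to correlate a flap with other events, such as a cable being touched or a firmware change.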

If there are still issues at that point and you have a valid NVIDIA support contract, please open a technical support case for more customized and detailed help.

I think it might be a problem with the 800G OSFP module, but I don’t know what causes it. Are there commands for querying more detailed information, for example with mlxlink or ibdiagnet?
In our test environment, I used 400G OSFP CX7 adapters to connect two QM9700s. Each QM9700 has 32 800G OSFP modules plugged in, and the Link Down counter increments only during long runs. (Before the modules were used, the fiber was clean, and after every module showed LinkUp, I ran ibdiagnet -pc to clear the counters.)
I will open a technical support case later.
Thanks for your reply.