MCX654106 PCIe link is disabled

After I set NUM_OF_PF to 0 with mstconfig, the system’s BIOS displays “pcie link training failure … the link is disabled.” In addition, commands such as lspci no longer show the Mellanox device on any operating system I tried (Linux and FreeBSD). Could you please advise on how to recover this network card?

Thanks.

Do you still see the device under mst?

After boot, run:

mst start

mst status

If so, reset the configuration of the device with:

mstconfig -d <mst_dev> reset

Then reboot.
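The check-then-reset steps above could be scripted roughly as follows. This is only a sketch: the helper names `first_mst_device` and `reset_if_visible` are made up for illustration, and it assumes that `mst status` lists visible devices as `/dev/mst/...` paths.

```shell
#!/bin/sh
# Hypothetical helper: print the first /dev/mst/... path found on stdin.
# Prints nothing when no device is listed.
first_mst_device() {
    grep -o '/dev/mst/[^[:space:]]*' | head -n 1
}

# Hypothetical wrapper: reset the configuration only when a device is visible.
reset_if_visible() {
    mst start >/dev/null 2>&1
    dev=$(mst status | first_mst_device)
    if [ -z "$dev" ]; then
        echo "No MST device found; a reset via mstconfig is not possible." >&2
        return 1
    fi
    # mstconfig asks for confirmation before applying the reset.
    mstconfig -d "$dev" reset
}
```

Run `reset_if_visible` as root and reboot afterwards if it succeeds. If it reports that no device was found, the card is not reachable over PCIe and this path cannot help.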

I can't see any device under mst.

root@PowerEdge-R230:~/mft-4.25.0-62-x86_64-deb# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success
Unloading MST PCI configuration module (unused) - Success
root@PowerEdge-R230:~/mft-4.25.0-62-x86_64-deb# mst status
MST modules:

MST PCI module is not loaded
MST PCI configuration module is not loaded

PCI Devices:

No devices were found.

Hello @liangyi571,

Thank you for posting your query on our community. I suggest trying the following:

  1. Cold boot the server.
  2. Ensure the BIOS is at the latest version.
  3. Move the PCIe card to another slot or server and check whether the issue remains.

If the issue persists, I suggest collecting a sysinfo-snapshot and opening a support case for further investigation. A support ticket can be opened by emailing Networking-support@nvidia.com.

Please note that an active support contract is required to open a case. If you do not have a current support contract, please reach out to our Contracts team at networking-contracts@nvidia.com.

Thank you,
Bhargavi

The device will have to be recovered using I2C.
I’m not sure it is field recoverable unless you have the special gear required for it.

It will most likely need to be RMA'd.

Thank you. I have made similar attempts, but the Mellanox device is still not visible in the system. I have already sent the card to the vendor for repair, and I am waiting for their feedback.

Thank you. Is setting NUM_OF_PF to 0 a risky operation? If it is, I suggest that the firmware reject the value 0 for this parameter and enforce that it must be greater than or equal to 1.
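Until the firmware enforces such a limit, a user-side guard along these lines could refuse the dangerous value before it reaches the card. It is a minimal sketch, not an NVIDIA tool: `valid_num_of_pf` and `set_num_of_pf` are hypothetical names, and the lower bound of 1 is taken from the suggestion above.

```shell
#!/bin/sh
# Hypothetical guard: succeed only for an integer greater than or equal to 1.
valid_num_of_pf() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric input
    esac
    [ "$1" -ge 1 ]
}

# Hypothetical wrapper: check the value before calling mstconfig.
set_num_of_pf() {
    dev=$1
    value=$2
    if ! valid_num_of_pf "$value"; then
        echo "Refusing to set NUM_OF_PF=$value (must be >= 1)." >&2
        return 1
    fi
    mstconfig -d "$dev" set "NUM_OF_PF=$value"
}
```

For example, `set_num_of_pf <mst_dev> 0` would print the refusal message and never invoke mstconfig.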

What FW version are you running on the cards?

There should be a protection flow for such scenarios starting with the xx.31.xxxx firmware branch.
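On a card that is still working, the installed version can be read from the firmware-version line of `ethtool -i <interface>`. A rough way to compare it against the branch mentioned above is sketched below. The function name is hypothetical, and the check assumes the middle field of the version string identifies the branch and that later branches keep the protection.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the middle field of a firmware version
# string of the form xx.yy.zzzz is at least 31.
fw_has_protection_branch() {
    case "$1" in
        *.*.*) ;;                  # require three dot-separated fields
        *) return 1 ;;
    esac
    branch=$(printf '%s\n' "$1" | cut -d. -f2)
    case "$branch" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric branch field
    esac
    [ "$branch" -ge 31 ]
}
```

Calling `fw_has_protection_branch 20.31.1014` (an example version string, not one reported in this thread) returns success, while a version on an earlier branch returns failure.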