MCX516A-CDAT insufficient power message on one out of 13 identical servers


i have had a weird problem for about 2 weeks now. we have 13 mostly identical, somewhat older supermicro servers (SYS-1028U-TR4+). about half the boards have a newer hardware revision, but otherwise they are identical: BIOS version, IPMI FW etc. are all the same. they are running VMware ESXi 8.0.2, build 22380479.

each host has one dual-port card (MCX516A-CDAT) and both ports are in active use. a few years ago we used passive DACs, which had reliability issues, so we swapped to 100GBASE-CWDM4 modules, which have worked flawlessly for around 2-3 years now.

recently, however, on exactly one server the card refuses to bring up the second link. the only thing that stands out in vmkernel.log is this message:

2024-04-11T12:48:08.913Z Wa(180) vmkwarning: cpu22:2097687)WARNING: <NMLX_WRN> WARN: Detected insufficient power on the PCIe slot (27W).

however, i am not seeing a message like “Cable error, One or more network ports have been powered down due to insufficient/unadvertised power on the PCIe slot”, as described on the Troubleshooting page in the NVIDIA docs.

therefore, i am not even sure whether the power warning is the actual cause. so far we have tried:

  • swapping the card to another (same model)
  • swapping the transceivers
  • booting without transceivers, adding them later

… none of which made a difference.

is there any definitive way to check whether the second port is actually down because of “the power issue” and not something else?
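
to at least rule out that the second message is in the log and simply got overlooked, the log can be scanned for both strings. this is only a minimal python sketch, not a definitive check: it assumes the driver would log the “powered down” message with the same wording as the NVIDIA docs, and the default path `/var/log/vmkernel.log` and the function name `scan_log` are my own choices for illustration.

```python
import re
import sys

# warning text as it appears in the vmkernel.log line quoted above
SLOT_POWER_RE = re.compile(
    r"Detected insufficient power on the PCIe slot \((\d+)W\)"
)
# fragment of the "ports powered down" message quoted from the NVIDIA docs;
# assumption: the driver logs it with this exact wording
PORT_DOWN_FRAGMENT = "powered down due to insufficient/unadvertised power"


def scan_log(lines):
    """Return (list of reported slot wattages, whether the port-down message was seen)."""
    watts = []
    port_down_seen = False
    for line in lines:
        match = SLOT_POWER_RE.search(line)
        if match:
            watts.append(int(match.group(1)))
        if PORT_DOWN_FRAGMENT in line:
            port_down_seen = True
    return watts, port_down_seen


if __name__ == "__main__":
    # assumption: default log location; pass another path as the first argument
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/vmkernel.log"
    with open(path, errors="replace") as fh:
        watts, port_down_seen = scan_log(fh)
    print("slot power warnings (W):", watts)
    print("port power-down message seen:", port_down_seen)
```

running the same script against the log of one of the healthy hosts would show whether the slot power warning is specific to the affected server or appears everywhere. it does not prove why the port is down, it only shows which of the two messages is actually present.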

thanks & regards,