We have recently moved some servers that use Mellanox ConnectX-5 Ex cards from a DAC-based solution to an optical-cable-based solution with QSFP28 DR1 optics. On our servers with ConnectX-5 Ex cards, we are seeing the error “port_module:255:(pid 0): Port module event[error]: module 1, Cable error, Power budget exceeded”, and the ports are not coming up. We know that there is sufficient power to run these cards from the PCI bus. Our servers that run ConnectX-6 cards do not have this issue, and we have been successful with ConnectX-5 (non-Ex version) in the past.
We have tried setting “ADVANCED_POWER_SETTINGS” and “DISABLE_SLOT_POWER_LIMITER” as recommended in the thread “mlx5_core - Cable error / Power budget exceeded”. However, that has not made any difference.
We have also tried overriding the “Slot Power Limit Control” based on the ConnectX-5 release notes (https://network.nvidia.com/pdf/firmware/ConnectX5-FW-16_24_4020-release_notes.pdf). This did change the log message from “PCIe slot power capability was not advertised.” to “PCIe slot advertised sufficient power (75W).” However, the ports still do not come up.
Any thoughts on how to fix this issue would be greatly appreciated. Thank you!
Here are the log messages before overriding the “Slot Power Limit Control”:
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 3.730530] mlxfw: loading out-of-tree module taints kernel.
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 3.739399] mlxfw: module verification failed: signature and/or required key missing - tainting kernel
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 4.485074] mlx5_core 0000:a1:00.0: firmware version: 16.35.3006
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 4.485110] mlx5_core 0000:a1:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 4.850974] mlx5_core 0000:a1:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 4.851175] mlx5_core 0000:a1:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 4.859631] mlx5_core 0000:a1:00.0: port_module:255:(pid 0): Port module event[error]: module 0,
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 4.859998] mlx5_core 0000:a1:00.0: mlx5_pcie_event:295:(pid 490): PCIe slot power capability was not advertised.
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 4.942492] mlx5_core 0000:a1:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 5.166178] mlx5_core 0000:a1:00.1: firmware version: 16.35.3006
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 5.166260] mlx5_core 0000:a1:00.1: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 5.567758] mlx5_core 0000:a1:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 5.585780] mlx5_core 0000:a1:00.1: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 5.613840] mlx5_core 0000:a1:00.1: port_module:255:(pid 0): Port module event[error]: module 1, Cable error, Power budget exceeded
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 5.616051] mlx5_core 0000:a1:00.1: mlx5_pcie_event:295:(pid 490): PCIe slot power capability was not advertised.
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 5.625590] mlx5_core 0000:a1:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 5.867331] mlx5_core 0000:a1:00.0 enp161s0f0np0: renamed from eth0
Oct 11 23:02:39 prime-or1-cld-comp-14 kernel: [ 5.909141] mlx5_core 0000:a1:00.1 enp161s0f1np1: renamed from eth1
Oct 11 23:02:40 prime-or1-cld-comp-14 kernel: [ 99.958396] mlx5_core 0000:a1:00.1 enp161s0f1np1: Link down
Oct 11 23:02:41 prime-or1-cld-comp-14 kernel: [ 100.761883] mlx5_core 0000:a1:00.0 enp161s0f0np0: Link down
Oct 11 23:02:41 prime-or1-cld-comp-14 kernel: [ 100.775426] mlx5_core 0000:a1:00.0: lag map: port 1:1 port 2:2
Oct 11 23:02:41 prime-or1-cld-comp-14 kernel: [ 100.775435] mlx5_core 0000:a1:00.0: shared_fdb:0 mode:queue_affinity
Here are the log messages after overriding the “Slot Power Limit Control”.
These changes were made with the commands:
echo "MLNX_RAW_TLV_FILE" > /root/power_conf_tlv.cfg
echo "0x00000004 0x00000088 0x00000000 0xc0000000" >> /root/power_conf_tlv.cfg
mlxconfig -d /dev/mst/mt4121_pciconf0 -f /root/power_conf_tlv.cfg set_raw
Here is the log:
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 3.755310] mlxfw: loading out-of-tree module taints kernel.
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 3.800926] mlxfw: module verification failed: signature and/or required key missing - tainting kernel
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 4.657792] mlx5_core 0000:a1:00.0: firmware version: 16.35.3006
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 4.665761] mlx5_core 0000:a1:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 5.033921] mlx5_core 0000:a1:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 5.034118] mlx5_core 0000:a1:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 5.041733] mlx5_core 0000:a1:00.0: port_module:255:(pid 0): Port module event[error]: module 0, Cable error, Power budget exceeded
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 5.041991] mlx5_core 0000:a1:00.0: mlx5_pcie_event:304:(pid 491): PCIe slot advertised sufficient power (75W).
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 5.086344] mlx5_core 0000:a1:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 5.312150] mlx5_core 0000:a1:00.1: firmware version: 16.35.3006
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 5.312226] mlx5_core 0000:a1:00.1: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 5.697366] mlx5_core 0000:a1:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 5.716337] mlx5_core 0000:a1:00.1: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 5.744471] mlx5_core 0000:a1:00.1: port_module:255:(pid 0): Port module event[error]: module 1, Cable error, Power budget exceeded
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 5.746545] mlx5_core 0000:a1:00.1: mlx5_pcie_event:304:(pid 491): PCIe slot advertised sufficient power (75W).
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 5.756500] mlx5_core 0000:a1:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 6.044173] mlx5_core 0000:a1:00.0 enp161s0f0np0: renamed from eth0
Oct 12 04:32:05 prime-or1-cld-comp-14 kernel: [ 6.073093] mlx5_core 0000:a1:00.1 enp161s0f1np1: renamed from eth1
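To make the two boots easier to compare, the power-related lines can be pulled out of the kernel log with a small filter. This is only a sketch: the function name `mlx5_power_events` is made up for illustration, the patterns are taken from the messages quoted above, and the log path in the usage comment is an assumption that depends on the distribution.

```shell
# Print only the mlx5_core port-module and PCIe slot power messages
# from a kernel log supplied on stdin.
# The function name is hypothetical; the patterns match the log lines above.
mlx5_power_events() {
    grep -E 'mlx5_core .*(Port module event|PCIe slot)'
}

# Example usage (log path is an assumption; adjust as needed):
#   mlx5_power_events < /var/log/kern.log
#   dmesg | mlx5_power_events
```

Running this against the logs from before and after the override should leave just the “Port module event” and “PCIe slot” lines for each PCI function, which is where the two boots differ.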