Hi,
I am using a server with three ConnectX-6 NICs. We see the following warnings in dmesg, from only one of the three NICs (on both of its PCI functions):
[7179280.266969] mlx5_core 0000:25:00.1: temp_warn:173:(pid 0): High temperature on sensors with bit set 2 8000000000000000
[7179280.267064] mlx5_core 0000:25:00.0: temp_warn:173:(pid 0): High temperature on sensors with bit set 2 8000000000000000
[7181310.306025] mlx5_core 0000:25:00.0: temp_warn:173:(pid 0): High temperature on sensors with bit set 0 8000000000000000
[7181310.306111] mlx5_core 0000:25:00.1: temp_warn:173:(pid 0): High temperature on sensors with bit set 0 8000000000000000
We are running RoCEv2 with PFC and see a high number of TX PFC pause frames, with long pause durations, sent from all three NICs, not just the one that appears in dmesg. This correlates with an increase in module (transceiver) temperature.
When we increase the cooling of the system, the issue is mitigated and we no longer see backpressure from the server. Interestingly, another server with the exact same NICs and firmware version, under similar thermal conditions, does not show the same behavior.
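For context, here is a minimal sketch of how a TX PFC rate can be derived from two `ethtool -S <iface>` snapshots taken a known interval apart. It only assumes the usual `name: value` line format; the counter names in the usage note (`tx_prio3_pause`, `tx_prio3_pause_duration`) and the helper names are illustrative assumptions, not something confirmed for this setup.

```python
import re


def parse_counters(text):
    """Parse 'name: value' lines as printed by `ethtool -S <iface>`."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\w.\-]+):\s*(\d+)\s*$", line)
        if m:  # header lines such as 'NIC statistics:' do not match
            counters[m.group(1)] = int(m.group(2))
    return counters


def tx_pause_deltas(before, after, interval_s):
    """Per-second increase of every counter whose name looks like a TX pause counter."""
    rates = {}
    for name, new in after.items():
        # Name filter is a heuristic (assumption): starts with 'tx_' and contains 'pause'.
        if name.startswith("tx_") and "pause" in name and name in before:
            rates[name] = (new - before[name]) / interval_s
    return rates
```

With snapshots taken 10 seconds apart, `tx_pause_deltas(parse_counters(first), parse_counters(second), 10.0)` gives a per-counter rate that can be plotted against the module temperature for each NIC.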
My questions are the following:
- Can high temperature on the NIC or transceiver cause the NIC to send a high number of TX PFC frames?
- How well is the threshold for the dmesg warning tuned? Can the high TX PFC behavior occur even on NICs that don’t hit the threshold, as we are seeing here (no warning for the other two NICs, but a high number of TX PFC frames from them too)?
- Do you have any suggestions for diagnostic commands to check whether the NICs are doing some kind of “thermal throttling”? Is this expected?
- Can an incompatibility between the cable/transceiver and the NIC also cause this issue?
- What does “sensors with bit set 2 8000000000000000” in the dmesg warning indicate?
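To make the last question concrete, this is a sketch of how we are currently reading the two hex words in the warning: each is treated as a bitmask and the set bit positions are listed. Which word is the low or high half, and which physical sensor each bit maps to, is exactly what we don't know; the function names here are our own.

```python
import re

# Matches the mlx5_core temp_warn lines shown in the dmesg output above.
LINE_RE = re.compile(
    r"mlx5_core (?P<pci>\S+): temp_warn.*bit set "
    r"(?P<w0>[0-9a-fA-F]+) (?P<w1>[0-9a-fA-F]+)"
)


def set_bits(value):
    """Return the positions (0 = least significant) of the bits set in value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


def decode_temp_warn(line):
    """Return (pci_address, bits_in_first_word, bits_in_second_word), or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    first = int(m.group("w0"), 16)   # first hex word as printed
    second = int(m.group("w1"), 16)  # second hex word as printed
    return m.group("pci"), set_bits(first), set_bits(second)


line = ("[7179280.266969] mlx5_core 0000:25:00.1: temp_warn:173:(pid 0): "
        "High temperature on sensors with bit set 2 8000000000000000")
print(decode_temp_warn(line))  # ('0000:25:00.1', [1], [63])
```

Read this way, the first pair of warnings has bit 1 set in the first word and bit 63 in the second, while the later pair (“0 8000000000000000”) has only bit 63 in the second word.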
For reference, the firmware query output for the three NICs:
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX6
Part Number: MCX653106A-ECA_HPE_Ax
Description: HPE InfiniBand HDR100/Ethernet 100Gb 2-port MCX653106A-ECAT QSFP56 x16 Adapter
PSID: MT_0000000453
PCI Device Name: /dev/mst/mt4123_pciconf2
Base MAC: 88e9a4b62514
Versions:      Current        Available
   FW          20.37.1700     N/A
   PXE         3.7.0102       N/A
   UEFI        14.30.0013     N/A
Status: No matching image found
Device #2:
----------
Device Type: ConnectX6
Part Number: MCX653436A-HDA_HPE_Ax
Description: HPE InfiniBand HDR/Ethernet 200Gb 2-port QSFP56 PCIe4 x16 OCP3 MCX653436A-HDAI Adapter
PSID: MT_0000000593
PCI Device Name: /dev/mst/mt4123_pciconf1
Base MAC: 88e9a4cd615e
Versions:      Current        Available
   FW          20.37.1700     N/A
   PXE         3.7.0102       N/A
   UEFI        14.30.0013     N/A
Status: No matching image found
Device #3:
----------
Device Type: ConnectX6
Part Number: MCX653106A-ECA_HPE_Ax
Description: HPE InfiniBand HDR100/Ethernet 100Gb 2-port MCX653106A-ECAT QSFP56 x16 Adapter
PSID: MT_0000000453
PCI Device Name: /dev/mst/mt4123_pciconf0
Base MAC: 88e9a4c1cc76
Versions:      Current        Available
   FW          20.37.1700     N/A
   PXE         3.7.0102       N/A
   UEFI        14.30.0013     N/A
Status: No matching image found