High temperature on Mellanox cards causes high amount of TX PFCs

Hi,

I am currently using a server with three ConnectX-6 NICs. We see the following warnings in dmesg, for only one of the three NICs:

[7179280.266969] mlx5_core 0000:25:00.1: temp_warn:173:(pid 0): High temperature on sensors with bit set 2 8000000000000000
[7179280.267064] mlx5_core 0000:25:00.0: temp_warn:173:(pid 0): High temperature on sensors with bit set 2 8000000000000000
[7181310.306025] mlx5_core 0000:25:00.0: temp_warn:173:(pid 0): High temperature on sensors with bit set 0 8000000000000000
[7181310.306111] mlx5_core 0000:25:00.1: temp_warn:173:(pid 0): High temperature on sensors with bit set 0 8000000000000000

We are running RoCEv2 with PFC and see a large number of TX PFC pause frames, with long pause durations, sent from all three NICs, not just the one appearing in dmesg. This correlates with an increase in module (transceiver) temperature.
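
To put numbers on the correlation described above, the per-priority TX pause counters can be sampled periodically and compared. The following is a minimal sketch that parses text saved from `ethtool -S <iface>`. The counter names (`tx_prio<N>_pause`, `tx_prio<N>_pause_duration`) are an assumption based on the mlx5 driver's per-priority counters, the helper names are hypothetical, and the sample values are made up for illustration, not taken from our servers.

```python
import re

# Matches lines such as "     tx_prio3_pause: 12345" in `ethtool -S <iface>` output.
# Assumption: the driver exposes tx_prio<N>_pause and tx_prio<N>_pause_duration.
_PAUSE_RE = re.compile(r"^\s*(tx_prio\d_pause(?:_duration)?):\s*(\d+)\s*$")


def parse_tx_pfc(ethtool_stats: str) -> dict:
    """Return {counter_name: value} for the per-priority TX pause counters."""
    counters = {}
    for line in ethtool_stats.splitlines():
        m = _PAUSE_RE.match(line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


def pfc_delta(before: dict, after: dict) -> dict:
    """Increase of each counter between two samples (missing counters start at 0)."""
    return {name: value - before.get(name, 0) for name, value in after.items()}


# Hypothetical samples taken some seconds apart (illustrative values only).
sample_before = "     tx_prio3_pause: 100\n     tx_prio3_pause_duration: 5000\n"
sample_after = "     tx_prio3_pause: 150\n     tx_prio3_pause_duration: 9000\n"
print(pfc_delta(parse_tx_pfc(sample_before), parse_tx_pfc(sample_after)))
```

Logging these deltas next to the module temperature at the same timestamps would show whether the pause frames rise with temperature on each NIC individually.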

When we increase the cooling of the system, the issue is mitigated and we no longer see backpressure from the server. Interestingly, another server with the exact same NICs and firmware version, under similar thermal conditions, doesn’t show this behavior.

My questions are the following:

  1. Can high temperature on the NIC or transceiver cause a high number of TX PFC frames to be sent from the NIC?
  2. How well is the threshold for the dmesg warning tuned? Can the high TX PFC behavior occur on NICs that don’t hit the threshold, as we are seeing here (no warning for the other two NICs, but a high amount of TX PFC from them too)?
  3. Do you have any suggestions on diagnostic commands on the NICs, to see if they are doing some kind of “thermal throttling”? Is this expected?
  4. Can an incompatibility between cable/transceiver and the NIC also cause this issue?
  5. What does “sensors with bit set 2 8000000000000000” in the dmesg error indicate?
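
Regarding question 5, one possible reading is sketched below. It assumes the two hex values in the message are the upper and lower 64-bit halves of a 128-bit sensor bitmask, printed upper half first, which is how the mlx5 driver source appears to format this warning. The function name is hypothetical, and which physical sensor each bit index corresponds to is not something I know.

```python
def decode_sensor_bits(msb_hex: str, lsb_hex: str) -> list:
    """Return the indices of the set bits in a 128-bit mask given as two 64-bit hex halves."""
    # Assumption: first value is bits 64..127, second value is bits 0..63.
    mask = (int(msb_hex, 16) << 64) | int(lsb_hex, 16)
    return [bit for bit in range(128) if (mask >> bit) & 1]


# Values copied from the dmesg lines above.
print(decode_sensor_bits("2", "8000000000000000"))  # [63, 65]
print(decode_sensor_bits("0", "8000000000000000"))  # [63]
```

Under that assumption, the first pair of warnings flags two sensors (bits 63 and 65) and the later pair flags only one (bit 63).
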

Firmware details for the three NICs:

Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX6
  Part Number:      MCX653106A-ECA_HPE_Ax
  Description:      HPE InfiniBand HDR100/Ethernet 100Gb 2-port MCX653106A-ECAT QSFP56 x16 Adapter
  PSID:             MT_0000000453
  PCI Device Name:  /dev/mst/mt4123_pciconf2
  Base MAC:         88e9a4b62514
  Versions:         Current        Available
     FW             20.37.1700     N/A
     PXE            3.7.0102       N/A
     UEFI           14.30.0013     N/A

  Status:           No matching image found

Device #2:
----------

  Device Type:      ConnectX6
  Part Number:      MCX653436A-HDA_HPE_Ax
  Description:      HPE InfiniBand HDR/Ethernet 200Gb 2-port QSFP56 PCIe4 x16 OCP3 MCX653436A-HDAI Adapter
  PSID:             MT_0000000593
  PCI Device Name:  /dev/mst/mt4123_pciconf1
  Base MAC:         88e9a4cd615e
  Versions:         Current        Available
     FW             20.37.1700     N/A
     PXE            3.7.0102       N/A
     UEFI           14.30.0013     N/A

  Status:           No matching image found

Device #3:
----------

  Device Type:      ConnectX6
  Part Number:      MCX653106A-ECA_HPE_Ax
  Description:      HPE InfiniBand HDR100/Ethernet 100Gb 2-port MCX653106A-ECAT QSFP56 x16 Adapter
  PSID:             MT_0000000453
  PCI Device Name:  /dev/mst/mt4123_pciconf0
  Base MAC:         88e9a4c1cc76
  Versions:         Current        Available
     FW             20.37.1700     N/A
     PXE            3.7.0102       N/A
     UEFI           14.30.0013     N/A

  Status:           No matching image found 

Dear Customer,

This is an HPE OEM adapter, so please seek assistance from HPE.
Also, please test with the latest firmware and MLNX_OFED.

Thanks
