How can I check the current temperature and set threshold of MLX_5 (ConnectX-6)?

Dear Mellanox Support team,

The following message is recorded in /var/log/messages.

Dec 3 18:47:13 hgpu009 kernel: mlx5_core 0000:3b:00.0: mlx5_temp_warning_event:564:(pid 0): High temperature on sensors with bit set 0 0

It appears that this message has been recorded the following number of times in the past month.

[root@hbcm01 log]# grep "High temperature" messages* | grep gpu | wc -l

77

[root@hbcm01 log]# grep "High temperature" messages* | grep cpu | wc -l

4
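To see which nodes log the warning most often, the same log lines could be broken down per host. This is only a rough sketch: it assumes the traditional syslog layout shown in the sample line above, where the fourth whitespace-separated field is the hostname, and the function name count_by_host is made up for illustration.

```shell
#!/bin/sh
# Sketch: count "High temperature" warnings per host.
# Assumption: syslog lines look like
#   Dec 3 18:47:13 hgpu009 kernel: mlx5_core ...: High temperature ...
# so that field 4 is the hostname.

# Reads log lines on stdin, prints "<count> <host>", highest count first.
count_by_host() {
    awk '/High temperature/ { n[$4]++ }
         END { for (h in n) print n[h], h }' | sort -rn
}
```

It can be fed from the log directory, for example with: cat messages* | count_by_host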


How can I check the current temperature and set threshold?

The IB HCAs are installed in 2VU GPU servers and 4N2U servers.

CS-MLNX-N3HR50-X16-C6-1P CONNECTX-6 VPI ADAPTER CARD, 100GB/S (HDR100, EDR INFINIBAND), FW Version: 20.25.7020

Thank you,

Kazuki

Hi Kazuki,

The latest firmware version for the CX-6 card is 20.26.1040 (Mellanox GA). For HDR we are continually making improvements, so we recommend staying on the latest available version; please upgrade the firmware to that version. If the PSID of the card is Cray-specific and does not start with “MT_”, please check internally which version is the latest one supported and tested by Cray. Also, please make sure the fan speed of the node is set to maximum.

Please let me know if upgrading the FW resolves the issue.

With respect to your question related to checking the current temperature, you need to install Mellanox Firmware Tools on the affected node. You can download the latest version through the following link → Mellanox Firmware Tools (MFT)

Output example:

lspci -d 15b3:

04:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

04:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

06:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]

06:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

mget_temp -d 06:00.1

70
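If a node has several Mellanox functions, the two commands above could be combined into a small loop. The following is a minimal sketch, assuming MFT is installed so that mget_temp is on the PATH, that it is run as root, and that mget_temp prints a single number as in the example above. The function names pci_addrs and print_temps are made up for illustration.

```shell
#!/bin/sh
# Sketch: print the temperature of every Mellanox PCI function on this node.
# Assumptions: lspci is available, MFT provides mget_temp, run as root.

# Extract the PCI address (first field) from `lspci -d 15b3:` style output,
# skipping blank lines.
pci_addrs() {
    awk 'NF { print $1 }'
}

# Print "<pci address> <temperature>" for each Mellanox function.
print_temps() {
    lspci -d 15b3: | pci_addrs | while read -r addr; do
        printf '%s %s\n' "$addr" "$(mget_temp -d "$addr")"
    done
}
```

Calling print_temps on the affected node then lists one line per device, which makes it easier to spot the card that triggers the warning.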

Thanks,

Namrata.

Dear Mellanox support,

I couldn’t find any information about the ConnectX-6 temperature threshold, either on the website or in the BIOS settings.

If there is any document that describes the threshold setting, or a command to check it, please let me know.

Thank you,

Kazuki