The latest firmware version for the CX-6 card is 20.26.1040 (Mellanox GA). We are continuously working on improvements for HDR, so we recommend staying on the latest available version; please upgrade the FW to that version. If the PSID of the card is Cray-specific and does not start with “MT_”, please validate internally against the latest version supported and tested by Cray. Also, please make sure the fan speed of the node is set to maximum.
Please let me know if upgrading the FW resolves the issue.
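To tell whether the card carries a Mellanox PSID or an OEM one, a small helper like the following may be useful. It is only a sketch: the function name psid_origin is made up for illustration, and the commented usage assumes that MFT is installed and that the flint query output contains a line beginning with "PSID:".

```shell
# Hypothetical helper: classify a PSID string.
# Prints "mellanox" if the PSID starts with "MT_", otherwise "oem"
# (for example a Cray-specific PSID).
psid_origin() {
    case "$1" in
        MT_*) echo "mellanox" ;;
        *)    echo "oem" ;;
    esac
}

# Usage on the affected node (not executed here; needs root, MFT and the card).
# <device> is a placeholder for the card's MST device or PCI address:
#   psid=$(flint -d <device> query | awk '/^PSID:/ { print $2 }')
#   psid_origin "$psid"
```

If the helper prints "oem", the firmware image should come from the OEM (here Cray) rather than from the Mellanox GA download.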
Regarding your question about checking the current temperature: you need to install the Mellanox Firmware Tools (MFT) on the affected node. You can download the latest version through the following link → Mellanox Firmware Tools (MFT)
Output example:
lspci -d 15b3:
04:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
04:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
06:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
06:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
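Once MFT is installed, the PCI addresses from a listing like the one above can be passed to the mget_temp tool that ships with MFT. The following is a minimal sketch rather than an official procedure: the helper name mlx_pci_addrs is made up for illustration, and the commented loop assumes that mget_temp accepts a PCI address as its device argument on your MFT version (otherwise use the device name shown by mst status).

```shell
# Hypothetical helper: read `lspci -d 15b3:` output on stdin and print
# the PCI address (first field) of every non-empty line.
mlx_pci_addrs() {
    awk 'NF { print $1 }'
}

# Usage on the affected node (not executed here; needs root, MFT and the card):
#   mst start
#   for dev in $(lspci -d 15b3: | mlx_pci_addrs); do
#       printf '%s: ' "$dev"
#       mget_temp -d "$dev"
#   done
```

Running the loop a few times while the node is under load should show whether the temperature changes after the fan speed is set to maximum.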