How to monitor ASIC and Transceiver temperature for ConnectX on RHEL 9?

Hello NVIDIA Team and Community,

I am currently working with an NVIDIA ConnectX-6 (MT28908) network adapter on a RHEL 9.7 system (Kernel 5.14). I have already compiled and set up the DPDK environment for high-performance testing.

I would like to monitor the real-time temperature of the adapter during high-load stress tests to prevent overheating and ensure stability.

My Questions:

  1. Besides lm_sensors, what is the recommended official NVIDIA tool to monitor the ASIC core temperature in real-time?

  2. How can I monitor the Optical Module (QSFP/Transceiver) temperature if the card is being used by DPDK?

  3. Are there any specific DNF/YUM packages or MFT components I should check to ensure I have all the diagnostic tools (like mlxlink or mxtool)?

Hi,

  1. You can use mget_temp to get the ASIC temperature of the adapter:

root@l-csi-ufm3-0418:~# lspci |grep -i Mell
4b:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
4b:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
98:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
98:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
root@l-csi-ufm3-0418:~# mget_temp -d 4b:00.0
67
root@l-csi-ufm3-0418:~# mget_temp -d mlx5_0
67
root@l-csi-ufm3-0418:~#
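
For continuous monitoring during a stress test, the single reading above can be wrapped in a small polling loop. This is only a minimal sketch: the function names, the TEMP_CMD variable, and the 85 C default limit are illustrative assumptions, not NVIDIA-documented values. It also assumes mget_temp prints a bare integer in degrees Celsius, as in the output above.

```shell
# Hypothetical helpers (names are illustrative, not part of MFT).
# TEMP_CMD defaults to mget_temp; override it to test without hardware.
TEMP_CMD="${TEMP_CMD:-mget_temp}"

# Print one timestamped reading for device $1.
# Returns 0 if below limit $2, 1 if at or above it, 2 if no reading was obtained.
check_asic_temp() {
    _dev="$1"
    _limit="$2"
    _raw=$("$TEMP_CMD" -d "$_dev") || return 2
    # Keep digits only, in case the tool pads its output with whitespace.
    _temp=$(printf '%s' "$_raw" | tr -cd '0-9')
    [ -n "$_temp" ] || return 2
    if [ "$_temp" -ge "$_limit" ]; then
        echo "$(date +%T) $_dev ${_temp}C WARNING (limit ${_limit}C)"
        return 1
    fi
    echo "$(date +%T) $_dev ${_temp}C"
}

# Poll device $1 every $3 seconds (default 5) against limit $2.
# The default limit of 85 is an assumed example value, not a vendor threshold.
poll_asic_temp() {
    while true; do
        check_asic_temp "$1" "${2:-85}"
        sleep "${3:-5}"
    done
}
```

After sourcing these functions, running poll_asic_temp 4b:00.0 85 5 as root would print one reading every five seconds until interrupted. Pick the limit from your adapter's own thermal specification.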

  2. You can use mlxlink for the transceiver temperature.
  3. You can download the latest MFT from the following page:
    NVIDIA Firmware Tools (MFT)
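
For the transceiver reading, a rough sketch is to query the module information with mlxlink and keep only the temperature line. The helper name and the MLXLINK_CMD variable are hypothetical. It assumes the module info output (the -m option) contains a line with the word "Temperature", which may differ between MFT versions and requires a module that reports diagnostics.

```shell
# Hypothetical helper (name is illustrative, not part of MFT).
# MLXLINK_CMD defaults to mlxlink; override it to test without hardware.
MLXLINK_CMD="${MLXLINK_CMD:-mlxlink}"

# Print the temperature line(s) from the module info of device $1.
# Assumes the module info section labels the reading with "Temperature".
module_temp_lines() {
    "$MLXLINK_CMD" -d "$1" -m | grep -i 'temperature'
}
```

For example, module_temp_lines 4b:00.0 run as root would print only the matching line(s). If nothing is printed, check the full mlxlink -d 4b:00.0 -m output to see how your module reports its temperature.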

Thanks,

Suo

Thank you for your answer; it really cleared things up for me.