Hi,
I am experiencing an issue with a ConnectX-6 Dx NIC after installing it in an HP Z6 workstation.
Initially, the NIC is recognized by the system and appears correctly in the output of commands such as lspci and ip. After a short time, however, the NIC becomes unresponsive and is no longer detected by the operating system. The syslog shows a critical "High temperature" health event on both ports, followed by "PCI slot is unavailable" messages.
Could you please confirm whether the NIC is shutting itself down because of excessive temperature?
Additionally, I noticed a delay of about four minutes between the driver reporting the high temperature (10:19:01) and the PCI slot becoming unavailable (10:23:20). Could you please confirm whether this delay is expected behavior for the NIC’s thermal protection mechanism?
Below are the relevant entries from the system log and the output of the lspci command:
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: poll_health:1099:(pid 0): device's health compromised - reached miss count
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:498:(pid 0): Health issue observed, High temperature, severity(2) CRITICAL:
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:502:(pid 0): assert_var[0] 0x00000073
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:502:(pid 0): assert_var[1] 0x00000073
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:502:(pid 0): assert_var[2] 0x00000000
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:502:(pid 0): assert_var[3] 0x00000000
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:502:(pid 0): assert_var[4] 0x00000000
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:502:(pid 0): assert_var[5] 0x00000000
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:504:(pid 0): assert_exit_ptr 0x214a25fc
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:505:(pid 0): assert_callra 0x214a2540
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:507:(pid 0): fw_ver 22.43.3608
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:508:(pid 0): time 1758849532
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:509:(pid 0): hw_id 0x00000212
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:510:(pid 0): rfr 0
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:511:(pid 0): severity 2 (CRITICAL)
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:512:(pid 0): irisc_index 0
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:514:(pid 0): synd 0x10: High temperature
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:515:(pid 0): ext_synd 0x0000
Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:516:(pid 0): raw fw_ver 0x162b0e18
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: poll_health:1099:(pid 0): device's health compromised - reached miss count
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:498:(pid 0): Health issue observed, High temperature, severity(2) CRITICAL:
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:502:(pid 0): assert_var[0] 0x00000073
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:502:(pid 0): assert_var[1] 0x00000073
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:502:(pid 0): assert_var[2] 0x00000000
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:502:(pid 0): assert_var[3] 0x00000000
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:502:(pid 0): assert_var[4] 0x00000000
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:502:(pid 0): assert_var[5] 0x00000000
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:504:(pid 0): assert_exit_ptr 0x214a25fc
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:505:(pid 0): assert_callra 0x214a2540
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:507:(pid 0): fw_ver 22.43.3608
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:508:(pid 0): time 1758849532
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:509:(pid 0): hw_id 0x00000212
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:510:(pid 0): rfr 0
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:511:(pid 0): severity 2 (CRITICAL)
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:512:(pid 0): irisc_index 0
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:514:(pid 0): synd 0x10: High temperature
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:515:(pid 0): ext_synd 0x0000
Sep 26 10:19:02 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:516:(pid 0): raw fw_ver 0x162b0e18
Sep 26 10:20:00 tower11 systemd[1]: Starting system activity accounting tool...
Sep 26 10:20:00 tower11 systemd[1]: sysstat-collect.service: Succeeded.
Sep 26 10:20:00 tower11 systemd[1]: Started system activity accounting tool.
Sep 26 10:23:20 tower11 kernel: mlx5_core 0000:5e:00.1: poll_health:1083:(pid 0): Fatal error 1 detected
Sep 26 10:23:20 tower11 kernel: mlx5_core 0000:5e:00.1: print_health_info:491:(pid 0): PCI slot is unavailable
Sep 26 10:23:20 tower11 kernel: mlx5_core 0000:5e:00.1: mlx5_pcie_event:296:(pid 164): PCIe slot power capability was not advertised.
Sep 26 10:23:21 tower11 kernel: mlx5_core 0000:5e:00.0: poll_health:1083:(pid 0): Fatal error 1 detected
Sep 26 10:23:21 tower11 kernel: mlx5_core 0000:5e:00.0: print_health_info:491:(pid 0): PCI slot is unavailable
Sep 26 10:23:21 tower11 kernel: mlx5_core 0000:5e:00.0: mlx5_pcie_event:296:(pid 163): PCIe slot power capability was not advertised.
Sep 26 10:23:24 tower11 kernel: mlx5_core 0000:5e:00.1: mlx5_crdump_collect:51:(pid 3189): crdump: failed to lock vsc gw err -16
Sep 26 10:23:24 tower11 kernel: mlx5_core 0000:5e:00.1: mlx5_health_try_recover:347:(pid 3189): handling bad device here
Sep 26 10:23:24 tower11 kernel: mlx5_core 0000:5e:00.1: mlx5_error_sw_reset:241:(pid 3189): start
Sep 26 10:23:25 tower11 kernel: mlx5_core 0000:5e:00.0: mlx5_crdump_collect:51:(pid 3244): crdump: failed to lock vsc gw err -16
Sep 26 10:23:25 tower11 kernel: mlx5_core 0000:5e:00.0: mlx5_health_try_recover:347:(pid 3244): handling bad device here
Sep 26 10:23:25 tower11 kernel: mlx5_core 0000:5e:00.0: mlx5_error_sw_reset:241:(pid 3244): start
Sep 26 10:23:28 tower11 kernel: mlx5_core 0000:5e:00.1: NIC IFC still 7 after 4000ms.
Sep 26 10:23:28 tower11 kernel: mlx5_core 0000:5e:00.1: mlx5_error_sw_reset:278:(pid 3189): end
Sep 26 10:23:28 tower11 kernel: mlx5_core 0000:5e:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
Sep 26 10:23:28 tower11 kernel: mlx5_core 0000:5e:00.1: mlx5_wait_for_pages:919:(pid 3189): Skipping wait for vf pages stage
Sep 26 10:23:28 tower11 kernel: mlx5_core 0000:5e:00.1: mlx5_wait_for_pages:919:(pid 3189): Skipping wait for vf pages stage
Sep 26 10:23:29 tower11 kernel: mlx5_core 0000:5e:00.0: NIC IFC still 7 after 4000ms.
Sep 26 10:23:29 tower11 kernel: mlx5_core 0000:5e:00.0: mlx5_error_sw_reset:278:(pid 3244): end
lspci output:
5e:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
5e:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
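
In case it helps when comparing against other occurrences, this is roughly how I measure the gap between the first "High temperature" entry and the first "PCI slot is unavailable" entry in the syslog. It is only a minimal sketch: the function names are my own, the two sample lines are copied from the log above, and it assumes the standard syslog timestamp format (which has no year, so a placeholder year is added purely to make parsing work).

```python
from datetime import datetime

# Two entries copied from the syslog excerpt above:
# the first thermal warning and the first slot failure.
LOG_LINES = [
    "Sep 26 10:19:01 tower11 kernel: mlx5_core 0000:5e:00.0: "
    "print_health_info:514:(pid 0): synd 0x10: High temperature",
    "Sep 26 10:23:20 tower11 kernel: mlx5_core 0000:5e:00.1: "
    "print_health_info:491:(pid 0): PCI slot is unavailable",
]


def first_timestamp(lines, needle):
    """Return the timestamp of the first line containing `needle`, or None."""
    for line in lines:
        if needle in line:
            # Syslog timestamps carry no year; prepend a placeholder year
            # (assumption: both events fall within the same year).
            return datetime.strptime("2000 " + line[:15], "%Y %b %d %H:%M:%S")
    return None


def thermal_to_slot_delay(lines):
    """Time between the first thermal warning and the first slot failure."""
    hot = first_timestamp(lines, "High temperature")
    gone = first_timestamp(lines, "PCI slot is unavailable")
    if hot is None or gone is None:
        return None
    return gone - hot


delay = thermal_to_slot_delay(LOG_LINES)
print(delay)  # 0:04:19
```

To run it against a full log, replace LOG_LINES with the lines read from the syslog file.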