mlx5_core poll_health raises an error: device's health compromised - reached miss count

After I created several VFs on port 0 of a ConnectX-5 adapter, I got the following system messages:
[ 1810.527156] mlx5_core 0000:51:00.3: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1811.487131] mlx5_core 0000:51:00.5: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1812.767131] mlx5_core 0000:51:00.6: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1812.831130] mlx5_core 0000:51:01.0: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1812.841027] mlx5_core 0000:51:00.7: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1815.007129] mlx5_core 0000:51:01.1: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1815.519130] mlx5_core 0000:51:01.3: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1816.159129] mlx5_core 0000:51:01.2: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1816.415130] mlx5_core 0000:51:01.6: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1816.543131] mlx5_core 0000:51:01.4: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1817.119130] mlx5_core 0000:51:01.5: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1818.271130] mlx5_core 0000:51:01.7: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1819.551131] mlx5_core 0000:51:02.0: poll_health:853:(pid 0): device’s health compromised - reached miss count
[ 1819.561031] mlx5_core 0000:51:02.1: poll_health:853:(pid 0): device’s health compromised - reached miss count

What do these errors mean, and what is their impact?

I updated the firmware and driver, but that does not seem to help.

Firmware version: 16.34.1002
Driver version: 5.7-1.0.2.0
OS: RHEL 8.3

Hi,

The log line “mlx5_core 0000:51:00.6: poll_health:853:(pid 0): device’s health compromised - reached miss count” is a driver warning message, so on its own it should not indicate a fatal error in the firmware.

However, “device’s health compromised - reached miss count” should not be the only entry that appears in your server's message log. In most cases it comes together with other log messages, such as “synd 0x1: firmware internal error” or “print_health_info:466:(pid 0): ext_synd 0x8a02”.
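
To check whether the warning really appears alone, the kernel log can be grouped by PCI function. Below is a minimal Python sketch, not an official tool. It assumes the log lines follow the `[ time] mlx5_core <PCI address>: <message>` layout shown in this thread; the function names `group_health_lines` and `devices_without_syndrome` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the lines quoted in this thread:
# [ 1810.527156] mlx5_core 0000:51:00.3: poll_health:853:(pid 0): ...
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+"
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):\s+"
    r"(?P<msg>.*)"
)


def group_health_lines(dmesg_text):
    """Group mlx5_core health-related log lines by PCI address."""
    report = defaultdict(lambda: {"miss_count": [], "syndrome": []})
    for line in dmesg_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # not an mlx5_core line in the assumed layout
        ts, bdf, msg = float(m["ts"]), m["bdf"], m["msg"]
        if "reached miss count" in msg:
            report[bdf]["miss_count"].append(ts)
        elif "synd" in msg:
            # Matches both "synd 0x..." and "ext_synd 0x..." messages
            report[bdf]["syndrome"].append((ts, msg))
    return dict(report)


def devices_without_syndrome(report):
    """PCI functions that logged 'reached miss count' but no syndrome line."""
    return sorted(
        bdf for bdf, r in report.items()
        if r["miss_count"] and not r["syndrome"]
    )
```

For example, save the output of `dmesg` to a file, pass its contents to `group_health_lines`, and then call `devices_without_syndrome` on the result to list the functions where the warning appears with no accompanying syndrome message.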

So my suggestions are:

  1. If there are many firmware error logs before or after the “device’s health compromised - reached miss count” message, please share all of them with us.

  2. If you did update the firmware and driver, please do an AC power cycle on this server and check whether the “device’s health compromised - reached miss count” message still appears at boot.

Longran Wei
Nvidia Support Team

Hi,
Thanks for the reply. The log from one of my systems is attached.

I also see the same error reported for the VF ports that were just created.

dmesg.zip (35.4 KB)

Hi lvzhipeng,

Please run this command: “flint -d 51:00.0 -ocr hw query”.

If the output shows “QuadEn 0”,
you can try running “flint -d 51:00.0 -ocr hw set QuadEn=1”.

Then reboot and check whether the error still occurs. Please let me know the result.
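
If you want to check the field from a script, here is a small Python sketch that reads the text printed by the query command and reports the QuadEn state. It assumes the simple `Key Value` line layout seen in the query output pasted in this thread; `parse_flint_hw_query` and `quad_enabled` are hypothetical helper names, not part of flint or MFT.

```python
def parse_flint_hw_query(output):
    """Parse 'Key Value' lines from pasted `flint ... hw query` output."""
    info = {}
    for raw in output.splitlines():
        line = raw.strip()
        # Skip blank lines, flint messages ("-W- ...", "-E- ..."),
        # shell prompts ("[root@...") and section headers ("HW Info:").
        if not line or line[0] in "-[" or line.endswith(":"):
            continue
        key, _, value = line.partition(" ")
        info[key] = value.strip()
    return info


def quad_enabled(info):
    """Return True or False for QuadEn, or None if the field is missing."""
    value = info.get("QuadEn")
    return None if value is None else value == "1"
```

Feed it the captured output of the query command; a result of `False` corresponds to the “QuadEn 0” case described above.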

Longran Wei
Nvidia Support Team

The “QuadEn” value is 0, but the set operation is not supported.

  1. query
    [root@localhost ~]# flint -d 51:00.0 -ocr hw query

-W- Firmware flash cache access is enabled. Running in this mode may cause the firmware to hang.
HW Info:
HwDevId 525
HwRevId 0x0
Flash Info:
Type GD25LBxxx
TotalSize 0x1000000
Banks 0x1
SectorSize 0x1000
WriteBlockSize 0x10
CmdSet 0x80
QuadEn 0
DummyCycles 15
Flash0.WriteProtected Disabled
JEDEC_ID 0x1840c8
  2. set fails, even when I use the mst device
[root@localhost ~]# flint -d 51:00.0 -ocr hw set QuadEn=1

-W- Firmware flash cache access is enabled. Running in this mode may cause the firmware to hang.
-E- Unknown option “set” for the “Hw” command. you can use query.
[root@localhost ~]# flint -d /dev/mst/mt4119_pciconf0 -ocr hw set QuadEn=1

-W- Firmware flash cache access is enabled. Running in this mode may cause the firmware to hang.
-E- Unknown option “set” for the “Hw” command. you can use query.

I have changed the value of “QuadEn” to 1, but it didn’t help: I still see the error after reboot.
[root@localhost ~]# flint -d 51:00.0 -ocr hw query

-W- Firmware flash cache access is enabled. Running in this mode may cause the firmware to hang.
HW Info:
HwDevId 525
HwRevId 0x0
Flash Info:
Type GD25LBxxx
TotalSize 0x1000000
Banks 0x1
SectorSize 0x1000
WriteBlockSize 0x10
CmdSet 0x80
QuadEn 1
DummyCycles 15
Flash0.WriteProtected Disabled
JEDEC_ID 0x1840c8

dmesg_with_QuadEn.log (5.3 KB)

Hi lvzhipeng,

Thanks for your update.

  1. Based on the dmesg log you sent, I suggest moving this NIC to another PCI slot and rebooting the server.

  2. If the problem persists and this NIC is still under warranty, you can submit a case ticket through our support portal (not the forum) to request an RMA. (Our engineer will ask a few questions before starting the RMA process.)

Thanks!
Longran Wei
Nvidia Support Team