The log line “mlx5_core 0000:51:00.6: poll_health:853:(pid 0): device’s health compromised - reached miss count” is a driver warning message. On its own, it does not report a fatal FW error.
However, “device’s health compromised - reached miss count” should not be the only entry in your server’s messages log. In most cases it appears together with other log messages, such as “synd 0x1: firmware internal error” or “print_health_info:466:(pid 0): ext_synd 0x8a02”.
So the suggestions are:
If there are many firmware error logs before or after the “device’s health compromised - reached miss count” message, please share all of them with us.
If you have updated the firmware and driver, please do an AC power cycle on this server and check whether “device’s health compromised - reached miss count” still appears at server boot.
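To gather the surrounding log lines for sharing, a minimal sketch along these lines may help. It assumes the kernel log is readable through dmesg on your system. The helper name collect_mlx5_logs is hypothetical and is not part of any Mellanox tool.

```shell
# Hypothetical helper (not part of MFT or the mlx5 driver):
# keep only the mlx5_core lines from a kernel log read on stdin,
# so health warnings and any syndrome messages stay together in order.
collect_mlx5_logs() {
    grep -E 'mlx5_core'
}
```

It could be used as `dmesg -T | collect_mlx5_logs > mlx5_core_dmesg.txt`, where the output file name is arbitrary. The resulting file then holds the “reached miss count” warning together with any neighbouring firmware error lines.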
-W- Firmware flash cache access is enabled. Running in this mode may cause the firmware to hang.
HW Info:
HwDevId 525
HwRevId 0x0
Flash Info:
Type GD25LBxxx
TotalSize 0x1000000
Banks 0x1
SectorSize 0x1000
WriteBlockSize 0x10
CmdSet 0x80
QuadEn 0
DummyCycles 15
Flash0.WriteProtected Disabled
JEDEC_ID 0x1840c8
2. The set command failed even when I used the mst device
[root@localhost ~]# flint -d 51:00.0 -ocr hw set QuadEn=1
-W- Firmware flash cache access is enabled. Running in this mode may cause the firmware to hang.
-E- Unknown option “set” for the “Hw” command. you can use query.
[root@localhost ~]# flint -d /dev/mst/mt4119_pciconf0 -ocr hw set QuadEn=1
-W- Firmware flash cache access is enabled. Running in this mode may cause the firmware to hang.
-E- Unknown option “set” for the “Hw” command. you can use query.
I have changed the value of “QuadEn” to 1, but it didn’t help: I can still see the error after reboot.
[root@localhost ~]# flint -d 51:00.0 -ocr hw query
-W- Firmware flash cache access is enabled. Running in this mode may cause the firmware to hang.
HW Info:
HwDevId 525
HwRevId 0x0
Flash Info:
Type GD25LBxxx
TotalSize 0x1000000
Banks 0x1
SectorSize 0x1000
WriteBlockSize 0x10
CmdSet 0x80
QuadEn 1
DummyCycles 15
Flash0.WriteProtected Disabled
JEDEC_ID 0x1840c8
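For re-checking this one field after a reboot or power cycle, the query output above can be filtered down to the QuadEn value. This is only a sketch: it assumes the output keeps the "name value" layout shown above, and the helper name get_quad_en is hypothetical.

```shell
# Hypothetical helper: print the QuadEn value from "flint ... hw query"
# output read on stdin (assumes the "QuadEn <value>" layout shown above).
get_quad_en() {
    awk '$1 == "QuadEn" { print $2 }'
}
```

For example, `flint -d 51:00.0 -ocr hw query | get_quad_en` reuses the same query command as above and prints only the value.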
Based on the dmesg log you sent, I suggest moving this NIC to another PCI slot and rebooting the server again.
If the problem still exists and this NIC is still under warranty, you can submit a case ticket through our support portal (not the forum) to request an RMA. (Our engineers will likely ask a few questions before beginning the RMA process.)