ConnectX-6 Lx #4 device reports an "EQ stuck" on EQn 0x4. Attempting recovery

We have a 6-node Windows Server 2022 Failover cluster

We use 2x Mellanox ConnectX-6 Lx adapters in each server purely for iSCSI connection to our SAN. We’re getting the following event on 2 out of the 6 servers:

ConnectX-6 Lx #4 device reports an “EQ stuck” on EQn 0x4. Attempting recovery.

I can’t seem to link this to any particular workload or action on the servers, and we don’t notice any undesired effect. A previous thread suggests it could be related to CPU utilisation or RSS configuration.

CPU utilisation is generally low (~10-15%) with RSS configured:

MaxProcessors - 4
NumberOfReceiveQueues - 8
Profile - NUMAStatic
RSS processor array is different for each interface

Any suggestions as to what could be causing this would be greatly appreciated.

Hi,

Thanks for your question.
The mentioned message may be caused by CPU load, but you wrote the load is not high.

To understand the potential root cause of this behavior we will need to investigate system logs, and this will require opening of a new support case in Nvidia portal (or sending an email to enterprisesupport@nvidia.com) with all the relevant logs after the issue is observed. Then the case will be handled according to the support entitlement.

Best Regards,
Anatoly