Modify QP error (HCA reset)

As of yesterday I’m having an issue with a system that has a 40Gb dual-port daughter card in it. The network connections for both ports show as disconnected. This is a home lab with a single unmanaged switch, and I’m running two instances of OpenSM on two separate servers.

The error I’m getting shows up in the Event Viewer and repeats continuously until I stop the OpenSM service on the host.

Mellanox ConnectX-2 IPoIB Adapter device reports a “Modify QP error” on qpn #0x58 Status #0xffffffea. Therefore, the HCA Nic will be reset. (The issue is reported in Function CMcast::CompleteJoinMcastWi).

My other four 40Gb IB cards are functioning properly. Here are the things I’ve tried so far:

  1. Restarted the OpenSM service on both hosts

  2. Reset the daughter card

  3. Tried a different set of cables

  4. Reset the switch

  5. Reinstalled the device drivers (4.90)

  6. Compared the advanced driver settings against those of the other daughter cards on another host

I’ve attached a snapshot and would appreciate any help.

Thanks

I finally figured this one out …

I have a C6100 with 4 nodes, each with a dual-port daughter card installed. When I purchased the server I went through each node and updated the firmware to 2.10.720 from the 2.7… Well, one node was missed, and that was my issue.
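For anyone who wants to guard against missing a node the same way, here is a minimal sketch of a firmware audit. It assumes you have already collected each node's firmware version string by hand or with your vendor tool. The node names, the `find_outdated` helper, and the `2.7.0` value are all hypothetical placeholders (the exact old version is not stated above).

```python
def parse_version(text):
    """Turn a dotted firmware string such as '2.10.720' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def find_outdated(installed, target):
    """Return the names of nodes whose firmware is older than target, sorted."""
    wanted = parse_version(target)
    return sorted(
        node for node, version in installed.items()
        if parse_version(version) < wanted
    )


# Hypothetical inventory: names and the old version value are placeholders.
installed = {
    "node1": "2.10.720",
    "node2": "2.10.720",
    "node3": "2.7.0",
    "node4": "2.10.720",
}

print(find_outdated(installed, "2.10.720"))
```

Comparing the versions as tuples of integers matters here: a plain string comparison would rank "2.7" above "2.10" and hide the stale node.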

With the Microsoft base drivers dated 2013, the cards would show as online, but PowerShell reported RDMA Capable as false. That immediately prompted me to check the firmware version, since I originally had to update the firmware to get RDMA working.

Consider this one closed.