ConnectX-5/6 OID timeouts

markos · January 16, 2024, 2:11pm

Hi. We're having intermittent issues with OID timeouts on Windows Server 2022 with ConnectX-5/6 cards.
Windows detects a problem and resets the adapter after 5 timeouts, which isn’t pretty on a busy Hyper-V server.

The events look like this:
“The network interface “Mellanox ConnectX-6 Dx Adapter #3” has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets. Reason: The network driver did not respond to an OID request in a timely fashion. This network interface has reset 1 time(s) since it was last initialized.”

“NDIS initiates reset on device Mellanox ConnectX-6 Dx Adapter #3.”

It is listed as a known issue in the WinOF-2 driver (#1336097 at Known Issues - NVIDIA Docs).
So we wanted to raise the timeout per the instructions. The problem is that “Max OID time”, or anything similar, isn’t listed in the registry keys list, anywhere else in the manual, or in the adapter properties in Windows, so we have no idea what to change or where.

xiaofengl · January 17, 2024, 9:24am

As the release notes (RN) list, you need to change the registry key CheckForHangTOInSeconds so that “2 x CheckForHangTOInSeconds > max OID time”.

https://docs.nvidia.com/networking/display/winof2v2310/configuring+the+driver+registry+keys

| Key | Type | Range and default | Description |
| --- | --- | --- | --- |
| CheckForHangTOInSeconds | REG_DWORD | [0 – MAX_ULONG], default: 4 | The interval in seconds for the Check-for-Hang mechanism. Note: This registry key is available only when using WinOF-2 v2.0 and later. Note: As of WinOF-2 v2.10, this key can be changed dynamically. In any case of an illegal input, the value will fall back to the default value and not to the last value used. |
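
For anyone working through the arithmetic of the rule quoted above (2 x CheckForHangTOInSeconds > max OID time), here is a minimal Python sketch. It assumes "max OID time" is a duration in seconds that you have measured or estimated yourself; the thread does not say where that number comes from. The function names are hypothetical and not part of WinOF-2. The sketch only does the math; it does not read or write the registry.

```python
import math

# Upper bound of a REG_DWORD (32-bit unsigned), matching the documented range.
MAX_ULONG = 2**32 - 1


def satisfies_rule(check_for_hang_s: int, max_oid_time_s: float) -> bool:
    """Return True if 2 x CheckForHangTOInSeconds > max OID time."""
    return 2 * check_for_hang_s > max_oid_time_s


def min_check_for_hang_seconds(max_oid_time_s: float) -> int:
    """Smallest whole number of seconds that satisfies the rule.

    max_oid_time_s is an assumed input: the longest OID completion
    time, in seconds, that you expect on your system.
    """
    if max_oid_time_s < 0:
        raise ValueError("max OID time must not be negative")
    # The inequality is strict, so an exact half is not enough:
    # take the floor of half the time and add one.
    value = math.floor(max_oid_time_s / 2) + 1
    if value > MAX_ULONG:
        raise ValueError("required value does not fit in a REG_DWORD")
    return value
```

Under these assumptions, the default of 4 satisfies the rule only while the longest OID takes less than 8 seconds; an OID that takes exactly 8 seconds would need a value of at least 5.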

markos · January 17, 2024, 9:29am

Um, yes, I found that, but what is the “max OID time” value? Or am I misunderstanding the formula? I assumed “max OID time” is a configurable setting. If it isn’t, where am I supposed to find it?

ict6 · March 21, 2024, 10:51pm

Adding to this, we are also encountering the same issue on each of our Azure Stack HCI Cluster Nodes:
"
The network interface “Mellanox ConnectX-6 Lx Adapter #2” has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets. Reason: The network driver detected that its hardware has stopped responding to commands. This network interface has reset 7465 time(s) since it was last initialized.
"
As markos mentioned previously, what exactly is the “max OID time” value and where do we find and configure it? The documentation you provided does not cover that.

ict6 · April 17, 2024, 1:50am

Hey. We had the same issue for a while and we’ve found a way to fix it. Our ConnectX-6 Lx cards are Dell-branded ones and thus have Dell firmware. Dell recently released firmware version 26.39.10.02 for them, which fixed our issues.