Hi. We’re having intermittent OID timeout issues on Windows Server 2022 with ConnectX-5/6 cards.
Windows detects the hang and resets the adapter after 5 timeouts, which isn’t pretty on a busy Hyper-V server.
The events look like this:
“The network interface “Mellanox ConnectX-6 Dx Adapter #3” has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets. Reason: The network driver did not respond to an OID request in a timely fashion. This network interface has reset 1 time(s) since it was last initialized.”
“NDIS initiates reset on device Mellanox ConnectX-6 Dx Adapter #3.”
It is listed as a known issue in the WinOF-2 driver (#1336097 at Known Issues - NVIDIA Docs).
So we wanted to raise the timeout per the instructions. The problem is that “Max OID time” (or anything similar) isn’t listed in the registry keys list, anywhere else in the manual, or in the adapter properties in Windows, so we have no idea what or where to change.
| Registry Key | Description |
| --- | --- |
| CheckForHangTOInSeconds | The interval in seconds for the Check-for-Hang mechanism. Default: 4. Note: This registry key is available only when using WinOF-2 v2.0 and later. Note: As of WinOF-2 v2.10, this key can be changed dynamically. In any case of an illegal input, the value will fall back to the default value and not to the last value used. |
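In case it saves someone the hunt: WinOF-2 per-adapter registry keys normally live under the standard network adapter class key, as DWORD values. A sketch of a .reg file follows, with two caveats: the instance index "0001" is a placeholder that varies per machine (check the DriverDesc value under each numbered subkey to find the right Mellanox adapter), and 0x1e (30 seconds) is just an example value, not a recommendation from the docs.

```
Windows Registry Editor Version 5.00

; Sketch only: "0001" is a placeholder instance index. Find your adapter's
; subkey by checking the DriverDesc value under each numbered subkey of the
; network adapter class key below.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\0001]
"CheckForHangTOInSeconds"=dword:0000001e
```

On WinOF-2 versions older than v2.10 a restart of the adapter (or a reboot) may still be needed for the change to take effect, per the dynamic-change note above.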
Um, yes, I found that, but what is the “max OID time” value? Or am I misunderstanding the formula? I assumed “max OID time” was a configurable setting; where am I supposed to find it?
Adding to this, we are also encountering the same issue on each of our Azure Stack HCI Cluster Nodes:
"
The network interface “Mellanox ConnectX-6 Lx Adapter #2” has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets. Reason: The network driver detected that its hardware has stopped responding to commands. This network interface has reset 7465 time(s) since it was last initialized.
"
As markos mentioned previously, what exactly is the “max OID time” value, and where do we find and configure it? The documentation you provided does not cover that.
Hey. We had the same issue for a while and found a way to fix it. Our ConnectX-6 Lx cards are Dell-branded and therefore run Dell firmware. Dell recently released firmware version 26.39.10.02 for them, which fixed our issues.
Unfortunately, upgrading to the latest firmware on our ConnectX-5 EN and ConnectX-6 Dx cards hasn’t solved the issue. We just received another timeout on a server that was upgraded about a week ago.
I thought we were the only ones having this problem until I came across this thread. @markos were you ever able to find a solution? Normally this wouldn’t be a big deal, but in this instance Hyper-V doesn’t seem to want to do its job, and when we get these disconnects it really makes a mess of the affected host.
We’re currently running Dell-branded ConnectX-6 Dx cards with the latest firmware (22.41.10.00) in Dell R740xd servers with Windows Server 2022 (all the latest drivers and firmware as of today, 3/28/2025) in a Failover cluster with Hyper-V. The issue seems to be either time or bandwidth related, and really only shows up when doing Live Migrations. If we wait long enough between Live Migrations (generally a week or two) and then pause/drain a host, the receiving host experiences the disconnects as described in @markos’ OP.
We are aware of WinOF-2 driver known issue #1336097 and have tried the fix described, but it didn’t work for us; we tried setting it to 30. That said, we can find zero guidance on what the “Max OID time” default value is, so there’s no way to know how to properly set the CheckForHangTOInSeconds registry value.
We have 2 Hyper-V clusters at different locations in this situation. Right now, we just drain/pause and reboot each host roughly once a week to avoid the problem. If we wait longer, then the disconnects happen and it creates quite the mess. If anyone in this thread has found a solution and just not updated it here, it would be helpful to know what was done.
Hi banduraj, unfortunately we haven’t found any solution. Dell support was useless, and we’re still getting timeouts randomly during normal operation. It might be connected to some sort of CPU/network load, but we haven’t identified the cause. For now, we’re monitoring these events, and when one happens we run a DR plan and migrate VMs off the host.
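For what it’s worth, the monitoring is nothing fancy: we watch for the reset event text and pull the adapter name and cumulative reset count out of the message. A minimal sketch in Python follows; the message format is copied from the events quoted earlier in this thread, and the alert threshold is an arbitrary choice of ours, not anything from the driver docs.

```python
import re

# Matches the NDIS reset message quoted earlier in this thread, e.g.:
#   The network interface "Mellanox ConnectX-6 Dx Adapter #3" has begun
#   resetting. ... This network interface has reset 1 time(s) since it
#   was last initialized.
# Both straight and curly quotes are accepted, since copies of the text differ.
RESET_RE = re.compile(
    r'The network interface ["\u201c](?P<adapter>[^"\u201d]+)["\u201d] '
    r'has begun resetting.*?'
    r'has reset (?P<count>\d+) time\(s\) since it was last initialized',
    re.DOTALL,
)

def parse_reset_event(message: str):
    """Return (adapter_name, cumulative_reset_count), or None for other events."""
    m = RESET_RE.search(message)
    if m is None:
        return None
    return m.group("adapter"), int(m.group("count"))

def needs_attention(message: str, threshold: int = 1) -> bool:
    """True once the cumulative reset count reaches an (arbitrary) threshold."""
    parsed = parse_reset_event(message)
    return parsed is not None and parsed[1] >= threshold
```

We feed exported event text through this and page ourselves when `needs_attention` fires, then kick off the DR migration.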
For reference, we’re now on FW 22.41.10.00 (same as you) and driver 24.04.03, which are both the latest available.
Thanks for the response. Sadly, Dell support has been completely useless for us as well. We actually switched to these Mellanox/NVIDIA cards because we were having different issues with our previous QLogic cards. We thought Mellanox/NVIDIA were the market leaders in this space, but I guess they all have major issues.
If we come across a solution, I’ll be sure to report back here.