ConnectX-5/6 OID timeouts

Hi. We're having intermittent issues with OID timeouts on Windows Server 2022 with ConnectX-5/6 cards.
Windows detects a problem and resets the adapter after 5 timeouts, which isn't pretty on a busy Hyper-V server.

The events look like this:
“The network interface “Mellanox ConnectX-6 Dx Adapter #3” has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets. Reason: The network driver did not respond to an OID request in a timely fashion. This network interface has reset 1 time(s) since it was last initialized.”

“NDIS initiates reset on device Mellanox ConnectX-6 Dx Adapter #3.”

It is listed as a known issue in the WinOF-2 driver (#1336097 at Known Issues - NVIDIA Docs).
We wanted to raise the timeout per the instructions, but the problem is that "Max OID time" (or anything similar) isn't listed in the registry keys list, anywhere else in the manual, or in the adapter properties in Windows, so we have no idea what or where to change.

As the Release Notes state, you need to change the registry key so that 2 × CheckForHangTOInSeconds > max OID time:

https://docs.nvidia.com/networking/display/winof2v2310/configuring+the+driver+registry+keys

| Key Name | Key Type | Values | Description |
| --- | --- | --- | --- |
| CheckForHangTOInSeconds | REG_DWORD | [0 – MAX_ULONG], default: 4 | The interval in seconds for the Check-for-Hang mechanism. Note: this registry key is available only when using WinOF-2 v2.0 and later. Note: as of WinOF-2 v2.10, this key can be changed dynamically; in any case of an illegal input, the value will fall back to the default value and not to the last value used. |
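For anyone else hunting for where this key actually lives: per-adapter WinOF-2 registry keys are generally set under the adapter's subkey of the standard network-class registry key. This is only a sketch based on that convention, not something from the NVIDIA manual; verify the DriverDesc of each subkey before changing anything.

```powershell
# Sketch: find the ConnectX adapter subkey(s) under the standard network
# adapter class key and read the current CheckForHangTOInSeconds (if any).
# The class GUID below is the standard Windows network adapter class; the
# assumption that WinOF-2 keys live here should be verified per driver version.
$class = 'HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}'
Get-ChildItem $class -ErrorAction SilentlyContinue | ForEach-Object {
    $p = Get-ItemProperty $_.PSPath -ErrorAction SilentlyContinue
    if ($p.DriverDesc -like '*ConnectX*') {
        [PSCustomObject]@{
            SubKey  = $_.PSChildName
            Adapter = $p.DriverDesc
            HangTO  = $p.CheckForHangTOInSeconds   # $null if the key isn't set yet
        }
    }
}
# To set it (example value only; restart the adapter or reboot afterwards):
# Set-ItemProperty "$class\0001" -Name CheckForHangTOInSeconds -Value 132 -Type DWord
```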

Um, yes, I found that, but what is the "max OID time" value? Or am I misunderstanding the formula? I assumed "max OID time" is a configurable setting; where am I supposed to find it?

Adding to this, we are also encountering the same issue on each of our Azure Stack HCI Cluster Nodes:
"
The network interface “Mellanox ConnectX-6 Lx Adapter #2” has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets. Reason: The network driver detected that its hardware has stopped responding to commands. This network interface has reset 7465 time(s) since it was last initialized.
"
As markos mentioned previously, what exactly is the “max OID time” value and where do we find and configure it? The documentation you provided does not provide information around that.

Hey. We had the same issue for a while and we found a way to fix it. Our ConnectX-6 Lx cards are Dell-branded and thus run Dell firmware. Dell recently released firmware version 26.39.10.02 for them, which fixed our issues.

We experienced an identical issue in our environment as well. Confirming that upgrading to firmware version 26.39.10.2 fixed the issue immediately.

Unfortunately, upgrading to latest firmware on our X-5 EN and X-6 DX cards hasn’t solved the issue. We just received another timeout on a server that was upgraded ~1 week ago.

I thought we were the only ones having this problem until I came across this thread. @markos, were you ever able to find a solution? Normally this wouldn't be a big deal, but in this instance Hyper-V doesn't seem to want to do its job, and when we get these disconnects it really makes a mess of the affected host.

We're currently running Dell-branded ConnectX-6 Dx cards with the latest firmware (22.41.10.00) in Dell R740xd servers with Windows Server 2022 (all the latest drivers and firmware as of today, 3/28/2025) in a failover cluster with Hyper-V. The issue seems to be either time- or bandwidth-related, and really only shows up when doing Live Migrations. If we wait long enough between Live Migrations (generally a week or two) and then pause/drain a host, the receiving host experiences the disconnects described in @markos' OP.

We are aware of WinOF-2 driver known issue #1336097 and have tried the fix described, but it didn't work for us; we tried setting it to 30. That said, there is zero guidance we can find on what the "Max OID time" default value is, so we don't know how to properly set the CheckForHangTOInSeconds registry key value.

We have 2 Hyper-V clusters at different locations in this situation. Right now, we just drain/pause and reboot each host roughly once a week to avoid the problem. If we wait longer, then the disconnects happen and it creates quite the mess. If anyone in this thread has found a solution and just not updated it here, it would be helpful to know what was done.

Thank you.

Hi banduraj, unfortunately we haven’t found any solution. Dell support was useless and we’re still getting timeouts randomly during normal operation - it might be connected to some sort of cpu/network load, but we haven’t identified the cause. For now we’re monitoring these events and when it happens, we run a DR plan and migrate VMs off the host.
For reference, we’re now on FW 22.41.10.00 (same as you) and driver 24.04.03, which are both the latest available.

Thanks for the response. Sadly, Dell support has been completely useless for us as well. We actually switched to these Mellanox/Nvidia cards because we were having different issues with our previous Qlogic cards. We thought that Mellanox/Nvidia were the market leaders in this space, but I guess they all have major issues.

If we come across a solution, I’ll be sure to report back here.

@markos can you confirm or deny that you are using Hyper-V replication in your setup that sees these issues? I have had someone tell me that this is associated with the problem and disabling replication could resolve the issue.

Thank you.

No, we’re not using replication at all.

Hey @markos, I have come across some additional information I wanted to pass along in hopes it may help.

While reviewing the logs after a recent event where we had additional disconnects, I found some event messages I had previously missed because they were marked as Informational; generally, I only look at Warning, Error, and Critical events. That said, on every host across both clusters where we have seen the disconnects, we always get the following entries in the System event log:

Warning, 10400, NDIS, The network interface “Mellanox ConnectX-6 Dx Adapter” has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets. Reason: The network driver did not respond to an OID request in a timely fashion. This network interface has reset 1 time(s) since it was last initialized.

Warning, 45, mlx5, NDIS initiates reset on device ConnectX-6 Dx.

Information, 362, mlx5, ConnectX-6 Dx: OID Statistics:
Last OID: 0x10204, took: 9434910 micro second.
Most time consuming OID 0x10204, took: 23825807 micro second.

I used this PowerShell command I put together to check the hosts remotely:

Get-EventLog -ComputerName host1,host2,host3 -LogName System -Source @("Microsoft-Windows-NDIS","mlx5") -InstanceId @(10400, 2147942445, 1074200938) -ErrorAction SilentlyContinue | Select-Object MachineName,TimeGenerated,EntryType,EventID,Source,Message | Out-GridView

Where host1,host2,host3 is a list of Hyper-V hosts to query. Adjust accordingly, or remove the -ComputerName parameter and run locally on the hosts.

This last (Information, 362, mlx5) entry is the one I had missed before. It provides information I don't believe we had previously, and it would be helpful to know if you are seeing these events as well. If so, what information do you see in the message?

For us, the Last OID is always 0x10204 and so far, the Most time consuming OID has also been 0x10204. Additionally, across all events across both clusters, the longest running OID time has been 65927558 microseconds.
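To pull those "took: N micro second" numbers out of the 362 events programmatically rather than by eye, something like the following sketch should work; the regex is an assumption based on the exact message format quoted above, so adjust it if your driver version words the event differently.

```powershell
# Extract OID timing values from mlx5 event 362 messages and report the worst case.
# Assumes the "took: <N> micro second" wording shown in the events quoted above.
$times = Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'mlx5'; Id = 362 } -ErrorAction SilentlyContinue |
    ForEach-Object {
        [regex]::Matches($_.Message, 'took:\s*(\d+)\s*micro') |
            ForEach-Object { [long]$_.Groups[1].Value }
    }
if ($times) {
    $max = ($times | Measure-Object -Maximum).Maximum
    'Worst OID time: {0} microseconds (~{1:N1} s)' -f $max, ($max / 1e6)
}
```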

As we both know from Known Issues - NVIDIA Docs issues 2683075 and 1336097, we need to adjust the CheckForHangTOInSeconds key defined in Configuring the Driver Registry Keys - NVIDIA Docs so that (2 × CheckForHangTOInSeconds) > Max OID time. In my case, 65927558 microseconds is the longest Max OID time we have seen, which is 65.93 seconds, rounded up to 66 seconds. Strictly, the inequality only requires CheckForHangTOInSeconds > 33, but to leave plenty of headroom I took (2 × 66) = 132. With that information, I believe that (for us, at least) we would need to set CheckForHangTOInSeconds to 132 to attempt to alleviate the disconnect issues we have been seeing.
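The arithmetic above, as a one-liner sketch in case anyone wants to plug in their own worst observed value (the 65927558 figure is just our measurement):

```powershell
# Derive a CheckForHangTOInSeconds candidate from the worst observed OID time.
$maxOidMicroseconds = 65927558                                  # our longest "took:" value
$maxOidSeconds = [math]::Ceiling($maxOidMicroseconds / 1e6)     # round up to whole seconds
$candidate = 2 * $maxOidSeconds                                 # doubled for headroom
"Max OID time: $maxOidSeconds s -> CheckForHangTOInSeconds candidate: $candidate"
```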

Another thing I want to point out. As mentioned, in all 362, mlx5 entries, the OID mentioned has been 0x10204 for us. My research shows this to be OID_GEN_RECEIVE_SCALE_PARAMETERS as identified in the NDIS OID defs mentioned here winsdk-10/Include/10.0.14393.0/shared/ntddndis.h at master · tpn/winsdk-10 · GitHub. OID_GEN_RECEIVE_SCALE_PARAMETERS is related to RSS as part of NDIS v6+ as mentioned here Receive Side Scaling Version 2 (RSSv2) - Windows drivers | Microsoft Learn.

You can verify your NDIS version using this PowerShell command from one of the hosts:

Get-NetAdapter | Select-Object Name, NdisVersion

We’re currently at NDIS v6.85, which should be the same for you on Server 2022 with these NICs and latest drivers.

With all that information, that presents a couple of options:

  1. Change the CheckForHangTOInSeconds to 132 (higher or lower, depending on the information from the 362, mlx5 events. It will be different for you)
  2. Disable RSS (likely using Disable-NetAdapterRss)

So far, we haven’t tried either of these options yet. Both present their own set of problems. The CheckForHangTOInSeconds setting has a default of 4. So, changing it to 132 seems extremely high. There is no guidance that I can find on what an acceptable range is, so I’m mildly afraid to go there just yet. And, disabling RSS could be troublesome since we make use of RDMA using RoCEv2, and the performance penalty could also be too high.
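If anyone does want to experiment with option 2, a cautious sketch is below. We have not run this ourselves; it just captures the current RSS state first so the change is reversible, and the adapter name is a placeholder you would replace with your own.

```powershell
# Before touching RSS, record the current per-adapter state so it can be restored.
Get-NetAdapterRss | Format-List Name, Enabled, Profile, NumberOfReceiveQueues

# Option 2 from the list above: disable RSS on a specific adapter.
# 'Ethernet 1' is a placeholder name; reversible with Enable-NetAdapterRss.
# Note the caveat above: this may hurt throughput, especially with RoCEv2/RDMA.
# Disable-NetAdapterRss -Name 'Ethernet 1'
# Enable-NetAdapterRss  -Name 'Ethernet 1'
```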

We recently updated our drivers and firmware to the latest versions, released late last month. If this issue comes back, Dell will be escalating it to Microsoft. Considering what I have found, I am starting to wonder if this isn't a problem in Server 2022 itself, specifically with RSS. It may be that the NIC issues we're seeing are just the driver's way of handling something Microsoft should be completing in a timely fashion, but clearly isn't.

Again, let me know if you guys see those same 362, mlx5 events and what information they contain. Also, please let me know if any of this is helpful, what actions you guys decide to take, if any, and how it goes. I'll continue to update here with whatever new information I come across and how things go with Dell and/or Microsoft.

FYI, I just learned you can use the following command to view OID statistics for your ConnectX cards. This shows what I explained previously, and also that OID_GEN_RECEIVE_SCALE_PARAMETERS frequently tops the list (at least for us) for Max Time and Average Time.

Mlx5Cmd.exe -OidStats

Writing to a file and viewing that file makes it easier to read.

Mlx5Cmd.exe -OidStats > c:\oid_stats.txt

@markos Just so you know, my previous post sent me down the path of researching RSS and RSSv2 in Windows Server 2022. As far as MS documentation is concerned, it's on by default in 2022. As far as Nvidia is concerned, it's not: you must set a registry key to enable it. See my post here: ConnectX-6 Dx with RSSv2

With that in mind, I enabled this key for all ConnectX-6 Dx NICs on 4 of the 5 Hyper-V cluster nodes in one of our clusters.
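For reference, this is roughly how I applied it. The key name (RssV2) and its location under the adapter's class subkey are assumptions based on the linked post and on where the other per-adapter WinOF-2 keys live; confirm both against the NVIDIA docs for your driver version before running anything like this.

```powershell
# Hedged sketch: set the RssV2 DWORD on each ConnectX adapter's class subkey.
# Key name and location are assumptions (see lead-in); verify before use.
$class = 'HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}'
Get-ChildItem $class -ErrorAction SilentlyContinue |
    Where-Object { (Get-ItemProperty $_.PSPath -ErrorAction SilentlyContinue).DriverDesc -like '*ConnectX*' } |
    ForEach-Object { Set-ItemProperty $_.PSPath -Name RssV2 -Value 1 -Type DWord }
# Reboot (or restart the adapters) afterwards, as described above.
```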

This is what running Mlx5Cmd.exe -OidStats looks like on our Hyper-V node 1 after enabling the RssV2 registry key for both NICs and rebooting:

This is what running Mlx5Cmd.exe -OidStats looks like on our Hyper-V node 2 without enabling the RssV2 registry key for both NICs:

This may not fix the issue, but I’m extremely hopeful. If this doesn’t work, turning off RSS entirely may be the next step. Generally, the NIC resets show up for us about 2 weeks or so after our last reboot. In 2 weeks, I’ll be testing again to see if the problem still exists. I would expect it to exist on node 5 of our cluster, and the other 4 nodes to not have the problem.

I’ll report back in 2 weeks after I see how this goes.

Hi @banduraj, not ignoring you, just swamped with work. We’re still getting a reset like every couple weeks, just dealing with more pressing iDRAC-related issues :)
Re the OID stats: I looked at those back when I created this thread, and we're also getting 0x10204. As you've noticed yourself, some of the maximums are very large, and I was reluctant to set such high values without guidance from NVIDIA/Mellanox.

Nice find regarding the 0x10204, I haven’t dug that deep. We’re also on 6.85 NDIS version.

Re disabling RSS - we’re also not keen on that, as it’s part of our performance optimizations and it’s hard to get statistically impactful data, since the disconnects happen fairly randomly and infrequently.

Let me know if enabling RSSv2 has any impact on your environment.

Testing this past weekend did not go as I had hoped. The cluster nodes that have RssV2 turned on still experienced the NIC resets. Only this time, the last OID reported was 0x010214, which is OID_GEN_RECEIVE_SCALE_PARAMETERS_V2. Which is unfortunate.

I ended up leaving RssV2 on and enabled it on the 5th cluster node as well. It is a performance improvement. I also went ahead and set CheckForHangTOInSeconds to 132. As I had mentioned previously, that was the time I figured based on the longest running OID.

Will see what happens in a couple weeks again.

As mentioned before, two weeks ago I changed CheckForHangTOInSeconds to 132 on all 5 nodes of the cluster I have been testing on, and this morning I tried moving everything around and had no problems. Normally, the NICs would reset and everything would melt down after 2 weeks without host reboots.

I am not rebooting the hosts and will wait another 2 weeks to see what happens.

As it stands now, setting CheckForHangTOInSeconds to what I think is a high number doesn't seem to have any additional downsides, and the upside is no NIC resets.

Will report back in 2 weeks.

Hey, do you happen to have an update for this issue? I am experiencing a similar issue with my 8 node Azure Local cluster. Getting 10400’s and then the cluster is unstable. I have a ticket open with MS but they are still looking into it.

Sadly, there isn't a good solution to this problem. I have been back and forth with Dell on getting it solved, but the best I have as of now is a partial workaround. This issue only affects us when we do Live Migrations, so we only get the NIC resets and subsequent disconnects at that point, after the servers/hosts have been running for some days.

Dell's "solution" was to set the CheckForHangTOInSeconds registry key to a number high enough that the long-running OID times don't trigger the resets. In the end, we set CheckForHangTOInSeconds to 600. This stops the NICs from resetting when OID_GEN_RECEIVE_SCALE_PARAMETERS runs for a really long time. The downside is that whichever VMs get stuck in the Live Migration process waiting on that OID can now get restarted by the cluster for hanging too long while waiting to migrate. That's not good, but it's better than the cluster puking.

I explained this to Dell, who say that Nvidia told them this bug won't be fixed. They have now started working with MS to attack this as a cluster issue.

I am still sending logs and whatnot to Dell to see if we can get any other solution in place, but it's likely this problem will never see a good solution. We are buying all new servers with all new NICs, and the plan is to resolve it by switching to another adapter brand that hopefully doesn't have this bug. We need new servers anyway, since the current generation doesn't officially support Windows Server 2025.

If you've read through this thread and the issue you're seeing matches these problems exactly, you can try setting that CheckForHangTOInSeconds registry value to 600 and see if it helps.
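For completeness, here is a sketch of applying that 600 value across all ConnectX adapters on a host. Same caveats as earlier in the thread: the class-key location is an assumption based on where per-adapter NDIS keys normally live, so verify each DriverDesc before writing, and restart the adapters (or reboot) afterwards.

```powershell
# Apply the Dell-suggested workaround (CheckForHangTOInSeconds = 600) to every
# ConnectX adapter subkey. Sketch only; verify DriverDesc matches before use.
$class = 'HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}'
Get-ChildItem $class -ErrorAction SilentlyContinue | ForEach-Object {
    $p = Get-ItemProperty $_.PSPath -ErrorAction SilentlyContinue
    if ($p.DriverDesc -like '*ConnectX*') {
        Set-ItemProperty $_.PSPath -Name CheckForHangTOInSeconds -Value 600 -Type DWord
        '{0}: set CheckForHangTOInSeconds=600' -f $p.DriverDesc
    }
}
```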