Hey @markos, I have come across some additional information I wanted to pass along in hopes it may help.
While reviewing the logs after a recent event where we had additional disconnects, I found some event messages I had previously missed because they were marked as Informational. Generally, I only look at Warning, Error and Critical events. That said, on every host across both clusters where we have seen the disconnects, we always get the following entries in the System Event log:
Warning, 10400, NDIS, The network interface “Mellanox ConnectX-6 Dx Adapter” has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets. Reason: The network driver did not respond to an OID request in a timely fashion. This network interface has reset 1 time(s) since it was last initialized.
Warning, 45, mlx5, NDIS initiates reset on device ConnectX-6 Dx.
Information, 362, mlx5, ConnectX-6 Dx: OID Statistics:
Last OID: 0x10204, took: 9434910 micro second.
Most time consuming OID 0x10204, took: 23825807 micro second.
I used this PowerShell command I put together to check the hosts remotely:
Get-EventLog -ComputerName host1,host2,host3 -LogName System -Source @("Microsoft-Windows-NDIS","mlx5") -InstanceId @(10400, 2147942445, 1074200938) -ErrorAction SilentlyContinue | Select-Object MachineName,TimeGenerated,EntryType,EventID,Source,Message | Out-GridView
Where host1,host2,host3 is a list of Hyper-V hosts to query. Adjust accordingly, or remove the -ComputerName parameter and run locally on the hosts.
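In case the two large -InstanceId values in the command above look arbitrary: as far as I understand it, they are the full 32-bit event identifiers, with severity in the top two bits, a facility in bits 27-16, and the event code that Event Viewer shows in the low 16 bits. 10400 is small enough that it is already just the event code. Here is a rough sketch of the decoding. ConvertFrom-EventInstanceId is a name I made up for illustration, not a built-in cmdlet.

```powershell
# Hypothetical helper (not a built-in cmdlet): splits a 32-bit event
# identifier into its severity, facility, and event code fields.
function ConvertFrom-EventInstanceId {
    param([Parameter(Mandatory)][long]$InstanceId)

    $severityNames = @('Success', 'Informational', 'Warning', 'Error')
    [pscustomobject]@{
        InstanceId = $InstanceId
        Severity   = $severityNames[($InstanceId -shr 30) -band 0x3]  # bits 31-30
        Facility   = ($InstanceId -shr 16) -band 0xFFF                # bits 27-16
        EventCode  = $InstanceId -band 0xFFFF                         # bits 15-0
    }
}

# The two mlx5 values passed to -InstanceId in the command above
2147942445, 1074200938 | ForEach-Object { ConvertFrom-EventInstanceId $_ }
```

This is how I line up 2147942445 with the Warning 45 entry and 1074200938 with the Information 362 entry.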
The last entry (Information, 362, mlx5) is the one I previously missed and only noticed recently. It provides information I don’t believe we had before, so it would be helpful to know whether you are seeing these as well and, if so, what the message contains.
For us, the Last OID is always 0x10204 and so far, the Most time consuming OID has also been 0x10204. Additionally, across all events across both clusters, the longest running OID time has been 65927558 microseconds.
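If it helps with comparing notes, here is a minimal sketch of how the OID and timing values could be pulled out of the 362 message text. It assumes the message wording matches the entries quoted above. Get-OidStatistic is a hypothetical helper name, and the sample text is just the message from my post.

```powershell
# Hypothetical helper: pulls the OID numbers and durations out of the text of
# an mlx5 event 362 message, assuming the wording shown in the entries above.
function Get-OidStatistic {
    param([Parameter(Mandatory)][string]$Message)

    $pattern = '(?<Kind>Last|Most time consuming) OID:? (?<Oid>0x[0-9A-Fa-f]+), took: (?<Micros>\d+) micro ?seconds?'
    foreach ($m in [regex]::Matches($Message, $pattern)) {
        $micros = [long]$m.Groups['Micros'].Value
        [pscustomobject]@{
            Kind         = $m.Groups['Kind'].Value
            Oid          = $m.Groups['Oid'].Value
            Microseconds = $micros
            Seconds      = [math]::Round($micros / 1e6, 2)
        }
    }
}

# Sample message copied from the event quoted earlier in this post
$sample = @'
ConnectX-6 Dx: OID Statistics:
Last OID: 0x10204, took: 9434910 micro second.
Most time consuming OID 0x10204, took: 23825807 micro second.
'@

Get-OidStatistic -Message $sample
```

Against real events, the Message property of each 362 entry returned by the Get-EventLog command could be passed to this function and the results sorted by Microseconds to find the longest-running OID.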
As we both know from Known Issues - NVIDIA Docs issues 2683075 and 1336097, we need to adjust the CheckForHangTOInSeconds key as defined in Configuring the Driver Registry Keys - NVIDIA Docs to (2 * CheckForHangTOInSeconds > Max OID time). In my case, 65927558 microseconds is the longest Max OID time we have seen. So, 65927558 microseconds is equal to 65.93 seconds, rounded up to 66 seconds. This gives us (2 * 66) = 132. With that information, I believe that (for us, at least) we would need to set the CheckForHangTOInSeconds to 132 to attempt to alleviate the disconnect issues we have been seeing.
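One caveat on the arithmetic: read literally, (2 * CheckForHangTOInSeconds > Max OID time) is satisfied by any value above half the max OID time, so 132 satisfies it with a lot of headroom but is not the smallest value that does. The sketch below works out that lower bound. This is only my reading of the formula, not something NVIDIA has confirmed, and Get-MinCheckForHangTimeout is a hypothetical helper name.

```powershell
# Hypothetical helper: smallest whole-second value T that satisfies
# 2 * T > max OID time, with the max OID time given in microseconds.
function Get-MinCheckForHangTimeout {
    param([Parameter(Mandatory)][long]$MaxOidMicroseconds)

    $maxOidSeconds = $MaxOidMicroseconds / 1e6
    # The inequality is strict, so take floor(half) + 1 rather than a ceiling
    [int]([math]::Floor($maxOidSeconds / 2) + 1)
}

# Longest OID time we have seen so far across both clusters
Get-MinCheckForHangTimeout -MaxOidMicroseconds 65927558   # returns 33
```

Anything from that lower bound up to 132 satisfies the inequality for our worst case so far; how much margin to add on top is a judgment call I don’t have guidance for.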
Another thing I want to point out: as mentioned, in all 362, mlx5 entries, the OID has been 0x10204 for us. My research shows this to be OID_GEN_RECEIVE_SCALE_PARAMETERS, as identified in the NDIS OID definitions here: winsdk-10/Include/10.0.14393.0/shared/ntddndis.h at master · tpn/winsdk-10 · GitHub. OID_GEN_RECEIVE_SCALE_PARAMETERS is related to RSS as part of NDIS v6+, as mentioned here: Receive Side Scaling Version 2 (RSSv2) - Windows drivers | Microsoft Learn.
You can verify your NDIS version using this PowerShell command from one of the hosts:
Get-NetAdapter | Select-Object Name, NdisVersion
We’re currently at NDIS v6.85, which should be the same for you on Server 2022 with these NICs and latest drivers.
With all that information, that presents a couple of options:
- Change the CheckForHangTOInSeconds to 132 (higher or lower, depending on the information from the 362, mlx5 events. It will be different for you)
- Disable RSS (likely using Disable-NetAdapterRss)
So far, we haven’t tried either of these options yet. Both present their own set of problems. The CheckForHangTOInSeconds setting has a default of 4. So, changing it to 132 seems extremely high. There is no guidance that I can find on what an acceptable range is, so I’m mildly afraid to go there just yet. And, disabling RSS could be troublesome since we make use of RDMA using RoCEv2, and the performance penalty could also be too high.
We recently updated our drivers and firmware to the latest versions, released late last month. If this issue comes back, Dell will be escalating it to Microsoft. Considering what I have found, I am starting to wonder if this isn’t a problem in Server 2022, specifically with RSS. It may just be that the NIC issues we’re seeing are a result of how the driver handles something Microsoft should be completing in a timely fashion, but clearly isn’t.
Again, let me know if you guys see those same 362, mlx5 events and what information they contain. Also, please let me know if any of this is helpful, what actions you guys decide to take, if any, and how it goes. I’ll continue to update here with whatever new information I come across and how things go with Dell and/or Microsoft.