Hello,
I’m currently attempting to implement a basic RoCE v1 configuration between two identical servers that will eventually be used for a storage cluster.
Currently my setup is as follows:
Server1 - Windows Server 2019 - Mellanox/NVIDIA ConnectX-5 NIC (model MCX512F) - driver 24.7.26520.0 - firmware 16.35.3006
Server2 - Windows Server 2019 - Mellanox/NVIDIA ConnectX-5 NIC (model MCX512F) - driver 24.7.26520.0 - firmware 16.35.3006
These two servers connect over SFP28 fiber links to a switch that has a corresponding RoCE configuration in place.
Each adapter is IP’d as follows:
Server1:
X5P1 - 10.11.15.15
X5P2 - 10.11.16.15
Server2:
X5P1 - 10.11.15.16
X5P2 - 10.11.16.16
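For completeness, all ports report as RDMA-capable on both hosts; a quick sanity check along these lines (using my local adapter names) comes back clean on each server:

    # Confirm the NICs report RDMA as enabled
    Get-NetAdapterRdma -Name "X5P1","X5P2"

    # Confirm SMB sees the interfaces as RDMA-capable on both sides
    Get-SmbClientNetworkInterface | Where-Object RdmaCapable
    Get-SmbServerNetworkInterface | Where-Object RdmaCapable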
I’ve run all of the required NetQosPolicy PowerShell configuration based on what I believe our eventual RDMA needs will be (roughly sketched below).
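For context, the QoS/PFC side was set up roughly as follows; the priority value (3) and the bandwidth reservation are my own choices based on common SMB Direct guidance, not anything mandated by the hardware:

    # Tag SMB Direct (port 445) traffic with 802.1p priority 3
    New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

    # Enable PFC for priority 3 only; keep flow control off for the rest
    Enable-NetQosFlowControl -Priority 3
    Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

    # Reserve bandwidth for the SMB traffic class and turn on DCB/QoS per port
    New-NetQosTrafficClass "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
    Enable-NetAdapterQos -Name "X5P1","X5P2"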
I’m able to successfully test RDMA/RoCE traffic with the Microsoft-provided Test-Rdma.ps1 script, which moves all of its test workloads without any issues. I’m also able to copy large files between the two servers at great speed.
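The test invocation was along these lines (the interface index and diskspd path below are placeholders; 10.11.15.16 is Server2's first port):

    # Microsoft's Test-Rdma.ps1, run from Server1 against Server2
    .\Test-Rdma.ps1 -IfIndex 12 -IsRoCE $true -RemoteIpAddress 10.11.15.16 -PathToDiskspd C:\Tools\Diskspd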
One item that keeps recurring in my event logs after a transfer completes is the following:
"RDMA connection disconnected.
Transport name: \Device\RdmaSmbIpv4_10.11.15.16
Milliseconds spent closing the connection: 0
Guidance:
Closing an RDMA connection should not take longer than 2 minutes. An RDMA IO that takes an abnormally long time to complete indicates a problem with the RDMA network adapters on this computer or its remote host. Contact your RDMA vendor for an updated driver and further troubleshooting."
The event ID is 1043 and the event source is SMBServer.
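In case anyone wants to compare, I’m pulling these events with the query below; I’m assuming the Microsoft-Windows-SMBServer/Connectivity channel is the right one, since that’s where 1043 shows up for me:

    # Dump the RDMA disconnect events recorded after a transfer
    Get-WinEvent -FilterHashtable @{
        LogName = 'Microsoft-Windows-SMBServer/Connectivity'
        Id      = 1043
    } | Format-List TimeCreated, Message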
My question is this: is this message expected, or does it indicate a problem with my RoCE configuration? Has anyone else run into this in similar situations or configurations?
Any guidance here would be appreciated.