RDMA latency stability and PCIe version recognition

We are testing RDMA performance using a ConnectX-5 25G adapter on Windows 10 Enterprise.
We measure the latency of the NetworkDirect SPI write operation with a data length of 1024 bytes.
ref: GitHub - microsoft/NetworkDirect: NetworkDirect Service Provider Interface
Both the server and the client run on the same PC.
The average latency is around 50 microseconds, but sometimes the latency exceeds 100 ms.
My questions are:

  1. Why do these high-latency samples occur, and are there any settings or configurations that avoid them?

  2. During the test we found that the PCIe version reported by “mlx5cmd -stat” changes after the PC is restarted. The PCIe hardware version is 2, but sometimes the command shows PCIe Gen1, and the speed is also affected. Why does this happen, and how can it be avoided?

ref: test data and PCIe version:


Hello Ramadevi,

Welcome, and thank you for posting your inquiry to the NVIDIA community!

Because the PCIe link speed varies across reboots, the integrity of the PCIe link is called into question. When the PCIe link is unable to train at the optimal link speed/width, three scenarios are most likely:

a) The adapter is not seated properly.
b) There’s a hardware issue with the adapter.
c) There’s a hardware issue with the slot (mainboard).

As the integrity of the PCIe link itself is unknown, this needs to be rectified before performance tuning / troubleshooting can be performed.
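
To illustrate why the negotiated link generation matters for a 25G port, here is a rough sketch of the arithmetic. The per-lane rates and encodings are the standard PCIe figures; the x8 link width is an assumption for illustration only (use the width your system actually reports), and the function name is hypothetical.

```python
# Per-lane raw signalling rate in GT/s and line-encoding efficiency
# for each PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),     # 8b/10b encoding
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def pcie_data_rate_gbps(gen: int, lanes: int) -> float:
    """Approximate one-direction data rate in Gb/s.

    This accounts for line encoding only; TLP/DLLP protocol overhead
    reduces the usable figure further.
    """
    rate_gt_s, efficiency = PCIE_GENERATIONS[gen]
    return rate_gt_s * efficiency * lanes


if __name__ == "__main__":
    port_gbps = 25.0  # line rate of a single 25G port
    lanes = 8         # ASSUMPTION: x8 link, not confirmed in this thread
    for gen in (1, 2, 3):
        rate = pcie_data_rate_gbps(gen, lanes)
        verdict = "at or above" if rate >= port_gbps else "below"
        print(f"Gen{gen} x{lanes}: {rate:.1f} Gb/s ({verdict} the 25G line rate)")
```

Under that x8 assumption, a link that trains at Gen1 carries roughly 16 Gb/s after encoding, which is below the port's line rate, whereas Gen2 carries roughly 32 Gb/s. That would be consistent with the speed being affected when the link comes up at Gen1.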

If reseating the adapter does not resolve the sporadic link speed degradation, a swap to another slot is recommended.

If the same behavior is encountered in another slot, a swap with a known-good adapter is recommended.

If the same behavior is encountered on this system with a known-good adapter and/or in a different slot, then we recommend engaging your hardware vendor to assess next steps with regard to the mainboard hardware.

Once the PCIe link is validated, we have several tuning recommendations in the ‘Troubleshooting’ section of the WinOF-2 User Manual >> https://docs.nvidia.com/networking/display/winof2v320/Troubleshooting . Relevant sections here would be ‘Ethernet Related Troubleshooting’ and ‘Performance Related Troubleshooting’.
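
When re-running the latency test after tuning, it may help to report the tail of the distribution rather than only the average, since a handful of spikes barely moves the mean. The following is a minimal sketch, not part of WinOF-2 or NetworkDirect; the function name, the 1000-microsecond spike threshold, and the sample values are all hypothetical.

```python
import statistics


def summarize_latency(samples_us, spike_threshold_us=1000.0):
    """Summarize latency samples given in microseconds.

    Returns the median, the nearest-rank 99th percentile, the maximum,
    and the positions of samples above the spike threshold.
    """
    if not samples_us:
        raise ValueError("no latency samples")
    ordered = sorted(samples_us)
    n = len(ordered)
    # Nearest-rank p99: 1-based rank ceil(0.99 * n), in integer arithmetic.
    rank = (99 * n + 99) // 100
    spikes = [i for i, v in enumerate(samples_us) if v > spike_threshold_us]
    return {
        "median_us": statistics.median(ordered),
        "p99_us": ordered[rank - 1],
        "max_us": ordered[-1],
        "spike_indices": spikes,
    }


if __name__ == "__main__":
    # Synthetic illustration only: 99 samples at 50 us and one 100 ms spike.
    samples = [50.0] * 99 + [100_000.0]
    print(summarize_latency(samples))
```

Recording the indices of the spikes makes it easier to check whether they cluster at particular times, which is useful information to include in a support ticket.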

If you are unable to achieve stable performance after these steps have been followed, and you have valid support entitlement, we recommend opening a support ticket with our Enterprise Support team via the NVIDIA Enterprise Experience Support Portal: https://enterprise-support.nvidia.com/s/create-case . Our engineers will be able to assist you with determining the root cause of this degradation.

Thanks, and best regards,
NVIDIA Enterprise Experience
